Project 10: Parallel Directory Traversal

Build a fast, parallel directory walker that respects ignore rules, detects binaries, and feeds file paths to a search engine.

Quick Reference

Attribute Value
Difficulty Level 3: Advanced
Time Estimate 2 weeks
Main Programming Language Rust (Alternatives: C++, Go)
Alternative Programming Languages C++, Go
Coolness Level Level 3: Clever
Business Potential Level 3: Infra / Dev tools
Prerequisites concurrency basics, filesystem APIs
Key Topics work queues, ignore rules, file filtering, traversal order

1. Learning Objectives

By completing this project, you will:

  1. Implement a parallel directory walker with a work queue.
  2. Respect .gitignore and custom ignore patterns.
  3. Detect binary files and skip them by default.
  4. Balance work across threads without contention.
  5. Produce deterministic traversal order for testing.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Parallel File System Traversal

Fundamentals

Directory traversal is the process of walking a directory tree and discovering files. A naive traversal is depth-first or breadth-first, single-threaded. For large repositories, this becomes a bottleneck: the search engine may be fast, but it starves for work because path discovery is slow. Parallel traversal uses a shared work queue: each worker thread pops a directory path, enumerates its entries, and pushes subdirectories back into the queue while emitting file paths to a downstream channel.

Correctness also requires filtering: ignore rules, hidden files, file types, and binary detection. These filters reduce work and improve throughput. A good traversal system is not just fast, it is precise about what it includes.

Deep Dive into the concept

The work queue is the core concurrency primitive. It can be implemented as a mutex-protected queue, a lock-free queue, or a work-stealing deque per worker. Work stealing improves scalability because each worker primarily works locally but can steal from others when idle. Rust’s crossbeam deque and Go’s worker pools are typical choices.

Ignore rules add complexity: .gitignore patterns apply hierarchically, and some patterns can re-include files. Your walker must maintain a stack of ignore matchers as it descends directories. Binary detection is a pragmatic heuristic: read a small chunk and classify as binary if it contains NUL bytes. This should be fast and optional, because some files are binary but still contain text data.

Traversal order affects determinism. For testing, you may want lexicographic ordering of emitted paths. For performance, you may emit as soon as discovered. You can support both by buffering outputs when --deterministic is enabled.

How this fits on projects

This walker is a core component of the final search tool (P15) and influences I/O performance in P11.

Definitions & key terms

  • work queue -> shared list of directories to process
  • work stealing -> idle workers steal tasks from others
  • ignore rules -> patterns that exclude or include paths
  • binary detection -> heuristic to skip non-text files

Mental model diagram (ASCII)

Queue: [dirA, dirB]
Worker1 -> dirA -> push dirA/sub1, emit files
Worker2 -> dirB -> push dirB/sub2, emit files

How it works (step-by-step)

  1. Seed queue with root directory.
  2. Workers pop directories and enumerate entries.
  3. Apply ignore rules and filters.
  4. Push subdirectories to queue and emit files.
  5. Continue until queue empty and workers idle.

Minimal concrete example

while let Some(dir) = queue.pop() {
    for entry in read_dir(dir) { /* filter + enqueue */ }
}

Common misconceptions

  • Misconception: Traversal is always I/O bound. Correction: Ignore matching and metadata calls can be CPU heavy.
  • Misconception: Deterministic order is free. Correction: It requires sorting and buffering, which costs time.

Check-your-understanding questions

  1. Why does work stealing scale better than a single queue?
  2. How do .gitignore rules propagate into subdirectories?
  3. Why skip binary files by default?

Check-your-understanding answers

  1. It reduces contention and keeps workers busy.
  2. Rules apply hierarchically; each directory adds patterns.
  3. Binary files rarely contain meaningful text matches and are expensive to scan.

Real-world applications

  • ripgrep, fd, ag, find replacements
  • build systems and code indexing

Where you’ll apply it

References

  • ripgrep ignore crate
  • bfs directory traversal blog (tavianator)

Key insights

Traversal speed often dominates search speed; parallelism and ignore rules are critical.

Summary

A parallel walker feeds your search engine efficiently while respecting user expectations about ignored files.

Homework/Exercises to practice the concept

  1. Implement a single-threaded walker and benchmark it.
  2. Add a second thread and measure throughput.

Solutions to the homework/exercises

  1. Use recursive DFS and collect file paths.
  2. Introduce a queue and split work across threads.

3. Project Specification

3.1 What You Will Build

A CLI tool parwalk that recursively walks directories in parallel, applies ignore rules, detects binaries, and outputs file paths to stdout or a JSON list.

3.2 Functional Requirements

  1. Parallel traversal: configurable worker count.
  2. Ignore rules: support .gitignore and custom ignore file.
  3. Hidden files: skip by default, include with --hidden.
  4. Binary detection: skip binary by default, include with --binary.
  5. Deterministic mode: optional sorted output.

3.3 Non-Functional Requirements

  • Performance: faster than find on large repos.
  • Reliability: handle permission errors gracefully.
  • Usability: clear stats (files visited, skipped).

3.4 Example Usage / Output

$ ./parwalk --threads 8 --hidden ~/code
/home/user/code/README.md
/home/user/code/src/main.rs

3.5 Data Formats / Schemas / Protocols

JSON output example:

{"files":["/path/a","/path/b"],"skipped":120}

3.6 Edge Cases

  • Symlink loops
  • Permission denied directories
  • Very deep directory trees

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

cargo run -- --threads 8 --hidden ~/code

3.7.2 Golden Path Demo (Deterministic)

Use --deterministic to produce sorted output for tests.

3.7.3 CLI Transcript (Success + Failure)

$ ./parwalk --threads 4 fixtures/tree
fixtures/tree/a.txt
fixtures/tree/sub/b.txt
exit code: 0

$ ./parwalk /root/secret
warning: permission denied: /root/secret
exit code: 0

4. Solution Architecture

4.1 High-Level Design

+-----------+   +-----------+   +------------+   +-----------+
| CLI Parse |-->| Work Queue|-->| Dir Worker |-->| Output    |
+-----------+   +-----------+   +------------+   +-----------+

4.2 Key Components

| Component | Responsibility | Key Decisions | |———–|—————-|—————| | Queue | distribute directories | work-stealing deque | | Ignore engine | match patterns | per-dir stack | | Binary detector | sample bytes | NUL heuristic | | Output | emit paths | optional sorting |

4.3 Data Structures (No Full Code)

struct Task { path: PathBuf, ignore: IgnoreState }

4.4 Algorithm Overview

  1. Seed queue with root.
  2. Workers pop tasks, read directory entries.
  3. Filter entries with ignore rules.
  4. Enqueue subdirectories, emit files.

Complexity Analysis

  • Time: O(number of entries)
  • Space: O(depth of traversal + queue size)

5. Implementation Guide

5.1 Development Environment Setup

cargo build --release

5.2 Project Structure

parwalk/
├── src/
│   ├── main.rs
│   ├── queue.rs
│   ├── ignore.rs
│   ├── binary.rs
│   └── output.rs
└── fixtures/

5.3 The Core Question You’re Answering

“How do I traverse huge directory trees fast without scanning junk files?”

5.4 Concepts You Must Understand First

  1. Work queues and worker pools.
  2. Hierarchical ignore rules.
  3. Binary detection heuristics.

5.5 Questions to Guide Your Design

  1. How will you prevent symlink loops?
  2. How will you handle permission errors without stopping?
  3. How will you keep output deterministic for tests?

5.6 Thinking Exercise

What happens if two workers scan the same directory? How do you prevent it?

5.7 The Interview Questions They’ll Ask

  1. Why is .gitignore handling tricky?
  2. How do you prevent contention in a parallel walker?

5.8 Hints in Layers

Hint 1: Start with a single-threaded walker. Hint 2: Add a queue with multiple worker threads. Hint 3: Add ignore rule stacking.

5.9 Books That Will Help

| Topic | Book | Chapter | |——-|——|———| | Filesystems | “The Linux Programming Interface” | directory APIs | | Concurrency | “Programming Rust” | thread pools |

5.10 Implementation Phases

Phase 1: Single Thread (3 days)

  • Implement traversal and filtering.
  • Checkpoint: correct output on fixtures.

Phase 2: Parallelism (4 days)

  • Add worker pool and queue.
  • Checkpoint: speedup on large repo.

Phase 3: Ignore + Binary (3 days)

  • Add ignore rules and binary detection.
  • Checkpoint: matches rg --files on fixtures.

5.11 Key Implementation Decisions

| Decision | Options | Recommendation | Rationale | |———-|———|—————-|———–| | Queue | mutex queue vs work stealing | work stealing | better scaling | | Output order | immediate vs sorted | immediate default | faster |


6. Testing Strategy

6.1 Test Categories

| Category | Purpose | Examples | |———-|———|———-| | Unit | ignore matching | .gitignore cases | | Integration | traversal output | fixture tree | | Stress | many files | generated directory |

6.2 Critical Test Cases

  1. Ignore rules exclude node_modules.
  2. Binary files skipped by default.
  3. Permission errors do not abort traversal.

6.3 Test Data

fixtures/tree/
  a.txt
  sub/ignored.bin

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

| Pitfall | Symptom | Solution | |———|———|———-| | Deadlocks | traversal hangs | avoid holding locks while I/O | | Duplicate paths | repeated output | ensure tasks unique | | Ignored files included | wrong results | test ignore hierarchy |

7.2 Debugging Strategies

  • Log worker activity on small trees.
  • Compare output with rg --files.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add --max-depth flag.
  • Add --type filters (files only, dirs only).

8.2 Intermediate Extensions

  • Add file size filters.
  • Cache ignore matcher per directory.

8.3 Advanced Extensions

  • Use git index for faster traversal inside repos.
  • Add adaptive thread scaling based on I/O.

9. Real-World Connections

9.1 Industry Applications

  • Code search tools and build systems
  • Backup and sync tools
  • ripgrep ignore crate
  • fd and bfs

9.3 Interview Relevance

  • Parallel traversal and work-stealing design

10. Resources

10.1 Essential Reading

  • ripgrep ignore documentation
  • bfs performance notes

10.2 Tools & Documentation

  • strace to inspect syscalls
  • hyperfine for traversal benchmarks

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain work-stealing traversal.
  • I can explain ignore rule stacking.

11.2 Implementation

  • Output matches rg --files on fixtures.
  • Deterministic mode works.

11.3 Growth

  • I can identify traversal bottlenecks with profiling.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Parallel walker with basic filtering

Full Completion:

  • .gitignore + binary detection

Excellence (Going Above & Beyond):

  • Git index traversal optimization
  • Adaptive thread scaling