Project 4: Filesystem Explorer (ls -R clone)
Build a recursive directory lister that reveals how names map to inodes and metadata.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Intermediate |
| Time Estimate | Weekend |
| Main Programming Language | C (Alternatives: Rust, Go) |
| Alternative Programming Languages | Rust, Go |
| Coolness Level | See REFERENCE.md (Level 2) |
| Business Potential | See REFERENCE.md (Level 1) |
| Prerequisites | File I/O, recursion, permission bits |
| Key Topics | VFS, inodes, directory entries |
1. Learning Objectives
By completing this project, you will:
- Explain the difference between a filename and an inode.
- Traverse directories safely without infinite loops.
- Decode permission bits into human-readable form.
- Build a deterministic, testable
ls -lRclone.
2. All Theory Needed (Per-Concept Breakdown)
VFS, Inodes, and Directory Traversal
Fundamentals
The Virtual File System (VFS) provides a uniform way to access files, regardless of the underlying filesystem. A directory contains names mapped to inode numbers; the inode holds metadata and pointers to file data. A file descriptor is an open handle, not the file name. When you list a directory, you read directory entries, then fetch metadata for each entry using a separate lookup. This explains why ls makes many metadata calls and why large directory listings can be slow. Understanding these relationships is critical to building a correct and efficient file explorer.
Deep Dive
The VFS layer translates pathname operations into filesystem-specific operations. When you open a directory, the kernel creates a file object that represents that directory. The readdir family of calls reads directory entries, which include the filename and inode number (and sometimes a type hint). Importantly, directory entries do not contain full metadata like size, timestamps, or permissions. That data lives in the inode. This is why a listing tool must call a stat-like operation for each entry it wants to show in detail.
Directory traversal is deceptively complex because the filesystem tree can contain cycles via symlinks or bind mounts. If you recursively descend into directories without guarding against cycles, you can loop forever. A safe traversal strategy is to avoid following symlinks by default or to track visited inode numbers and device IDs. This is a place where the inode identity matters: two different pathnames can refer to the same inode, which is the key concept behind hard links.
Permission bits are also part of the inode, and they require careful decoding. The mode bits encode file type and permissions for user, group, and others. A long listing like -rw-r--r-- is a human-readable rendering of these bits. When you decode them, you are directly interpreting kernel metadata. Understanding how the bits map to access control rules also explains why a directory needs execute permission to be traversed and why a file can be readable but still inaccessible if you cannot execute the directory that contains it.
The VFS caches metadata in dentries and inodes. This means repeated listings may be fast even if the underlying filesystem is slow, because the kernel can reuse cached metadata. It also means that a file’s metadata can change between reads if another process modifies it. A correct listing tool must tolerate these races. If you get an entry from readdir and then stat fails because the file was deleted, you should handle it gracefully.
Finally, the VFS abstraction explains why the same user-space tool works across ext4, xfs, NFS, and procfs. The syscall interface is stable, and the VFS dispatches the correct operations. This is the deep reason behind the Unix mantra: everything is a file. The kernel makes this true by design.
How this fit on projects You will apply this concept in §3.1 for feature scope and in §4.1 for the traversal architecture. It also supports P08-mirror-filesystem-fuse.md where VFS callbacks are implemented.
Definitions & key terms
- VFS: Virtual File System layer that abstracts filesystems.
- Inode: File identity and metadata structure.
- Dentry: Directory entry mapping names to inodes.
- Stat: Metadata lookup for a file or directory.
- Hard link: Multiple names pointing to the same inode.
Mental model diagram
Directory
|
+-- name -> inode
|
+-- metadata + data blocks
How it works
- Open directory.
- Read entries (names + inode numbers).
- For each entry, query metadata.
- Print details and recurse if needed.
Minimal concrete example
Entry: "report.txt" -> inode 192341
stat(inode 192341) -> size=512, mode=0644
Common misconceptions
- “Filenames are stored inside the file.” They are stored in directory entries.
- “readdir provides size and permissions.” It does not; you must stat.
- “Directories are special in userspace.” They are files with a specific format.
Check-your-understanding questions
- Why must a long listing call stat for each file?
- What does a hard link actually link to?
- Why can recursive traversal loop without safeguards?
- How do permission bits affect directory traversal?
Check-your-understanding answers
- Directory entries do not contain metadata like size or permissions.
- It links to the inode, not the name.
- Symlinks or multiple paths can create cycles.
- Execute permission on a directory is required to traverse it.
Real-world applications
- Building file management tools.
- Diagnosing permission and ownership issues.
- Understanding performance of large directories.
Where you’ll apply it
- See §3.1 What You Will Build and §4.2 Key Components.
- Also used in: P08-mirror-filesystem-fuse.md
References
- VFS documentation: https://docs.kernel.org/filesystems/vfs.html
- “The Linux Programming Interface” - file metadata chapters
Key insights Directories map names to inodes; inodes define file identity.
Summary
A correct ls -R clone is a tour of VFS, dentries, and inode metadata.
Homework/Exercises to practice the concept
- Create a hard link and compare inode numbers.
- Remove a filename while keeping a file open and observe disk usage.
Solutions to the homework/exercises
- The inode numbers match, proving both names point to the same inode.
- The file remains until the last handle closes.
3. Project Specification
3.1 What You Will Build
A recursive directory lister that prints permissions, ownership, size, and timestamps. It supports -l and -R options and avoids following symlinks by default.
3.2 Functional Requirements
- List directory contents with basic file names.
- Long listing with permissions, ownership, and size.
- Recursive mode that descends into subdirectories.
3.3 Non-Functional Requirements
- Performance: Reasonable performance on large directories.
- Reliability: Handle files that disappear mid-scan.
- Usability: Output similar to
ls -lR.
3.4 Example Usage / Output
$ ./myls -lR .
./
-rw-r--r-- 1 user staff 512 Jan 02 12:01 main.c
./subdir
-rw-r--r-- 1 user staff 128 Jan 02 12:03 notes.txt
3.5 Data Formats / Schemas / Protocols
- Permission string format:
rwxr-xr-x. - Timestamp format:
MMM DD HH:MM.
3.6 Edge Cases
- Symlink loops.
- Permission denied on a subdirectory.
- File deleted after reading directory entry.
3.7 Real World Outcome
3.7.1 How to Run (Copy/Paste)
- Build in
project-root. - Run
./myls -lR <path>.
3.7.2 Golden Path Demo (Deterministic)
Use a fixed sample directory tree with known contents.
3.7.3 If CLI: Exact terminal transcript
$ ./myls -lR sample/
sample/
-rw-r--r-- 1 user staff 512 Jan 02 12:01 file.txt
sample/sub/
-rw-r--r-- 1 user staff 128 Jan 02 12:02 note.txt
# exit code: 0
Failure demo (deterministic):
$ ./myls -lR /root
error: permission denied
# exit code: 13
Exit codes:
- 0 success
- 13 permission error
4. Solution Architecture
4.1 High-Level Design
Path -> open dir -> read entries -> stat -> print -> recurse
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Traverser | Walk directories recursively | Skip symlinks by default |
| Formatter | Render permissions and timestamps | Match ls conventions |
| Error handler | Surface permission errors | Continue traversal |
4.4 Data Structures (No Full Code)
- Entry record: name, inode, type hint, stat info.
- Traversal stack: list of directories to visit.
4.4 Algorithm Overview
Key Algorithm: recursive traversal
- Read directory entries.
- For each entry, fetch metadata.
- Print details.
- Recurse into subdirectories.
Complexity Analysis:
- Time: O(n) for n entries.
- Space: O(d) for recursion depth.
5. Implementation Guide
5.1 Development Environment Setup
# Install a C compiler and make
5.2 Project Structure
project-root/
├── src/
│ ├── myls.c
│ └── format.c
├── tests/
│ └── ls_tests.sh
└── README.md
5.3 The Core Question You’re Answering
“Where is the filename stored, and where is the metadata stored?”
5.4 Concepts You Must Understand First
- Inodes vs directory entries
- What lives where?
- Book Reference: “The Linux Programming Interface” - Ch. 15-18
- Permission decoding
- How to map mode bits to rwx strings.
- Book Reference: “The Linux Programming Interface” - file attributes
5.5 Questions to Guide Your Design
- How will you avoid infinite recursion?
- How will you handle permission errors without stopping the scan?
5.6 Thinking Exercise
Metadata Map
Pick a file and list which attributes come from the inode.
5.7 The Interview Questions They’ll Ask
- “What is an inode?”
- “What is the difference between hard and soft links?”
- “Why does
lsneed to call stat?” - “How do permissions affect directory traversal?”
5.8 Hints in Layers
Hint 1: Start with non-recursive
Get ls -l output for a single directory.
Hint 2: Add recursion Descend into subdirectories and track visited paths.
Hint 3: Permissions Check mode bits carefully and format correctly.
Hint 4: Debugging
Compare output with ls -lR on the same tree.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| File metadata | “The Linux Programming Interface” | Ch. 15 |
| Directories | “The Linux Programming Interface” | Ch. 18 |
5.10 Implementation Phases
Phase 1: Foundation (1 day)
Goals:
- List a single directory.
Tasks:
- Read directory entries.
- Print names.
Checkpoint: ./myls . works.
Phase 2: Core Functionality (2 days)
Goals:
- Add long listing and recursion.
Tasks:
- Fetch metadata for each entry.
- Recurse into subdirectories.
Checkpoint: ./myls -lR sample/ matches expected output.
Phase 3: Polish & Edge Cases (1 day)
Goals:
- Handle errors gracefully.
Tasks:
- Skip unreadable directories.
- Avoid symlink loops.
Checkpoint: Traversal never hangs.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Symlink policy | Follow vs skip | Skip by default | Avoid cycles |
| Formatting | Exact ls vs simplified | Near-ls | Easier validation |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Permission decoding | Mode bit mapping |
| Integration Tests | Directory walk | sample tree |
| Edge Case Tests | Permission denied | protected dir |
6.2 Critical Test Cases
- Simple directory: correct output order and metadata.
- Symlink loop: traversal does not hang.
- Permission error: error printed, scan continues.
6.3 Test Data
Sample tree with a symlink and a protected directory
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| No stat calls | Missing metadata | Call stat for each entry |
| Symlink loops | Infinite recursion | Skip or track visited inodes |
| Wrong permissions | Incorrect rwx | Verify bit mapping |
7.2 Debugging Strategies
- Compare with ls: diff outputs for the same directory.
- Print inode numbers: confirm hard links.
7.3 Performance Traps
Stat calls dominate runtime on network filesystems; minimize redundant stats.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add
-afor hidden files. - Add
-hfor human-readable sizes.
8.2 Intermediate Extensions
- Sort by size or time.
- Add color output by file type.
8.3 Advanced Extensions
- Implement
find-style filtering. - Add JSON output.
9. Real-World Connections
9.1 Industry Applications
- File scanners: indexing and backup tools.
- Security auditing: permission and ownership checks.
9.2 Related Open Source Projects
- coreutils ls: https://www.gnu.org/software/coreutils/ - reference implementation.
- findutils: https://www.gnu.org/software/findutils/ - traversal toolkit.
9.3 Interview Relevance
Understanding inodes and permissions is a classic Linux interview topic.
10. Resources
10.1 Essential Reading
- “The Linux Programming Interface” - file and directory chapters
- VFS documentation - docs.kernel.org
10.2 Video Resources
- “Unix filesystems explained” - lectures (search title)
10.3 Tools & Documentation
- man7: https://man7.org/linux/man-pages/man2/stat.2.html
- VFS docs: https://docs.kernel.org/filesystems/vfs.html
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain inode vs directory entry
- I can decode permission bits
- I understand why stat is separate from readdir
11.2 Implementation
- All functional requirements are met
- Traversal avoids cycles
- Output matches expected format
11.3 Growth
- I can explain this project in an interview
- I documented lessons learned
- I can propose an extension
12. Submission / Completion Criteria
Minimum Viable Completion:
- List directory contents recursively
- Print correct permissions and sizes
- Handle permission errors
Full Completion:
- All minimum criteria plus:
- Support
-aand-hoptions - Deterministic output on sample tree
Excellence (Going Above & Beyond):
- Sorting, filtering, and JSON output
- Performance optimizations for large trees