Project 4: Filesystem Explorer (ls -R clone)

Build a recursive directory lister that reveals how names map to inodes and metadata.

Quick Reference

Attribute	Value
Difficulty	Intermediate
Time Estimate	Weekend
Main Programming Language	C (Alternatives: Rust, Go)
Alternative Programming Languages	Rust, Go
Coolness Level	See REFERENCE.md (Level 2)
Business Potential	See REFERENCE.md (Level 1)
Prerequisites	File I/O, recursion, permission bits
Key Topics	VFS, inodes, directory entries

1. Learning Objectives

By completing this project, you will:

Explain the difference between a filename and an inode.
Traverse directories safely without infinite loops.
Decode permission bits into human-readable form.
Build a deterministic, testable ls -lR clone.

2. All Theory Needed (Per-Concept Breakdown)

VFS, Inodes, and Directory Traversal

Fundamentals The Virtual File System (VFS) provides a uniform way to access files, regardless of the underlying filesystem. A directory contains names mapped to inode numbers; the inode holds metadata and pointers to file data. A file descriptor is an open handle, not the file name. When you list a directory, you read directory entries, then fetch metadata for each entry using a separate lookup. This explains why ls makes many metadata calls and why large directory listings can be slow. Understanding these relationships is critical to building a correct and efficient file explorer.

Deep Dive The VFS layer translates pathname operations into filesystem-specific operations. When you open a directory, the kernel creates a file object that represents that directory. The readdir family of calls reads directory entries, which include the filename and inode number (and sometimes a type hint). Importantly, directory entries do not contain full metadata like size, timestamps, or permissions. That data lives in the inode. This is why a listing tool must call a stat-like operation for each entry it wants to show in detail.

Directory traversal is deceptively complex because the filesystem tree can contain cycles via symlinks or bind mounts. If you recursively descend into directories without guarding against cycles, you can loop forever. A safe traversal strategy is to avoid following symlinks by default or to track visited inode numbers and device IDs. This is a place where the inode identity matters: two different pathnames can refer to the same inode, which is the key concept behind hard links.

Permission bits are also part of the inode, and they require careful decoding. The mode bits encode file type and permissions for user, group, and others. A long listing like -rw-r--r-- is a human-readable rendering of these bits. When you decode them, you are directly interpreting kernel metadata. Understanding how the bits map to access control rules also explains why a directory needs execute permission to be traversed and why a file can be readable but still inaccessible if you cannot execute the directory that contains it.

The VFS caches metadata in dentries and inodes. This means repeated listings may be fast even if the underlying filesystem is slow, because the kernel can reuse cached metadata. It also means that a file’s metadata can change between reads if another process modifies it. A correct listing tool must tolerate these races. If you get an entry from readdir and then stat fails because the file was deleted, you should handle it gracefully.

Finally, the VFS abstraction explains why the same user-space tool works across ext4, xfs, NFS, and procfs. The syscall interface is stable, and the VFS dispatches the correct operations. This is the deep reason behind the Unix mantra: everything is a file. The kernel makes this true by design.

How this fit on projects You will apply this concept in §3.1 for feature scope and in §4.1 for the traversal architecture. It also supports P08-mirror-filesystem-fuse.md where VFS callbacks are implemented.

Definitions & key terms

VFS: Virtual File System layer that abstracts filesystems.
Inode: File identity and metadata structure.
Dentry: Directory entry mapping names to inodes.
Stat: Metadata lookup for a file or directory.
Hard link: Multiple names pointing to the same inode.

Mental model diagram

Directory
  |
  +-- name -> inode
          |
          +-- metadata + data blocks

How it works

Open directory.
Read entries (names + inode numbers).
For each entry, query metadata.
Print details and recurse if needed.

Minimal concrete example

Entry: "report.txt" -> inode 192341
stat(inode 192341) -> size=512, mode=0644

Common misconceptions

“Filenames are stored inside the file.” They are stored in directory entries.
“readdir provides size and permissions.” It does not; you must stat.
“Directories are special in userspace.” They are files with a specific format.

Check-your-understanding questions

Why must a long listing call stat for each file?
What does a hard link actually link to?
Why can recursive traversal loop without safeguards?
How do permission bits affect directory traversal?

Check-your-understanding answers

Directory entries do not contain metadata like size or permissions.
It links to the inode, not the name.
Symlinks or multiple paths can create cycles.
Execute permission on a directory is required to traverse it.

Real-world applications

Building file management tools.
Diagnosing permission and ownership issues.
Understanding performance of large directories.

Where you’ll apply it

See §3.1 What You Will Build and §4.2 Key Components.
Also used in: P08-mirror-filesystem-fuse.md

References

VFS documentation: https://docs.kernel.org/filesystems/vfs.html
“The Linux Programming Interface” - file metadata chapters

Key insights Directories map names to inodes; inodes define file identity.

Summary A correct ls -R clone is a tour of VFS, dentries, and inode metadata.

Homework/Exercises to practice the concept

Create a hard link and compare inode numbers.
Remove a filename while keeping a file open and observe disk usage.

Solutions to the homework/exercises

The inode numbers match, proving both names point to the same inode.
The file remains until the last handle closes.

3. Project Specification

3.1 What You Will Build

A recursive directory lister that prints permissions, ownership, size, and timestamps. It supports -l and -R options and avoids following symlinks by default.

3.2 Functional Requirements

List directory contents with basic file names.
Long listing with permissions, ownership, and size.
Recursive mode that descends into subdirectories.

3.3 Non-Functional Requirements

Performance: Reasonable performance on large directories.
Reliability: Handle files that disappear mid-scan.
Usability: Output similar to ls -lR.

3.4 Example Usage / Output

$ ./myls -lR .
./
-rw-r--r-- 1 user staff  512 Jan 02 12:01 main.c
./subdir
-rw-r--r-- 1 user staff  128 Jan 02 12:03 notes.txt

3.5 Data Formats / Schemas / Protocols

Permission string format: rwxr-xr-x.
Timestamp format: MMM DD HH:MM.

3.6 Edge Cases

Symlink loops.
Permission denied on a subdirectory.
File deleted after reading directory entry.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

Build in project-root.
Run ./myls -lR <path>.

3.7.2 Golden Path Demo (Deterministic)

Use a fixed sample directory tree with known contents.

3.7.3 If CLI: Exact terminal transcript

$ ./myls -lR sample/
sample/
-rw-r--r-- 1 user staff  512 Jan 02 12:01 file.txt
sample/sub/
-rw-r--r-- 1 user staff  128 Jan 02 12:02 note.txt
# exit code: 0

Failure demo (deterministic):

$ ./myls -lR /root
error: permission denied
# exit code: 13

Exit codes:

0 success
13 permission error

4. Solution Architecture

4.1 High-Level Design

Path -> open dir -> read entries -> stat -> print -> recurse

4.2 Key Components

Component	Responsibility	Key Decisions
Traverser	Walk directories recursively	Skip symlinks by default
Formatter	Render permissions and timestamps	Match ls conventions
Error handler	Surface permission errors	Continue traversal

4.4 Data Structures (No Full Code)

Entry record: name, inode, type hint, stat info.
Traversal stack: list of directories to visit.

4.4 Algorithm Overview

Key Algorithm: recursive traversal

Read directory entries.
For each entry, fetch metadata.
Print details.
Recurse into subdirectories.

Complexity Analysis:

Time: O(n) for n entries.
Space: O(d) for recursion depth.

5. Implementation Guide

5.1 Development Environment Setup

# Install a C compiler and make

5.2 Project Structure

project-root/
├── src/
│   ├── myls.c
│   └── format.c
├── tests/
│   └── ls_tests.sh
└── README.md

5.3 The Core Question You’re Answering

“Where is the filename stored, and where is the metadata stored?”

5.4 Concepts You Must Understand First

Inodes vs directory entries
- What lives where?
- Book Reference: “The Linux Programming Interface” - Ch. 15-18
Permission decoding
- How to map mode bits to rwx strings.
- Book Reference: “The Linux Programming Interface” - file attributes

5.5 Questions to Guide Your Design

How will you avoid infinite recursion?
How will you handle permission errors without stopping the scan?

5.6 Thinking Exercise

Metadata Map

Pick a file and list which attributes come from the inode.

5.7 The Interview Questions They’ll Ask

“What is an inode?”
“What is the difference between hard and soft links?”
“Why does ls need to call stat?”
“How do permissions affect directory traversal?”

5.8 Hints in Layers

Hint 1: Start with non-recursive Get ls -l output for a single directory.

Hint 2: Add recursion Descend into subdirectories and track visited paths.

Hint 3: Permissions Check mode bits carefully and format correctly.

Hint 4: Debugging Compare output with ls -lR on the same tree.

5.9 Books That Will Help

Topic	Book	Chapter
File metadata	“The Linux Programming Interface”	Ch. 15
Directories	“The Linux Programming Interface”	Ch. 18

5.10 Implementation Phases

Phase 1: Foundation (1 day)

Goals:

List a single directory.

Tasks:

Read directory entries.
Print names.

Checkpoint: ./myls . works.

Phase 2: Core Functionality (2 days)

Goals:

Add long listing and recursion.

Tasks:

Fetch metadata for each entry.
Recurse into subdirectories.

Checkpoint: ./myls -lR sample/ matches expected output.

Phase 3: Polish & Edge Cases (1 day)

Goals:

Handle errors gracefully.

Tasks:

Skip unreadable directories.
Avoid symlink loops.

Checkpoint: Traversal never hangs.

5.11 Key Implementation Decisions

Decision	Options	Recommendation	Rationale
Symlink policy	Follow vs skip	Skip by default	Avoid cycles
Formatting	Exact ls vs simplified	Near-ls	Easier validation

6. Testing Strategy

6.1 Test Categories

Category	Purpose	Examples
Unit Tests	Permission decoding	Mode bit mapping
Integration Tests	Directory walk	sample tree
Edge Case Tests	Permission denied	protected dir

6.2 Critical Test Cases

Simple directory: correct output order and metadata.
Symlink loop: traversal does not hang.
Permission error: error printed, scan continues.

6.3 Test Data

Sample tree with a symlink and a protected directory

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall	Symptom	Solution
No stat calls	Missing metadata	Call stat for each entry
Symlink loops	Infinite recursion	Skip or track visited inodes
Wrong permissions	Incorrect rwx	Verify bit mapping

7.2 Debugging Strategies

Compare with ls: diff outputs for the same directory.
Print inode numbers: confirm hard links.

7.3 Performance Traps

Stat calls dominate runtime on network filesystems; minimize redundant stats.

8. Extensions & Challenges

8.1 Beginner Extensions

Add -a for hidden files.
Add -h for human-readable sizes.

8.2 Intermediate Extensions

Sort by size or time.
Add color output by file type.

8.3 Advanced Extensions

Implement find-style filtering.
Add JSON output.

9. Real-World Connections

9.1 Industry Applications

File scanners: indexing and backup tools.
Security auditing: permission and ownership checks.

coreutils ls: https://www.gnu.org/software/coreutils/ - reference implementation.
findutils: https://www.gnu.org/software/findutils/ - traversal toolkit.

9.3 Interview Relevance

Understanding inodes and permissions is a classic Linux interview topic.

10. Resources

10.1 Essential Reading

“The Linux Programming Interface” - file and directory chapters
VFS documentation - docs.kernel.org

10.2 Video Resources

“Unix filesystems explained” - lectures (search title)

10.3 Tools & Documentation

man7: https://man7.org/linux/man-pages/man2/stat.2.html
VFS docs: https://docs.kernel.org/filesystems/vfs.html

11. Self-Assessment Checklist

11.1 Understanding

I can explain inode vs directory entry
I can decode permission bits
I understand why stat is separate from readdir

11.2 Implementation

All functional requirements are met
Traversal avoids cycles
Output matches expected format

11.3 Growth

I can explain this project in an interview
I documented lessons learned
I can propose an extension

12. Submission / Completion Criteria

Minimum Viable Completion:

List directory contents recursively
Print correct permissions and sizes
Handle permission errors

Full Completion:

All minimum criteria plus:
Support -a and -h options
Deterministic output on sample tree

Excellence (Going Above & Beyond):

Sorting, filtering, and JSON output
Performance optimizations for large trees

Project 4: Filesystem Explorer (ls -R clone)

Quick Reference

1. Learning Objectives

2. All Theory Needed (Per-Concept Breakdown)

VFS, Inodes, and Directory Traversal

3. Project Specification

3.1 What You Will Build

3.2 Functional Requirements

3.3 Non-Functional Requirements

3.4 Example Usage / Output

3.5 Data Formats / Schemas / Protocols

3.6 Edge Cases

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

3.7.2 Golden Path Demo (Deterministic)

3.7.3 If CLI: Exact terminal transcript

4. Solution Architecture

4.1 High-Level Design

4.2 Key Components

4.4 Data Structures (No Full Code)

4.4 Algorithm Overview

5. Implementation Guide

5.1 Development Environment Setup

5.2 Project Structure

5.3 The Core Question You’re Answering

5.4 Concepts You Must Understand First

5.5 Questions to Guide Your Design

5.6 Thinking Exercise

5.7 The Interview Questions They’ll Ask

5.8 Hints in Layers

5.9 Books That Will Help

5.10 Implementation Phases

Phase 1: Foundation (1 day)

Phase 2: Core Functionality (2 days)

Phase 3: Polish & Edge Cases (1 day)

5.11 Key Implementation Decisions

6. Testing Strategy

6.1 Test Categories

6.2 Critical Test Cases

6.3 Test Data

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

7.2 Debugging Strategies

7.3 Performance Traps

8. Extensions & Challenges

8.1 Beginner Extensions

8.2 Intermediate Extensions

8.3 Advanced Extensions

9. Real-World Connections

9.1 Industry Applications

9.2 Related Open Source Projects

9.3 Interview Relevance

10. Resources

10.1 Essential Reading

10.2 Video Resources

10.3 Tools & Documentation

10.4 Related Projects in This Series

11. Self-Assessment Checklist

11.1 Understanding

11.2 Implementation

11.3 Growth

12. Submission / Completion Criteria