Project 4: Filesystem Explorer (ls -R clone)

Build a recursive directory lister that reveals how names map to inodes and metadata.

Quick Reference

Attribute Value
Difficulty Intermediate
Time Estimate Weekend
Main Programming Language C (Alternatives: Rust, Go)
Alternative Programming Languages Rust, Go
Coolness Level See REFERENCE.md (Level 2)
Business Potential See REFERENCE.md (Level 1)
Prerequisites File I/O, recursion, permission bits
Key Topics VFS, inodes, directory entries

1. Learning Objectives

By completing this project, you will:

  1. Explain the difference between a filename and an inode.
  2. Traverse directories safely without infinite loops.
  3. Decode permission bits into human-readable form.
  4. Build a deterministic, testable ls -lR clone.

2. All Theory Needed (Per-Concept Breakdown)

VFS, Inodes, and Directory Traversal

Fundamentals The Virtual File System (VFS) provides a uniform way to access files, regardless of the underlying filesystem. A directory contains names mapped to inode numbers; the inode holds metadata and pointers to file data. A file descriptor is an open handle, not the file name. When you list a directory, you read directory entries, then fetch metadata for each entry using a separate lookup. This explains why ls makes many metadata calls and why large directory listings can be slow. Understanding these relationships is critical to building a correct and efficient file explorer.

Deep Dive The VFS layer translates pathname operations into filesystem-specific operations. When you open a directory, the kernel creates a file object that represents that directory. The readdir family of calls reads directory entries, which include the filename and inode number (and sometimes a type hint). Importantly, directory entries do not contain full metadata like size, timestamps, or permissions. That data lives in the inode. This is why a listing tool must call a stat-like operation for each entry it wants to show in detail.

Directory traversal is deceptively complex because the filesystem tree can contain cycles via symlinks or bind mounts. If you recursively descend into directories without guarding against cycles, you can loop forever. A safe traversal strategy is to avoid following symlinks by default or to track visited inode numbers and device IDs. This is a place where the inode identity matters: two different pathnames can refer to the same inode, which is the key concept behind hard links.

Permission bits are also part of the inode, and they require careful decoding. The mode bits encode file type and permissions for user, group, and others. A long listing like -rw-r--r-- is a human-readable rendering of these bits. When you decode them, you are directly interpreting kernel metadata. Understanding how the bits map to access control rules also explains why a directory needs execute permission to be traversed and why a file can be readable but still inaccessible if you cannot execute the directory that contains it.

The VFS caches metadata in dentries and inodes. This means repeated listings may be fast even if the underlying filesystem is slow, because the kernel can reuse cached metadata. It also means that a file’s metadata can change between reads if another process modifies it. A correct listing tool must tolerate these races. If you get an entry from readdir and then stat fails because the file was deleted, you should handle it gracefully.

Finally, the VFS abstraction explains why the same user-space tool works across ext4, xfs, NFS, and procfs. The syscall interface is stable, and the VFS dispatches the correct operations. This is the deep reason behind the Unix mantra: everything is a file. The kernel makes this true by design.

How this fit on projects You will apply this concept in §3.1 for feature scope and in §4.1 for the traversal architecture. It also supports P08-mirror-filesystem-fuse.md where VFS callbacks are implemented.

Definitions & key terms

  • VFS: Virtual File System layer that abstracts filesystems.
  • Inode: File identity and metadata structure.
  • Dentry: Directory entry mapping names to inodes.
  • Stat: Metadata lookup for a file or directory.
  • Hard link: Multiple names pointing to the same inode.

Mental model diagram

Directory
  |
  +-- name -> inode
          |
          +-- metadata + data blocks

How it works

  1. Open directory.
  2. Read entries (names + inode numbers).
  3. For each entry, query metadata.
  4. Print details and recurse if needed.

Minimal concrete example

Entry: "report.txt" -> inode 192341
stat(inode 192341) -> size=512, mode=0644

Common misconceptions

  • “Filenames are stored inside the file.” They are stored in directory entries.
  • “readdir provides size and permissions.” It does not; you must stat.
  • “Directories are special in userspace.” They are files with a specific format.

Check-your-understanding questions

  1. Why must a long listing call stat for each file?
  2. What does a hard link actually link to?
  3. Why can recursive traversal loop without safeguards?
  4. How do permission bits affect directory traversal?

Check-your-understanding answers

  1. Directory entries do not contain metadata like size or permissions.
  2. It links to the inode, not the name.
  3. Symlinks or multiple paths can create cycles.
  4. Execute permission on a directory is required to traverse it.

Real-world applications

  • Building file management tools.
  • Diagnosing permission and ownership issues.
  • Understanding performance of large directories.

Where you’ll apply it

References

  • VFS documentation: https://docs.kernel.org/filesystems/vfs.html
  • “The Linux Programming Interface” - file metadata chapters

Key insights Directories map names to inodes; inodes define file identity.

Summary A correct ls -R clone is a tour of VFS, dentries, and inode metadata.

Homework/Exercises to practice the concept

  1. Create a hard link and compare inode numbers.
  2. Remove a filename while keeping a file open and observe disk usage.

Solutions to the homework/exercises

  1. The inode numbers match, proving both names point to the same inode.
  2. The file remains until the last handle closes.

3. Project Specification

3.1 What You Will Build

A recursive directory lister that prints permissions, ownership, size, and timestamps. It supports -l and -R options and avoids following symlinks by default.

3.2 Functional Requirements

  1. List directory contents with basic file names.
  2. Long listing with permissions, ownership, and size.
  3. Recursive mode that descends into subdirectories.

3.3 Non-Functional Requirements

  • Performance: Reasonable performance on large directories.
  • Reliability: Handle files that disappear mid-scan.
  • Usability: Output similar to ls -lR.

3.4 Example Usage / Output

$ ./myls -lR .
./
-rw-r--r-- 1 user staff  512 Jan 02 12:01 main.c
./subdir
-rw-r--r-- 1 user staff  128 Jan 02 12:03 notes.txt

3.5 Data Formats / Schemas / Protocols

  • Permission string format: rwxr-xr-x.
  • Timestamp format: MMM DD HH:MM.

3.6 Edge Cases

  • Symlink loops.
  • Permission denied on a subdirectory.
  • File deleted after reading directory entry.

3.7 Real World Outcome

3.7.1 How to Run (Copy/Paste)

  • Build in project-root.
  • Run ./myls -lR <path>.

3.7.2 Golden Path Demo (Deterministic)

Use a fixed sample directory tree with known contents.

3.7.3 If CLI: Exact terminal transcript

$ ./myls -lR sample/
sample/
-rw-r--r-- 1 user staff  512 Jan 02 12:01 file.txt
sample/sub/
-rw-r--r-- 1 user staff  128 Jan 02 12:02 note.txt
# exit code: 0

Failure demo (deterministic):

$ ./myls -lR /root
error: permission denied
# exit code: 13

Exit codes:

  • 0 success
  • 13 permission error

4. Solution Architecture

4.1 High-Level Design

Path -> open dir -> read entries -> stat -> print -> recurse

4.2 Key Components

Component Responsibility Key Decisions
Traverser Walk directories recursively Skip symlinks by default
Formatter Render permissions and timestamps Match ls conventions
Error handler Surface permission errors Continue traversal

4.4 Data Structures (No Full Code)

  • Entry record: name, inode, type hint, stat info.
  • Traversal stack: list of directories to visit.

4.4 Algorithm Overview

Key Algorithm: recursive traversal

  1. Read directory entries.
  2. For each entry, fetch metadata.
  3. Print details.
  4. Recurse into subdirectories.

Complexity Analysis:

  • Time: O(n) for n entries.
  • Space: O(d) for recursion depth.

5. Implementation Guide

5.1 Development Environment Setup

# Install a C compiler and make

5.2 Project Structure

project-root/
├── src/
│   ├── myls.c
│   └── format.c
├── tests/
│   └── ls_tests.sh
└── README.md

5.3 The Core Question You’re Answering

“Where is the filename stored, and where is the metadata stored?”

5.4 Concepts You Must Understand First

  1. Inodes vs directory entries
    • What lives where?
    • Book Reference: “The Linux Programming Interface” - Ch. 15-18
  2. Permission decoding
    • How to map mode bits to rwx strings.
    • Book Reference: “The Linux Programming Interface” - file attributes

5.5 Questions to Guide Your Design

  1. How will you avoid infinite recursion?
  2. How will you handle permission errors without stopping the scan?

5.6 Thinking Exercise

Metadata Map

Pick a file and list which attributes come from the inode.

5.7 The Interview Questions They’ll Ask

  1. “What is an inode?”
  2. “What is the difference between hard and soft links?”
  3. “Why does ls need to call stat?”
  4. “How do permissions affect directory traversal?”

5.8 Hints in Layers

Hint 1: Start with non-recursive Get ls -l output for a single directory.

Hint 2: Add recursion Descend into subdirectories and track visited paths.

Hint 3: Permissions Check mode bits carefully and format correctly.

Hint 4: Debugging Compare output with ls -lR on the same tree.

5.9 Books That Will Help

Topic Book Chapter
File metadata “The Linux Programming Interface” Ch. 15
Directories “The Linux Programming Interface” Ch. 18

5.10 Implementation Phases

Phase 1: Foundation (1 day)

Goals:

  • List a single directory.

Tasks:

  1. Read directory entries.
  2. Print names.

Checkpoint: ./myls . works.

Phase 2: Core Functionality (2 days)

Goals:

  • Add long listing and recursion.

Tasks:

  1. Fetch metadata for each entry.
  2. Recurse into subdirectories.

Checkpoint: ./myls -lR sample/ matches expected output.

Phase 3: Polish & Edge Cases (1 day)

Goals:

  • Handle errors gracefully.

Tasks:

  1. Skip unreadable directories.
  2. Avoid symlink loops.

Checkpoint: Traversal never hangs.

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
Symlink policy Follow vs skip Skip by default Avoid cycles
Formatting Exact ls vs simplified Near-ls Easier validation

6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Unit Tests Permission decoding Mode bit mapping
Integration Tests Directory walk sample tree
Edge Case Tests Permission denied protected dir

6.2 Critical Test Cases

  1. Simple directory: correct output order and metadata.
  2. Symlink loop: traversal does not hang.
  3. Permission error: error printed, scan continues.

6.3 Test Data

Sample tree with a symlink and a protected directory

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
No stat calls Missing metadata Call stat for each entry
Symlink loops Infinite recursion Skip or track visited inodes
Wrong permissions Incorrect rwx Verify bit mapping

7.2 Debugging Strategies

  • Compare with ls: diff outputs for the same directory.
  • Print inode numbers: confirm hard links.

7.3 Performance Traps

Stat calls dominate runtime on network filesystems; minimize redundant stats.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add -a for hidden files.
  • Add -h for human-readable sizes.

8.2 Intermediate Extensions

  • Sort by size or time.
  • Add color output by file type.

8.3 Advanced Extensions

  • Implement find-style filtering.
  • Add JSON output.

9. Real-World Connections

9.1 Industry Applications

  • File scanners: indexing and backup tools.
  • Security auditing: permission and ownership checks.
  • coreutils ls: https://www.gnu.org/software/coreutils/ - reference implementation.
  • findutils: https://www.gnu.org/software/findutils/ - traversal toolkit.

9.3 Interview Relevance

Understanding inodes and permissions is a classic Linux interview topic.


10. Resources

10.1 Essential Reading

  • “The Linux Programming Interface” - file and directory chapters
  • VFS documentation - docs.kernel.org

10.2 Video Resources

  • “Unix filesystems explained” - lectures (search title)

10.3 Tools & Documentation

  • man7: https://man7.org/linux/man-pages/man2/stat.2.html
  • VFS docs: https://docs.kernel.org/filesystems/vfs.html

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain inode vs directory entry
  • I can decode permission bits
  • I understand why stat is separate from readdir

11.2 Implementation

  • All functional requirements are met
  • Traversal avoids cycles
  • Output matches expected format

11.3 Growth

  • I can explain this project in an interview
  • I documented lessons learned
  • I can propose an extension

12. Submission / Completion Criteria

Minimum Viable Completion:

  • List directory contents recursively
  • Print correct permissions and sizes
  • Handle permission errors

Full Completion:

  • All minimum criteria plus:
  • Support -a and -h options
  • Deterministic output on sample tree

Excellence (Going Above & Beyond):

  • Sorting, filtering, and JSON output
  • Performance optimizations for large trees