Project 10: Globbing Engine
Implement globbing (wildcard expansion) for filenames with proper dotfile semantics.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate (The Developer) |
| Time Estimate | 1 week |
| Main Programming Language | C |
| Alternative Programming Languages | Rust, Go |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 1. The “Resume Gold” (Educational/Personal Brand) |
| Prerequisites | filesystem APIs, pattern matching, argv construction |
| Key Topics | wildcards, directory traversal, sorting |
1. Learning Objectives
By completing this project, you will:
- Explain and implement wildcards in the context of a shell.
- Build a working globbing engine that matches the project specification.
- Design tests that validate correctness and edge cases.
- Document design decisions, trade-offs, and limitations.
2. All Theory Needed (Per-Concept Breakdown)
Globbing and Filename Expansion
Fundamentals
Globbing expands wildcard patterns like *, ?, and [...] into matching filenames. It happens after variable expansion and before command execution. A correct globbing engine respects quoting rules (quoted wildcards do not expand), handles dotfiles appropriately, and returns a sorted list of matches. If no matches are found, shells either leave the pattern unchanged or report an error depending on options. Globbing is essential for user productivity and for scripts that operate on file sets.
Deep Dive into the concept
Globbing is essentially pattern matching over directory entries. The simplest wildcard, *, matches any sequence of characters (including empty). ? matches exactly one character. Character classes like [abc] match any of the listed characters, and ranges like [a-z] match any character in that range. Some shells support negated classes ([!a-z]) and brace expansion ({foo,bar}), but brace expansion is usually a separate pre-globbing step. Your engine should at least implement the core POSIX wildcard rules.
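To make the core rules concrete, here is a minimal recursive matcher sketch in C. It covers `*`, `?`, and simple `[...]`/`[!...]` classes with ranges; it ignores locale classes, backslash escaping, and the dotfile rule (which belongs at the segment level). The name `wc_match` is illustrative, not a standard API:

```c
#include <stdbool.h>

/* Sketch of a POSIX-style wildcard matcher: `*`, `?`, and `[...]`. */
static bool wc_match(const char *pat, const char *s)
{
    if (*pat == '\0')
        return *s == '\0';

    if (*pat == '*') {
        /* `*` matches any sequence, including the empty one. */
        for (;; s++) {
            if (wc_match(pat + 1, s))
                return true;
            if (*s == '\0')
                return false;
        }
    }

    if (*s == '\0')
        return false;

    if (*pat == '[') {
        const char *p = pat + 1;
        bool negate = (*p == '!');
        if (negate)
            p++;
        bool hit = false;
        for (; *p && *p != ']'; p++) {
            if (p[1] == '-' && p[2] && p[2] != ']') {   /* range like a-z */
                if (*s >= *p && *s <= p[2])
                    hit = true;
                p += 2;
            } else if (*p == *s) {
                hit = true;
            }
        }
        if (*p != ']')   /* unterminated class: treat '[' as a literal */
            return *s == '[' && wc_match(pat + 1, s + 1);
        return (hit != negate) && wc_match(p + 1, s + 1);
    }

    if (*pat == '?' || *pat == *s)
        return wc_match(pat + 1, s + 1);

    return false;
}
```

The naive `*` recursion is exponential in pathological cases; a production matcher would use an iterative backtracking loop, but the recursive form mirrors the rules most directly.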
Matching is not just string matching; it is filesystem traversal. The pattern may include path separators, so you need to apply matching per path segment. For example, src/*.c should match src directory entries that end in .c but should not traverse further. A standard approach is to split the pattern by / and recursively match each segment against the entries of the current directory. If a segment contains no wildcard characters, you can simply append it to the path and continue; if it does contain wildcards, you scan the directory and test each entry. This approach avoids full directory recursion when it is not needed.
Dotfiles (names beginning with .) are special. In POSIX shells, wildcards do not match dotfiles unless the pattern explicitly begins with a dot. This is why * does not include .git but .* does. Your pattern matcher must implement this rule at the segment level, not globally. Another subtle rule is that . and .. should be excluded from matches unless explicitly requested; failing to do this can cause dangerous expansions in scripts.
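The segment-level rule can be sketched as a thin wrapper around POSIX `fnmatch(3)`, with flags set to 0 so the dotfile decision stays in our own code; `glob_match_entry` is an illustrative name:

```c
#include <fnmatch.h>
#include <stdbool.h>
#include <string.h>

/* Sketch of the segment-level dotfile rule. Raw wildcard matching is
 * delegated to fnmatch(3); the policy decisions live here. */
static bool glob_match_entry(const char *pat, const char *name)
{
    /* "." and ".." are excluded unless the pattern names them literally. */
    if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
        return strcmp(pat, name) == 0;

    /* A name starting with '.' requires a pattern starting with '.'. */
    if (name[0] == '.' && pat[0] != '.')
        return false;

    return fnmatch(pat, name, 0) == 0;
}
```

Note that `fnmatch` also offers `FNM_PERIOD` to enforce the leading-dot rule for you, but keeping the check explicit makes the `.`/`..` exclusion easy to see and adjust.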
Sorting is also expected. Most shells return glob matches in lexicographic order. This order matters for scripts that rely on predictable expansion. If you return matches in filesystem order, results will appear random across runs. Implement a sort using strcmp or locale-aware comparison if you want to be fully POSIX-compliant.
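The sorting step is a one-liner with `qsort` and `strcmp` (swap in `strcoll` for locale-aware ordering); the function names below are illustrative:

```c
#include <stdlib.h>
#include <string.h>

/* Lexicographic sort of glob matches, as shells do in the C locale. */
static int cmp_name(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

static void sort_matches(char **matches, size_t count)
{
    qsort(matches, count, sizeof *matches, cmp_name);
}
```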
If a pattern matches nothing, behavior depends on options. POSIX shells leave the pattern unchanged (so echo *.foo prints *.foo if no matches exist). Some shells offer options like nullglob (return nothing) or failglob (treat as error). For your project, define a default and document it clearly. The expansion engine should be designed to add optional behaviors later.
How this fits into the project: Globbing feeds into argument vector construction and must respect quoting and expansion order.
Definitions & key terms
- Glob: Wildcard pattern expansion to filenames.
- Wildcard: `*`, `?`, or `[]` pattern character.
- Segment: Path component between `/` separators.
- Dotfile rule: Wildcards do not match dotfiles unless the pattern starts with `.`.
Mental model diagram
pattern: src/*.c
split -> ["src", "*.c"] -> list entries in src -> match *.c
How it works (step-by-step)
- Detect tokens containing wildcard characters.
- Split pattern into path segments.
- For each segment, list directory entries and filter by pattern.
- Combine matches across segments (recursive expansion).
- Sort results and return them as argv entries.
Minimal concrete example
Input: echo src/*.c
Output: echo src/main.c src/util.c
Common misconceptions
- “`*` matches dotfiles” -> not unless the pattern starts with `.`.
- “Globbing is regex” -> glob syntax is simpler and different.
- “Order doesn’t matter” -> shells sort matches.
Check-your-understanding questions
- Why does `*` not match `.git` by default?
- How should `src/*/*.c` be expanded?
- What happens when no files match a glob?
Check-your-understanding answers
- POSIX rule: dotfiles require explicit dot in pattern.
- Recursively match each segment, listing subdirectories.
- Usually the pattern remains unchanged unless options say otherwise.
Real-world applications
- File management scripts.
- Build systems that compile sets of files.
- CLI tooling that accepts wildcard arguments.
Where you’ll apply it
- In this project: see §3.5 Data Formats and §5.10 Phase 2.
- Also used in: None
References
- POSIX Shell Command Language (pattern matching).
- “The Linux Programming Interface” (directory traversal).
Key insights: Globbing is recursive directory matching, not regex, and it must respect dotfile rules.
Summary: A correct globbing engine expands patterns into sorted file lists while preserving quoting and dotfile semantics.
Homework/Exercises to practice the concept
- Write a function to match a single wildcard pattern against a string.
- Extend it to traverse directories for `*/` patterns.
- Add sorting and dotfile handling.
Solutions to the homework/exercises
- Implement `*` and `?` matching with recursion.
- Split by `/` and recursively match segments.
- Filter dotfiles unless the pattern begins with `.`, and sort results.
Argument Vector Construction and PATH Lookup
Fundamentals
Before a shell can execute a command, it must convert a line of text into an argument vector (argv) and locate the program on disk. This seems simple, but it is the glue between parsing and execution. The first word becomes the command name, and the remaining words become arguments passed to the program. For external commands without a slash, the shell must search each directory in the PATH environment variable, build candidate paths, and test whether they are executable. A correct implementation handles empty path elements, . in PATH, and permission errors. Even minimal shells depend on correct argv construction and path resolution.
Deep Dive into the concept
The execve() system call expects two critical inputs: argv, an array of strings where argv[0] is the program name, and envp, an array of environment variables. Building argv is easy only if you ignore quoting, escaping, and expansions; for a minimal executor, you may split on whitespace, but the code should still produce a valid, NULL-terminated array. In a full shell, argv construction happens after expansions, quote removal, and field splitting. The order matters: for example, quoted strings should not be split by whitespace, and globbing should expand to multiple argv entries. Even in a minimal shell, you should treat consecutive spaces as a single separator and preserve the token order exactly.
PATH lookup is equally subtle. If the command contains a /, the shell must treat it as a path and attempt to execute directly. If it does not, the shell searches the colon-separated list in PATH. Each element can be empty; an empty element means “current directory.” When you iterate over PATH, you must build dir + "/" + cmd carefully, handle trailing slashes, and check execute permissions with access(path, X_OK) or by attempting execve and checking errno. The shell should distinguish between “not found” and “not executable”: POSIX uses exit status 127 for missing commands and 126 for found but non-executable commands. For errors like ENOEXEC (text file without shebang), the shell may choose to run it with /bin/sh or return an error, depending on your design scope.
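The search loop described here could be sketched as follows; `path_lookup` is a hypothetical helper, and the 126/127 mapping follows the POSIX convention mentioned above:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Manual PATH search (sketch). Fills `out` and returns 0 when an
 * executable candidate is found; otherwise returns a shell-style
 * status: 127 "not found", 126 "found but not executable". */
static int path_lookup(const char *cmd, char *out, size_t outsz)
{
    if (strchr(cmd, '/')) {              /* explicit path: skip search */
        snprintf(out, outsz, "%s", cmd);
        if (access(out, X_OK) == 0)
            return 0;
        return access(out, F_OK) == 0 ? 126 : 127;
    }

    const char *path = getenv("PATH");
    if (!path)
        path = "/usr/bin:/bin";          /* assumed fallback */
    int found_nonexec = 0;

    const char *p = path;
    for (;;) {
        const char *colon = strchr(p, ':');
        size_t len = colon ? (size_t)(colon - p) : strlen(p);
        /* An empty element means the current directory. */
        snprintf(out, outsz, "%.*s%s%s", (int)len, p,
                 len ? "/" : "./", cmd);
        if (access(out, X_OK) == 0)
            return 0;                    /* found and executable */
        if (access(out, F_OK) == 0)
            found_nonexec = 1;           /* exists but not runnable */
        if (!colon)
            break;
        p = colon + 1;
    }
    return found_nonexec ? 126 : 127;
}
```

Note that `access` checks the real UID, not the effective one, so a full shell would attempt `execve` and inspect `errno` instead; the sketch keeps the simpler check for clarity.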
Another detail is the difference between execvp() and manual PATH search. execvp() searches PATH for you, but you still need to handle error mapping and messaging. If you implement the search manually, you can produce more informative diagnostics, record the resolved path, and implement hashing to cache results. Shells like bash maintain a hash table of command lookups to avoid repeated directory scans, invalidating the cache when PATH changes. While your minimal executor may skip hashing, you should structure the code so it can be added later.
Finally, argument vector construction is not just string splitting; it is a data-structure problem. You must allocate storage for each token, store them in a contiguous char* array, and ensure the array is NULL-terminated. You also need to decide ownership and lifetimes: who frees the strings after exec or built-in handling? A consistent memory strategy will prevent leaks and double frees. Even a small shell benefits from a “command struct” that owns argv, the original line, and any metadata such as redirections.
How this fits into the project: This concept connects tokenization to execution and appears in every project that launches commands or resolves filenames.
Definitions & key terms
- argv: Argument vector passed to `execve()`.
- PATH: Colon-separated list of directories used for command lookup.
- Shebang: `#!` line that selects an interpreter for a script.
- X_OK: Permission check for executability.
Mental model diagram
"ls -l /tmp" -> tokens -> argv[] -> PATH search -> /bin/ls -> execve
How it works (step-by-step)
- Tokenize line into words.
- Build `argv` array and append a NULL terminator.
- If the command contains `/`, attempt a direct `execve`.
- Else iterate PATH entries, build candidate paths, and test execute permission.
- On success, `execve` the resolved path; on failure, map errno.
Minimal concrete example
char *argv[] = {"ls", "-l", NULL};
execvp(argv[0], argv); // uses PATH automatically; returns only on error
perror("execvp");
Common misconceptions
- “argv[0] doesn’t matter” -> many programs read argv[0] for mode.
- “PATH search is simple” -> empty elements mean current directory.
- “execvp handles errors” -> you must map errno to shell status.
Check-your-understanding questions
- When should the shell skip PATH search?
- What is the meaning of an empty PATH element?
- Why is `argv` required to be NULL-terminated?
Check-your-understanding answers
- When the command contains a `/` path separator.
- It refers to the current directory.
- `execve` uses NULL to know where the argument list ends.
Real-world applications
- Any CLI launcher or shell.
- Build systems and process managers.
- Scripting environments that run external tools.
Where you’ll apply it
- In this project: see §3.2 Functional Requirements and §4.4 Data Structures.
- Also used in: P01 Minimal Command Executor, P02 Shell Lexer/Tokenizer, P07 Environment Variable Manager, P13 Tab Completion Engine
References
- POSIX Shell Command Language (command search).
- “Advanced Programming in the UNIX Environment” (exec family).
Key insights: Correct argv construction and PATH lookup prevent confusing “command not found” failures.
Summary: Tokenization and PATH search are the bridge from text input to executable code.
Homework/Exercises to practice the concept
- Implement a manual PATH search and print the resolved path.
- Test how empty PATH entries behave with `.` directories.
- Create a command with a slash and confirm PATH is ignored.
Solutions to the homework/exercises
- Split PATH on `:` and join with `/` and the command name.
- Insert empty elements and confirm lookup uses the current directory.
- Use `./program` to verify direct execution is attempted.
3. Project Specification
3.1 What You Will Build
A globbing engine that expands patterns like *.c into matching filenames.
Included:
- Core feature set described above
- Deterministic CLI behavior and exit codes
Excluded:
- Brace expansion (an optional extension).
3.2 Functional Requirements
- Requirement 1: Detect wildcard patterns in tokens.
- Requirement 2: Expand `*`, `?`, and `[]` patterns.
- Requirement 3: Respect dotfile matching rules.
- Requirement 4: Sort matches deterministically.
- Requirement 5: Leave pattern unchanged if no matches (default behavior).
3.3 Non-Functional Requirements
- Performance: Interactive latency under 50ms for typical inputs; expansion time should scale linearly with the number of directory entries scanned.
- Reliability: No crashes on malformed input; errors reported clearly with non-zero status.
- Usability: Clear prompts, deterministic behavior, and predictable error messages.
3.4 Example Usage / Output
$ ./mysh
mysh> ls *.c
lexer.c main.c parser.c
mysh> echo file?.txt
file1.txt file2.txt
mysh> echo data_[a-c].json
data_a.json data_b.json data_c.json
3.5 Data Formats / Schemas / Protocols
- Pattern tokens and expanded argv lists.
3.6 Edge Cases
- No matches
- Hidden files
- Nested directory patterns
3.7 Real World Outcome
This is the exact behavior you should be able to demonstrate.
3.7.1 How to Run (Copy/Paste)
- make
- ./mysh
3.7.2 Golden Path Demo (Deterministic)
$ ./mysh
mysh> ls *.c
lexer.c main.c parser.c
mysh> echo file?.txt
file1.txt file2.txt
mysh> echo data_[a-c].json
data_a.json data_b.json data_c.json
3.7.3 Failure Demo (Deterministic)
$ ./mysh
mysh> not_a_command
mysh> echo $?
127
4. Solution Architecture
4.1 High-Level Design
[Input] -> [Parser/Lexer] -> [Core Engine] -> [Executor/Output]
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Pattern Matcher | Matches wildcard tokens | Recursive or iterative matching. |
| Directory Walker | Lists entries and filters by pattern | Uses readdir. |
| Expander | Replaces token with matches | Preserves argv order. |
4.3 Data Structures (No Full Code)
struct GlobResult { char **matches; size_t count; };
4.4 Algorithm Overview
Key Algorithm: Segment Match
- Split by /
- Match each segment
- Combine results
Complexity Analysis:
- Time: O(n*m) typical
- Space: O(n*m) typical
5. Implementation Guide
5.1 Development Environment Setup
# install dependencies (if any)
# build
make
5.2 Project Structure
project-root/
├── src/
│ ├── main.c
│ ├── lexer.c
│ └── executor.c
├── tests/
│ └── test_basic.sh
├── Makefile
└── README.md
5.3 The Core Question You’re Answering
How does the shell transform a pattern into a concrete file list?
5.4 Concepts You Must Understand First
Stop and research these before coding:
- Pattern syntax
- Filesystem traversal
- Expansion order
5.5 Questions to Guide Your Design
5.6 Thinking Exercise
The “Hidden Files” Problem
Why doesn’t * match .bashrc by default?
5.7 The Interview Questions They’ll Ask
5.8 Hints in Layers
Hint 1: Use fnmatch
POSIX provides a tested glob matcher.
Hint 2: Skip dotfiles unless pattern starts with ‘.’
Match .* only if pattern begins with ..
Hint 3: Preserve sort order. Sort results for deterministic behavior.
Hint 4: Handle no matches. If none, leave the pattern as-is (POSIX default).
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Shell expansions | “Effective Shell” | Ch. 3 |
| Directory traversal | “The Linux Programming Interface” | Ch. 18 |
| Pattern matching | POSIX Shell Command Language | Pattern section |
5.10 Implementation Phases
Phase 1: Foundation (2-3 days)
Goals:
- Define data structures and interfaces
- Build a minimal end-to-end demo
Tasks:
- Implement the core data structures
- Build a tiny CLI or harness for manual tests
Checkpoint: A demo command runs end-to-end with clear logging.
Phase 2: Core Functionality (1 week)
Goals:
- Implement full feature set
- Validate with unit tests
Tasks:
- Implement core requirements
- Add error handling and edge cases
Checkpoint: All functional requirements pass basic tests.
Phase 3: Polish & Edge Cases (2-4 days)
Goals:
- Harden for weird inputs
- Improve UX and documentation
Tasks:
- Add edge-case tests
- Document design decisions
Checkpoint: Deterministic golden demo and clean error output.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Parsing depth | Minimal vs full | Incremental | Start small, expand safely |
| Error policy | Silent vs verbose | Verbose | Debuggability for learners |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Test individual components | Tokenizer, matcher, env builder |
| Integration Tests | Test component interactions | Full command lines |
| Edge Case Tests | Handle boundary conditions | Empty input, bad args |
6.2 Critical Test Cases
- Golden Path: Run the canonical demo and verify output.
- Failure Path: Provide invalid input and confirm error status.
- Stress Path: Run repeated commands to detect leaks or state corruption.
6.3 Test Data
input: echo hello
output: hello
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Misordered redirection | Output goes to wrong place | Apply redirections left-to-right |
| Leaked file descriptors | Commands hang waiting for EOF | Close unused fds in parent/child |
| Incorrect exit status | `&&`/`\|\|` behave wrong | Use waitpid macros correctly |
7.2 Debugging Strategies
- Trace syscalls: Use `strace`/`dtruss` to verify fork/exec/dup2 order.
- Log state transitions: Print parser states and job table changes in debug mode.
- Compare with `dash`: Run the same input in a reference shell.
7.3 Performance Traps
- Avoid O(n^2) behavior in hot paths like line editing.
- Minimize allocations inside the REPL loop.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a `help` built-in with usage docs.
- Add colored prompt themes.
8.2 Intermediate Extensions
- Add a simple profiling mode for command timing.
- Implement a `which` built-in using PATH lookup.
8.3 Advanced Extensions
- Add programmable completion or plugin system.
- Add a scriptable test harness with golden outputs.
9. Real-World Connections
9.1 Industry Applications
- Build systems: shells orchestrate compilation and test pipelines.
- DevOps automation: scripts manage deployments and infrastructure.
9.2 Related Open Source Projects
- bash: The most common interactive shell.
- dash: Minimal POSIX shell often used as /bin/sh.
- zsh: Feature-rich interactive shell.
9.3 Interview Relevance
- Process creation and lifecycle questions.
- Parsing and system programming design trade-offs.
10. Resources
10.1 Essential Reading
- “Effective Shell” by Dave Kerr - focus on the chapters relevant to this project.
- “Advanced Programming in the UNIX Environment” - process control and pipes.
10.2 Video Resources
- Unix process model lectures (any OS course).
- Compiler front-end videos for lexing/parsing projects.
10.3 Tools & Documentation
- strace/dtruss: inspect syscalls.
- man pages: `fork`, `execve`, `waitpid`, `pipe`, `dup2`.
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain the core concept without notes.
- I can trace a command through my subsystem.
- I understand at least one key design trade-off.
11.2 Implementation
- All functional requirements are met.
- All critical tests pass.
- Edge cases are handled cleanly.
11.3 Growth
- I documented lessons learned.
- I can explain this project in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Core feature works for the golden demo.
- Errors are handled with non-zero status.
- Code is readable and buildable.
Full Completion:
- All functional requirements met.
- Tests cover edge cases and failures.
Excellence (Going Above & Beyond):
- Performance benchmarks and clear documentation.
- Behavior compared against a reference shell.