Project 10: Globbing Engine
Implement globbing (wildcard expansion) for filenames with proper dotfile semantics.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate (The Developer) |
| Time Estimate | 1 week |
| Main Programming Language | C |
| Alternative Programming Languages | Rust, Go |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 1. The “Resume Gold” (Educational/Personal Brand) |
| Prerequisites | filesystem APIs, pattern matching, argv construction |
| Key Topics | wildcards, directory traversal, sorting |
1. Learning Objectives
By completing this project, you will:
- Explain and implement wildcards in the context of a shell.
- Build a working globbing engine that matches the project specification.
- Design tests that validate correctness and edge cases.
- Document design decisions, trade-offs, and limitations.
2. All Theory Needed (Per-Concept Breakdown)
Globbing and Filename Expansion
Fundamentals
Globbing expands wildcard patterns like *, ?, and [...] into matching filenames. It happens after variable expansion and before command execution. A correct globbing engine respects quoting rules (quoted wildcards do not expand), handles dotfiles appropriately, and returns a sorted list of matches. If no matches are found, shells either leave the pattern unchanged or report an error depending on options. Globbing is essential for user productivity and for scripts that operate on file sets.
Deep Dive into the concept
Globbing is essentially pattern matching over directory entries. The simplest wildcard, *, matches any sequence of characters (including empty). ? matches exactly one character. Character classes like [abc] match any of the listed characters, and ranges like [a-z] match any character in that range. Some shells support negated classes ([!a-z]) and brace expansion ({foo,bar}), but brace expansion is usually a separate pre-globbing step. Your engine should at least implement the core POSIX wildcard rules.
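To make the core rules concrete, here is a minimal recursive matcher sketch in C. It covers `*`, `?`, and simple `[...]`/`[!...]` classes with ranges; it ignores locale classes, backslash escaping, and the dotfile rule (which belongs at the segment level). The name `wc_match` is illustrative, not a standard API:

```c
#include <stdbool.h>

/* Sketch of a POSIX-style wildcard matcher: `*`, `?`, and `[...]`. */
static bool wc_match(const char *pat, const char *s)
{
    if (*pat == '\0')
        return *s == '\0';

    if (*pat == '*') {
        /* `*` matches any sequence, including the empty one. */
        for (;; s++) {
            if (wc_match(pat + 1, s))
                return true;
            if (*s == '\0')
                return false;
        }
    }

    if (*s == '\0')
        return false;

    if (*pat == '[') {
        const char *p = pat + 1;
        bool negate = (*p == '!');
        if (negate)
            p++;
        bool hit = false;
        for (; *p && *p != ']'; p++) {
            if (p[1] == '-' && p[2] && p[2] != ']') {   /* range like a-z */
                if (*s >= *p && *s <= p[2])
                    hit = true;
                p += 2;
            } else if (*p == *s) {
                hit = true;
            }
        }
        if (*p != ']')   /* unterminated class: treat '[' as a literal */
            return *s == '[' && wc_match(pat + 1, s + 1);
        return (hit != negate) && wc_match(p + 1, s + 1);
    }

    if (*pat == '?' || *pat == *s)
        return wc_match(pat + 1, s + 1);

    return false;
}
```

The naive `*` recursion is exponential in pathological cases; a production matcher would use an iterative backtracking loop, but the recursive form mirrors the rules most directly.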
Matching is not just string matching; it is filesystem traversal. The pattern may include path separators, so you need to apply matching per path segment. For example, src/*.c should match src directory entries that end in .c but should not traverse further. A standard approach is to split the pattern by / and recursively match each segment against the entries of the current directory. If a segment contains no wildcard characters, you can simply append it to the path and continue; if it does contain wildcards, you scan the directory and test each entry. This approach avoids full directory recursion when it is not needed.
Dotfiles (names beginning with .) are special. In POSIX shells, wildcards do not match dotfiles unless the pattern explicitly begins with a dot. This is why * does not include .git but .* does. Your pattern matcher must implement this rule at the segment level, not globally. Another subtle rule is that . and .. should be excluded from matches unless explicitly requested; failing to do this can cause dangerous expansions in scripts.
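The segment-level rule can be sketched as a thin wrapper around POSIX `fnmatch(3)`, with flags set to 0 so the dotfile decision stays in our own code; `glob_match_entry` is an illustrative name:

```c
#include <fnmatch.h>
#include <stdbool.h>
#include <string.h>

/* Sketch of the segment-level dotfile rule. Raw wildcard matching is
 * delegated to fnmatch(3); the policy decisions live here. */
static bool glob_match_entry(const char *pat, const char *name)
{
    /* "." and ".." are excluded unless the pattern names them literally. */
    if (strcmp(name, ".") == 0 || strcmp(name, "..") == 0)
        return strcmp(pat, name) == 0;

    /* A name starting with '.' requires a pattern starting with '.'. */
    if (name[0] == '.' && pat[0] != '.')
        return false;

    return fnmatch(pat, name, 0) == 0;
}
```

Note that `fnmatch` also offers `FNM_PERIOD` to enforce the leading-dot rule for you, but keeping the check explicit makes the `.`/`..` exclusion easy to see and adjust.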
Sorting is also expected. Most shells return glob matches in lexicographic order. This order matters for scripts that rely on predictable expansion. If you return matches in filesystem order, results will appear random across runs. Implement a sort using strcmp or locale-aware comparison if you want to be fully POSIX-compliant.
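The sorting step is a one-liner with `qsort` and `strcmp` (swap in `strcoll` for locale-aware ordering); the function names below are illustrative:

```c
#include <stdlib.h>
#include <string.h>

/* Lexicographic sort of glob matches, as shells do in the C locale. */
static int cmp_name(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

static void sort_matches(char **matches, size_t count)
{
    qsort(matches, count, sizeof *matches, cmp_name);
}
```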
If a pattern matches nothing, behavior depends on options. POSIX shells leave the pattern unchanged (so echo *.foo prints *.foo if no matches exist). Some shells offer options like nullglob (return nothing) or failglob (treat as error). For your project, define a default and document it clearly. The expansion engine should be designed to add optional behaviors later.
How this fits into the project: Globbing feeds into argument vector construction and must respect quoting and expansion order.
Definitions & key terms
- Glob: Wildcard pattern expansion to filenames.
- Wildcard: `*`, `?`, or `[]` pattern character.
- Segment: Path component between `/` separators.
- Dotfile rule: Wildcards do not match dotfiles unless the pattern starts with `.`.
Mental model diagram
pattern: src/*.c
split -> ["src", "*.c"] -> list entries in src -> match *.c
How it works (step-by-step)
- Detect tokens containing wildcard characters.
- Split pattern into path segments.
- For each segment, list directory entries and filter by pattern.
- Combine matches across segments (recursive expansion).
- Sort results and return them as argv entries.
Minimal concrete example
Input: echo src/*.c
Output: echo src/main.c src/util.c
Common misconceptions
- “`*` matches dotfiles” -> not unless the pattern starts with `.`.
- “Globbing is regex” -> glob syntax is simpler and different.
- “Order doesn’t matter” -> shells sort matches.
Check-your-understanding questions
- Why does `*` not match `.git` by default?
- How should `src/*/*.c` be expanded?
- What happens when no files match a glob?
Check-your-understanding answers
- POSIX rule: dotfiles require explicit dot in pattern.
- Recursively match each segment, listing subdirectories.
- Usually the pattern remains unchanged unless options say otherwise.
Real-world applications
- File management scripts.
- Build systems that compile sets of files.
- CLI tooling that accepts wildcard arguments.
Where you’ll apply it
- In this project: see §3.5 Data Formats and §5.10 Phase 2.
- Also used in: None
References
- POSIX Shell Command Language (pattern matching).
- “The Linux Programming Interface” (directory traversal).
Key insights: Globbing is recursive directory matching, not regex, and it must respect dotfile rules.
Summary: A correct globbing engine expands patterns into sorted file lists while preserving quoting and dotfile semantics.
Homework/Exercises to practice the concept
- Write a function to match a single wildcard pattern against a string.
- Extend it to traverse directories for `*/` patterns.
- Add sorting and dotfile handling.
Solutions to the homework/exercises
- Implement `*` and `?` matching with recursion.
- Split by `/` and recursively match segments.
- Filter dotfiles unless the pattern begins with `.`, and sort results.
Argument Vector Construction and PATH Lookup
Fundamentals
Before a shell can execute a command, it must convert a line of text into an argument vector (argv) and locate the program on disk. This seems simple, but it is the glue between parsing and execution. The first word becomes the command name, and the remaining words become arguments passed to the program. For external commands without a slash, the shell must search each directory in the PATH environment variable, build candidate paths, and test whether they are executable. A correct implementation handles empty path elements, . in PATH, and permission errors. Even minimal shells depend on correct argv construction and path resolution.
Deep Dive into the concept
The execve() system call expects two critical inputs: argv, an array of strings where argv[0] is the program name, and envp, an array of environment variables. Building argv is easy only if you ignore quoting, escaping, and expansions; for a minimal executor, you may split on whitespace, but the code should still produce a valid, NULL-terminated array. In a full shell, argv construction happens after expansions, quote removal, and field splitting. The order matters: for example, quoted strings should not be split by whitespace, and globbing should expand to multiple argv entries. Even in a minimal shell, you should treat consecutive spaces as a single separator and preserve the token order exactly.
PATH lookup is equally subtle. If the command contains a /, the shell must treat it as a path and attempt to execute directly. If it does not, the shell searches the colon-separated list in PATH. Each element can be empty; an empty element means “current directory.” When you iterate over PATH, you must build dir + "/" + cmd carefully, handle trailing slashes, and check execute permissions with access(path, X_OK) or by attempting execve and checking errno. The shell should distinguish between “not found” and “not executable”: POSIX uses exit status 127 for missing commands and 126 for found but non-executable commands. For errors like ENOEXEC (text file without shebang), the shell may choose to run it with /bin/sh or return an error, depending on your design scope.
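The search loop described here could be sketched as follows; `path_lookup` is a hypothetical helper, and the 126/127 mapping follows the POSIX convention mentioned above:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Manual PATH search (sketch). Fills `out` and returns 0 when an
 * executable candidate is found; otherwise returns a shell-style
 * status: 127 "not found", 126 "found but not executable". */
static int path_lookup(const char *cmd, char *out, size_t outsz)
{
    if (strchr(cmd, '/')) {              /* explicit path: skip search */
        snprintf(out, outsz, "%s", cmd);
        if (access(out, X_OK) == 0)
            return 0;
        return access(out, F_OK) == 0 ? 126 : 127;
    }

    const char *path = getenv("PATH");
    if (!path)
        path = "/usr/bin:/bin";          /* assumed fallback */
    int found_nonexec = 0;

    const char *p = path;
    for (;;) {
        const char *colon = strchr(p, ':');
        size_t len = colon ? (size_t)(colon - p) : strlen(p);
        /* An empty element means the current directory. */
        snprintf(out, outsz, "%.*s%s%s", (int)len, p,
                 len ? "/" : "./", cmd);
        if (access(out, X_OK) == 0)
            return 0;                    /* found and executable */
        if (access(out, F_OK) == 0)
            found_nonexec = 1;           /* exists but not runnable */
        if (!colon)
            break;
        p = colon + 1;
    }
    return found_nonexec ? 126 : 127;
}
```

Note that `access` checks the real UID, not the effective one, so a full shell would attempt `execve` and inspect `errno` instead; the sketch keeps the simpler check for clarity.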
Another detail is the difference between execvp() and manual PATH search. execvp() searches PATH for you, but you still need to handle error mapping and messaging. If you implement the search manually, you can produce more informative diagnostics, record the resolved path, and implement hashing to cache results. Shells like bash maintain a hash table of command lookups to avoid repeated directory scans, invalidating the cache when PATH changes. While your minimal executor may skip hashing, you should structure the code so it can be added later.
Finally, argument vector construction is not just string splitting; it is a data-structure problem. You must allocate storage for each token, store them in a contiguous char* array, and ensure the array is NULL-terminated. You also need to decide ownership and lifetimes: who frees the strings after exec or built-in handling? A consistent memory strategy will prevent leaks and double frees. Even a small shell benefits from a “command struct” that owns argv, the original line, and any metadata such as redirections.
How this fits into the project: This concept connects tokenization to execution and appears in every project that launches commands or resolves filenames.
Definitions & key terms
- argv: Argument vector passed to `execve()`.
- PATH: Colon-separated list of directories used for command lookup.
- Shebang: `#!` line that selects an interpreter for a script.
- X_OK: Permission check for executability.
Mental model diagram
"ls -l /tmp" -> tokens -> argv[] -> PATH search -> /bin/ls -> execve
How it works (step-by-step)
- Tokenize line into words.
- Build `argv` array and append a NULL terminator.
- If the command contains `/`, attempt a direct `execve`.
- Else iterate PATH entries, build candidate paths, and test execute permission.
- On success, `execve` the resolved path; on failure, map errno.
Minimal concrete example
char *argv[] = {"ls", "-l", NULL};
execvp(argv[0], argv); // uses PATH automatically; returns only on error
perror("execvp");
Common misconceptions
- “argv[0] doesn’t matter” -> many programs read argv[0] for mode.
- “PATH search is simple” -> empty elements mean current directory.
- “execvp handles errors” -> you must map errno to shell status.
Check-your-understanding questions
- When should the shell skip PATH search?
- What is the meaning of an empty PATH element?
- Why is `argv` required to be NULL-terminated?
Check-your-understanding answers
- When the command contains a `/` path separator.
- It refers to the current directory.
- `execve` uses NULL to know where the argument list ends.
Real-world applications
- Any CLI launcher or shell.
- Build systems and process managers.
- Scripting environments that run external tools.
Where you’ll apply it
- In this project: see §3.2 Functional Requirements and §4.4 Data Structures.
- Also used in: P01 Minimal Command Executor, P02 Shell Lexer/Tokenizer, P07 Environment Variable Manager, P13 Tab Completion Engine
References
- POSIX Shell Command Language (command search).
- “Advanced Programming in the UNIX Environment” (exec family).
Key insights: Correct argv construction and PATH lookup prevent confusing “command not found” failures.
Summary: Tokenization and PATH search are the bridge from text input to executable code.
Homework/Exercises to practice the concept
- Implement a manual PATH search and print the resolved path.
- Test how empty PATH entries behave with `.` directories.
- Create a command with a slash and confirm PATH is ignored.
Solutions to the homework/exercises
- Split PATH on `:` and join with `/` and the command name.
- Insert empty elements and confirm lookup uses the current directory.
- Use `./program` to verify direct execution is attempted.
3. Project Specification
3.1 What You Will Build
A globbing engine that expands patterns like *.c into matching filenames.
Included:
- Core feature set described above
- Deterministic CLI behavior and exit codes
Excluded:
- Brace expansion (an optional extension).
3.2 Functional Requirements
- Requirement 1: Detect wildcard patterns in tokens.
- Requirement 2: Expand `*`, `?`, and `[]` patterns.
- Requirement 3: Respect dotfile matching rules.
- Requirement 4: Sort matches deterministically.
- Requirement 5: Leave pattern unchanged if no matches (default behavior).
3.3 Non-Functional Requirements
- Performance: Interactive latency under 50ms for typical inputs; expansion time should scale linearly with the number of directory entries scanned.
- Reliability: No crashes on malformed input; errors reported clearly with non-zero status.
- Usability: Clear prompts, deterministic behavior, and predictable error messages.
3.4 Example Usage / Output
$ ./mysh
mysh> ls *.c
lexer.c main.c parser.c
mysh> echo file?.txt
file1.txt file2.txt
mysh> echo data_[a-c].json
data_a.json data_b.json data_c.json
3.5 Data Formats / Schemas / Protocols
- Pattern tokens and expanded argv lists.
3.6 Edge Cases
- No matches
- Hidden files
- Nested directory patterns
3.7 Real World Outcome
This is the exact behavior you should be able to demonstrate.
3.7.1 How to Run (Copy/Paste)
- make
- ./mysh
3.7.2 Golden Path Demo (Deterministic)
$ ./mysh
mysh> ls *.c
lexer.c main.c parser.c
mysh> echo file?.txt
file1.txt file2.txt
mysh> echo data_[a-c].json
data_a.json data_b.json data_c.json
3.7.3 Failure Demo (Deterministic)
$ ./mysh
mysh> not_a_command
mysh> echo $?
127
4. Solution Architecture
4.1 High-Level Design
[Input] -> [Parser/Lexer] -> [Core Engine] -> [Executor/Output]
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Pattern Matcher | Matches wildcard tokens | Recursive or iterative matching. |
| Directory Walker | Lists entries and filters by pattern | Uses readdir. |
| Expander | Replaces token with matches | Preserves argv order. |
4.3 Data Structures (No Full Code)
struct GlobResult { char **matches; size_t count; };
4.4 Algorithm Overview
Key Algorithm: Segment Match
- Split by /
- Match each segment
- Combine results
Complexity Analysis:
- Time: O(n*m) typical
- Space: O(n*m) typical
5. Implementation Guide
5.1 Development Environment Setup
# install dependencies (if any)
# build
make
5.2 Project Structure
project-root/
├── src/
│ ├── main.c
│ ├── lexer.c
│ └── executor.c
├── tests/
│ └── test_basic.sh
├── Makefile
└── README.md
5.3 The Core Question You’re Answering
How does the shell transform a pattern into a concrete file list?
5.4 Concepts You Must Understand First
Stop and research these before coding:
- Pattern syntax
- Filesystem traversal
- Expansion order
5.5 Questions to Guide Your Design
5.6 Thinking Exercise
The “Hidden Files” Problem
Why doesn’t * match .bashrc by default?
5.7 The Interview Questions They’ll Ask
5.8 Hints in Layers
Hint 1: Use fnmatch
POSIX provides a tested glob matcher.
Hint 2: Skip dotfiles unless pattern starts with ‘.’
Match .* only if pattern begins with ..
Hint 3: Preserve sort order. Sort results for deterministic behavior.
Hint 4: Handle no matches. If none, leave the pattern as-is (POSIX default).
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Shell expansions | “Effective Shell” | Ch. 3 |
| Directory traversal | “The Linux Programming Interface” | Ch. 18 |
| Pattern matching | POSIX Shell Command Language | Pattern section |
5.10 Implementation Phases
Phase 1: Foundation (2-3 days)
Goals:
- Define data structures and interfaces
- Build a minimal end-to-end demo
Tasks:
- Implement the core data structures
- Build a tiny CLI or harness for manual tests
Checkpoint: A demo command runs end-to-end with clear logging.
Phase 2: Core Functionality (1 week)
Goals:
- Implement full feature set
- Validate with unit tests
Tasks:
- Implement core requirements
- Add error handling and edge cases
Checkpoint: All functional requirements pass basic tests.
Phase 3: Polish & Edge Cases (2-4 days)
Goals:
- Harden for weird inputs
- Improve UX and documentation
Tasks:
- Add edge-case tests
- Document design decisions
Checkpoint: Deterministic golden demo and clean error output.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Parsing depth | Minimal vs full | Incremental | Start small, expand safely |
| Error policy | Silent vs verbose | Verbose | Debuggability for learners |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Test individual components | Tokenizer, matcher, env builder |
| Integration Tests | Test component interactions | Full command lines |
| Edge Case Tests | Handle boundary conditions | Empty input, bad args |
6.2 Critical Test Cases
- Golden Path: Run the canonical demo and verify output.
- Failure Path: Provide invalid input and confirm error status.
- Stress Path: Run repeated commands to detect leaks or state corruption.
6.3 Test Data
input: echo hello
output: hello
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Misordered redirection | Output goes to wrong place | Apply redirections left-to-right |
| Leaked file descriptors | Commands hang waiting for EOF | Close unused fds in parent/child |
| Incorrect exit status | `&&`/`\|\|` behave wrong | Use waitpid macros correctly |
7.2 Debugging Strategies
- Trace syscalls: Use `strace`/`dtruss` to verify fork/exec/dup2 order.
- Log state transitions: Print parser states and job table changes in debug mode.
- Compare with `dash`: Run the same input in a reference shell.
7.3 Performance Traps
- Avoid O(n^2) behavior in hot paths like line editing.
- Minimize allocations inside the REPL loop.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a `help` built-in with usage docs.
- Add colored prompt themes.
8.2 Intermediate Extensions
- Add a simple profiling mode for command timing.
- Implement a `which` built-in using PATH lookup.
8.3 Advanced Extensions
- Add programmable completion or plugin system.
- Add a scriptable test harness with golden outputs.
9. Real-World Connections
9.1 Industry Applications
- Build systems: shells orchestrate compilation and test pipelines.
- DevOps automation: scripts manage deployments and infrastructure.
9.2 Related Open Source Projects
- bash: The most common interactive shell.
- dash: Minimal POSIX shell often used as /bin/sh.
- zsh: Feature-rich interactive shell.
9.3 Interview Relevance
- Process creation and lifecycle questions.
- Parsing and system programming design trade-offs.
10. Resources
10.1 Essential Reading
- “Effective Shell” by Dave Kerr - focus on the chapters relevant to this project.
- “Advanced Programming in the UNIX Environment” - process control and pipes.
10.2 Video Resources
- Unix process model lectures (any OS course).
- Compiler front-end videos for lexing/parsing projects.
10.3 Tools & Documentation
- strace/dtruss: inspect syscalls.
- man pages: `fork`, `execve`, `waitpid`, `pipe`, `dup2`.
10.4 Related Projects in This Series
11. Self-Assessment Checklist
11.1 Understanding
- I can explain the core concept without notes.
- I can trace a command through my subsystem.
- I understand at least one key design trade-off.
11.2 Implementation
- All functional requirements are met.
- All critical tests pass.
- Edge cases are handled cleanly.
11.3 Growth
- I documented lessons learned.
- I can explain this project in an interview.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Core feature works for the golden demo.
- Errors are handled with non-zero status.
- Code is readable and buildable.
Full Completion:
- All functional requirements met.
- Tests cover edge cases and failures.
Excellence (Going Above & Beyond):
- Performance benchmarks and clear documentation.
- Behavior compared against a reference shell.