Project 2: Shell Lexer/Tokenizer
Quick Reference
| Attribute | Value |
|---|---|
| Primary Language | C |
| Alternative Languages | Rust, OCaml, Python |
| Difficulty | Level 2: Intermediate (The Developer) |
| Time Estimate | 1 week |
| Knowledge Area | Compilers / Lexical Analysis |
| Tooling | Shell Parser |
| Prerequisites | Project 1, basic understanding of state machines |
What You Will Build
A lexer that breaks shell input like echo "hello world" | grep -i hello > output.txt into a stream of typed tokens: WORD, PIPE, REDIRECT, DQUOTE_STRING, etc.
Why It Matters
Lexing is the first stage of every compiler and interpreter, and the same token-stream design reappears in config parsers, syntax highlighters, and linters. Shell quoting rules in particular force you to manage lexer state explicitly, a skill that transfers directly to real-world parsing work.
Core Challenges
- Handling different quote types (single vs double vs backtick) → maps to lexer states
- Recognizing operators (|, >, >>, <, &&, ||, ;) → maps to token classification
- Escape character handling (backslash) → maps to character lookahead
- Distinguishing operators from text (`>` vs `->` vs `file>`) → maps to context-sensitive lexing
- Preserving whitespace in quotes ("hello world" is one token) → maps to state management
Key Concepts
- Lexer design patterns: “Language Implementation Patterns” Chapter 2 - Terence Parr
- State machines for lexing: “Compilers: Principles and Practice” Chapter 3 - Dave & Dave
- Shell quoting rules: POSIX Shell Specification Section 2.2 - The Open Group
Real-World Outcome
$ echo 'echo "hello world" | grep hello' | ./shell_lexer
Token[WORD]: echo
Token[DQUOTE_STRING]: hello world
Token[PIPE]: |
Token[WORD]: grep
Token[WORD]: hello
$ echo "ls -la > 'my file.txt'" | ./shell_lexer
Token[WORD]: ls
Token[WORD]: -la
Token[REDIRECT_OUT]: >
Token[SQUOTE_STRING]: my file.txt
Implementation Guide
- Reproduce the simplest happy-path scenario.
- Build the smallest working version of the core feature.
- Add input validation and error handling.
- Add instrumentation/logging to confirm behavior.
- Refactor into clean modules with tests.
Milestones
- Milestone 1: Minimal working program that runs end-to-end.
- Milestone 2: Correct outputs for typical inputs.
- Milestone 3: Robust handling of edge cases.
- Milestone 4: Clean structure and documented usage.
Validation Checklist
- Output matches the real-world outcome example
- Handles invalid inputs safely
- Provides clear errors and exit codes
- Repeatable results across runs
References
- Main guide: SHELL_INTERNALS_DEEP_DIVE_PROJECTS.md
- "Language Implementation Patterns" by Terence Parr