Project 2: Shell Lexer/Tokenizer

A lexer that breaks shell input like echo "hello world" | grep -i hello > output.txt into a stream of typed tokens: WORD, PIPE, REDIRECT, DQUOTE_STRING, etc.

Quick Reference

Attribute               Value
Primary Language        C
Alternative Languages   Rust, OCaml, Python
Difficulty              Level 2: Intermediate (The Developer)
Time Estimate           1 week
Knowledge Area          Compilers / Lexical Analysis
Tooling                 Shell Parser
Prerequisites           Project 1, basic understanding of state machines

What You Will Build

A lexer that reads a shell command line such as echo "hello world" | grep -i hello > output.txt from standard input and breaks it into a stream of typed tokens (WORD, PIPE, REDIRECT_OUT, SQUOTE_STRING, DQUOTE_STRING, and so on), printing one token per line as shown under Real-World Outcome below.
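
A reasonable internal representation is an enum of token kinds plus the token text. The sketch below mirrors the labels used in the example output further down; the exact set of kinds and the fixed-size text buffer are illustrative choices, not requirements.

/* One possible token representation; kind names mirror the example output below. */
typedef enum {
    TOK_WORD,           /* unquoted text: echo, grep, -i, output.txt */
    TOK_PIPE,           /* | */
    TOK_REDIRECT_OUT,   /* >  (add TOK_REDIRECT_APPEND if you support >>) */
    TOK_REDIRECT_IN,    /* < */
    TOK_SQUOTE_STRING,  /* 'single-quoted text', kept as one token */
    TOK_DQUOTE_STRING,  /* "double-quoted text", kept as one token */
    TOK_EOF             /* end of input */
} TokenType;

typedef struct {
    TokenType type;
    char      text[256]; /* token text with the surrounding quotes stripped */
} Token;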

Why It Matters

Lexical analysis is the first stage of every shell, interpreter, and compiler. The techniques this project exercises (state machines, character lookahead, and careful quoting and escaping rules) appear repeatedly in real-world systems and tooling, from configuration and log parsers to protocol decoders.

Core Challenges

  • Handling different quote types (single vs. double vs. backtick) → maps to lexer states (see the sketch after this list)
  • Recognizing operators (|, >, >>, <, &&, ||, ;) → maps to token classification
  • Escape character handling (backslash) → maps to character lookahead
  • Distinguishing operators from text (> vs -> vs file>) → maps to context-sensitive lexing
  • Preserving whitespace inside quotes ("hello world" is one token) → maps to state management
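
Most of these challenges reduce to a small state machine keyed on the current quoting context. Below is a minimal sketch, assuming the lexer scans a NUL-terminated line and copies token text into a caller-supplied buffer; the function name, the 256-byte buffer, and the simplified backslash rule are illustrative, not prescribed.

#include <stdio.h>

/* Lexer states: the quoting context decides how each character is treated. */
typedef enum { ST_DEFAULT, ST_IN_SQUOTE, ST_IN_DQUOTE } LexState;

/* Scan one quoted region starting at s (s[0] is the opening quote).
 * Copies the contents without the quotes into out and returns the number of
 * input characters consumed, or -1 on an unterminated or overlong quote. */
static int scan_quoted(const char *s, char *out, size_t outsz)
{
    char quote = s[0];                        /* ' or " */
    LexState st = (quote == '\'') ? ST_IN_SQUOTE : ST_IN_DQUOTE;
    size_t i = 1, o = 0;

    while (s[i] != '\0') {
        if (st == ST_IN_DQUOTE && s[i] == '\\' && s[i + 1] != '\0') {
            out[o++] = s[i + 1];              /* simplified: POSIX only escapes $ ` " \ and newline here */
            i += 2;
        } else if (s[i] == quote) {
            out[o] = '\0';
            return (int)(i + 1);              /* opening quote + body + closing quote */
        } else {
            out[o++] = s[i++];                /* single quotes: every character is literal */
        }
        if (o + 1 >= outsz)
            return -1;                        /* token too long for this sketch */
    }
    return -1;                                /* unterminated quote */
}

int main(void)
{
    char buf[256];
    int n = scan_quoted("\"hello world\" | grep hello", buf, sizeof buf);
    if (n > 0)
        printf("Token[DQUOTE_STRING]: %s (consumed %d chars)\n", buf, n);
    return 0;
}

Keeping ST_IN_SQUOTE and ST_IN_DQUOTE as separate states matters because single quotes take everything literally, while the double-quote state needs one character of lookahead to handle backslashes.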

Key Concepts

  • Lexer design patterns: “Language Implementation Patterns” Chapter 2 - Terence Parr
  • State machines for lexing: “Compilers: Principles and Practice” Chapter 3 - Parag H. Dave & Himanshu B. Dave
  • Shell quoting rules: POSIX Shell Command Language, Section 2.2 (Quoting) - The Open Group

Real-World Outcome

$ echo 'echo "hello world" | grep hello' | ./shell_lexer
Token[WORD]: echo
Token[DQUOTE_STRING]: hello world
Token[PIPE]: |
Token[WORD]: grep
Token[WORD]: hello
$ echo "ls -la > 'my file.txt'" | ./shell_lexer
Token[WORD]: ls
Token[WORD]: -la
Token[REDIRECT_OUT]: >
Token[SQUOTE_STRING]: my file.txt
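
Producing output in this format for the operators takes only one character of lookahead, which is enough to tell > from >>. A self-contained sketch follows; the REDIRECT_APPEND label for >> is an assumption, since the example above only shows REDIRECT_OUT.

#include <stdio.h>

/* Classify a pipe/redirection operator at *p using one character of lookahead,
 * print it in the Token[...] format shown above, and return how many input
 * characters it consumed (0 if *p is not an operator). */
static int emit_operator(const char *p)
{
    switch (p[0]) {
    case '|':
        printf("Token[PIPE]: |\n");
        return 1;
    case '<':
        printf("Token[REDIRECT_IN]: <\n");
        return 1;
    case '>':
        if (p[1] == '>') {                    /* lookahead: >> is one token, not two */
            printf("Token[REDIRECT_APPEND]: >>\n");
            return 2;
        }
        printf("Token[REDIRECT_OUT]: >\n");
        return 1;
    default:
        return 0;
    }
}

int main(void)
{
    const char *input = "> >> | <";
    for (const char *p = input; *p != '\0'; ) {
        if (*p == ' ') { p++; continue; }     /* skip whitespace between operators */
        int n = emit_operator(p);
        p += (n > 0) ? n : 1;                 /* non-operators would be lexed as WORDs */
    }
    return 0;
}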

Implementation Guide

  1. Reproduce the simplest happy-path scenario: split an unquoted command like echo hello into WORD tokens.
  2. Build the smallest working version of the core feature: operator tokens plus single- and double-quoted strings.
  3. Add input validation and error handling (see the sketch after this list).
  4. Add instrumentation/logging to confirm behavior.
  5. Refactor into clean modules with tests.
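
For step 3, the most common lexing error is an unterminated quote. Below is a minimal sketch of that error path; the message wording and the choice of exit code 2 are suggestions, not requirements.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Return the index of the closing quote matching line[start], or -1 if the
 * line ends before the quote is closed. */
static long find_closing_quote(const char *line, size_t start)
{
    char quote = line[start];
    for (size_t i = start + 1; line[i] != '\0'; i++)
        if (line[i] == quote)
            return (long)i;
    return -1;
}

int main(void)
{
    const char *line = "echo \"unterminated";     /* sample bad input */
    const char *q = strchr(line, '"');

    if (q != NULL && find_closing_quote(line, (size_t)(q - line)) < 0) {
        fprintf(stderr, "shell_lexer: unterminated \" starting at column %ld\n",
                (long)(q - line) + 1);
        exit(2);                                  /* nonzero exit for lexing errors */
    }
    return 0;
}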

Milestones

  • Milestone 1: Minimal working program that runs end-to-end.
  • Milestone 2: Correct outputs for typical inputs.
  • Milestone 3: Robust handling of edge cases.
  • Milestone 4: Clean structure and documented usage.

Validation Checklist

  • Output matches the real-world outcome example
  • Handles invalid inputs safely
  • Provides clear errors and exit codes
  • Repeatable results across runs

References

  • Main guide: SHELL_INTERNALS_DEEP_DIVE_PROJECTS.md
  • “Language Implementation Patterns” by Terence Parr