Project 2: Shell Lexer/Tokenizer

Implement a lexer that tokenizes shell input with quoting and operator awareness.

Quick Reference

Attribute Value
Difficulty Level 2: Intermediate (The Developer)
Time Estimate 1 week
Main Programming Language C
Alternative Programming Languages Rust, OCaml, Python
Coolness Level Level 3: Genuinely Clever
Business Potential 1. The “Resume Gold” (Educational/Personal Brand)
Prerequisites C strings, finite state machines, shell quoting basics
Key Topics lexing states, quoting, operator tokens

1. Learning Objectives

By completing this project, you will:

  1. Explain and implement lexing states in the context of a shell.
  2. Build a working shell lexer/tokenizer that matches the project specification.
  3. Design tests that validate correctness and edge cases.
  4. Document design decisions, trade-offs, and limitations.

2. All Theory Needed (Per-Concept Breakdown)

Lexing, Quoting, and Token Boundaries

Fundamentals Shell lexing is the process of turning raw input characters into tokens like words, operators, and redirections. Unlike many languages, shell lexing is context-sensitive: the meaning of a character depends on quoting state, escapes, and sometimes prior tokens. Spaces normally separate tokens, but inside single quotes they are literal, while inside double quotes they allow some expansions but still preserve whitespace. Backslashes can escape characters, and operators like |, &&, ||, ;, and > must be recognized as separate tokens unless quoted. A correct lexer is the gatekeeper for all later stages; if it merges or splits tokens incorrectly, the parser and executor will see a different command than the user intended.

Deep Dive into the concept Shell lexing is a state machine rather than a simple splitter. The lexer typically has at least four states: normal, single-quoted, double-quoted, and escape. In normal mode, whitespace ends a token, and operators start new tokens. In single quotes, everything is literal until the next single quote, meaning even backslashes and dollar signs are treated as ordinary characters. In double quotes, many characters are literal, but $, backticks, and backslash still have special meaning. The lexer must preserve enough information for later phases: for example, it should either emit “quoted” tokens or mark characters as quoted so that field splitting and globbing can respect them.

Redirection operators complicate lexing because they can be multi-character (>>, <<, >&) and can appear adjacent to words without spaces (2>file). A robust lexer treats 2> as a redirection operator with an explicit file descriptor, not as a word token. This requires the lexer to recognize numeric prefixes and operator sequences. The lexer also needs to handle comments: in POSIX shells, a # begins a comment only when it appears where a new token could start, not inside quotes. This is a subtle but important rule if you want POSIX-like behavior.
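The greedy-matching and fd-prefix rules described above can be sketched in C. The token kind names and function names here are illustrative, not part of the project spec; multi-character operators are tried before single-character ones, and a digit run counts as an fd prefix only when an angle bracket follows it immediately.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical operator kinds for illustration. */
typedef enum { OP_NONE, OP_PIPE, OP_AND_IF, OP_OR_IF, OP_REDIR_OUT,
               OP_REDIR_APPEND, OP_REDIR_IN, OP_HEREDOC } OpKind;

/* Match the longest operator at s[0]; store its length in *len.
   Multi-character operators (>>, <<, &&, ||) are tried before
   single-character ones -- the "greedy" rule. */
OpKind match_operator(const char *s, int *len) {
    if (strncmp(s, ">>", 2) == 0) { *len = 2; return OP_REDIR_APPEND; }
    if (strncmp(s, "<<", 2) == 0) { *len = 2; return OP_HEREDOC; }
    if (strncmp(s, "&&", 2) == 0) { *len = 2; return OP_AND_IF; }
    if (strncmp(s, "||", 2) == 0) { *len = 2; return OP_OR_IF; }
    if (*s == '|') { *len = 1; return OP_PIPE; }
    if (*s == '>') { *len = 1; return OP_REDIR_OUT; }
    if (*s == '<') { *len = 1; return OP_REDIR_IN; }
    *len = 0;
    return OP_NONE;
}

/* A digit run is an fd prefix only when a < or > follows immediately,
   e.g. "2>file" -> fd 2 plus a redirection operator.  Returns the fd,
   and the digit count via *digits, or -1 if this is an ordinary word. */
int redir_fd_prefix(const char *s, int *digits) {
    int i = 0, fd = 0;
    while (isdigit((unsigned char)s[i])) { fd = fd * 10 + (s[i] - '0'); i++; }
    if (i > 0 && (s[i] == '<' || s[i] == '>')) { *digits = i; return fd; }
    *digits = 0;
    return -1;
}
```

With this shape, the lexer can check `redir_fd_prefix` first at each word start, then fall back to `match_operator` or ordinary word accumulation.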

Incomplete input is another concern. If the user enters an unmatched quote or a trailing backslash, the lexer should report “incomplete” rather than “invalid” and request more input. This is a key UX feature of real shells. A lexer designed with explicit states can detect these conditions and tell the REPL to prompt with a continuation indicator (like > ) rather than failing. Handling these states early makes the parser’s job far easier, because the parser can assume the token stream is structurally complete.

The output of lexing is not just a list of strings; it is a list of typed tokens with metadata. A typical token structure might include a token type (WORD, PIPE, REDIR, AND_IF), the raw text, and a set of flags indicating which characters were quoted. This metadata determines later expansion phases. For example, if a token was inside single quotes, variable expansion must not occur; if a token is in double quotes, expansions occur but field splitting does not split on whitespace. Maintaining this metadata is the difference between a toy lexer and a usable one.
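A concrete shape for such typed tokens, with one flag byte per character, might look like this. The names and flag layout are one possible design, not mandated by the spec:

```c
#include <stddef.h>

/* Illustrative token representation: flags[i] records how text[i] was
   quoted, so later expansion phases can decide whether to expand $VAR
   or split on whitespace. */
enum TokenType { TOK_WORD, TOK_PIPE, TOK_REDIR, TOK_AND_IF };

enum QuoteFlag {
    QF_NONE   = 0,
    QF_SINGLE = 1 << 0,  /* char came from inside '...' : fully literal  */
    QF_DOUBLE = 1 << 1,  /* char came from inside "..." : no field split */
    QF_ESCAPE = 1 << 2   /* char was backslash-escaped                   */
};

struct Token {
    enum TokenType type;
    char *text;            /* token text after quote processing */
    unsigned char *flags;  /* flags[i] describes text[i]        */
    size_t len;
};

/* True if any character of the token was quoted in some way; such a
   token must not be subject to pathname globbing, for example. */
int token_any_quoted(const struct Token *t) {
    for (size_t i = 0; i < t->len; i++)
        if (t->flags[i] != QF_NONE) return 1;
    return 0;
}
```

Per-character flags cost one byte per input byte but make later phases trivial to write correctly; a coarser per-token flag is cheaper but loses information for mixed tokens like `a"b"c`.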

Finally, lexing must be deterministic and stable across platforms. Users expect the same input line to be tokenized the same way every time. That means your lexer should not depend on locale or encoding details unless you explicitly support them. Keeping the lexer purely ASCII with well-defined rules is a good starting point; you can expand to Unicode or locale-aware behavior later.

How this fits into the projects Lexing is the front end of your shell; it determines the correctness of parsing, expansion, and execution in every project that deals with user input.

Definitions & key terms

  • Token: A typed unit produced by the lexer.
  • Quoting state: The current rule set (normal, single, double, escape).
  • Operator token: Tokens like |, &&, ||, ;, >.
  • Comment: A # at a possible token start that causes the rest of the line to be ignored.

Mental model diagram

chars --> [state machine] --> tokens
   ^         |  |  |
   |         v  v  v
 normal   single double escape

How it works (step-by-step)

  1. Start in normal state with an empty token buffer.
  2. For each character, update state (enter/exit quotes, handle backslash).
  3. If whitespace in normal state, finalize the current token.
  4. If operator in normal state, emit operator token.
  5. At end-of-line, emit the final token or mark input incomplete.
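The steps above can be sketched as a minimal, word-only lexer in C. This is a simplified sketch: operators are omitted, the 64-byte token buffer is unchecked, and backslash inside double quotes escapes everything (a real shell escapes only a few characters there). The function name is illustrative.

```c
#include <string.h>

typedef enum { ST_NORMAL, ST_SQUOTE, ST_DQUOTE, ST_ESCAPE } State;

/* Split a line into words, honoring quotes and escapes.  A non-NORMAL
   state at end of input means "incomplete" (unmatched quote or trailing
   backslash).  Returns the token count, or -1 if input is incomplete. */
int lex_words(const char *line, char out[][64], int max_tokens) {
    State st = ST_NORMAL, prev = ST_NORMAL;
    char buf[64];
    int blen = 0, ntok = 0, in_token = 0;

    for (const char *p = line; *p; p++) {
        char c = *p;
        if (st == ST_ESCAPE) { buf[blen++] = c; st = prev; continue; }
        if (st == ST_SQUOTE) {                  /* everything literal */
            if (c == '\'') st = ST_NORMAL; else buf[blen++] = c;
            continue;
        }
        if (st == ST_DQUOTE) {
            if (c == '"') st = ST_NORMAL;
            else if (c == '\\') { prev = ST_DQUOTE; st = ST_ESCAPE; }
            else buf[blen++] = c;               /* whitespace preserved */
            continue;
        }
        /* ST_NORMAL */
        if (c == ' ' || c == '\t') {            /* token boundary */
            if (in_token && ntok < max_tokens) {
                buf[blen] = '\0';
                strcpy(out[ntok++], buf);
                blen = 0; in_token = 0;
            }
        } else if (c == '\'') { st = ST_SQUOTE; in_token = 1; }
        else if (c == '"')    { st = ST_DQUOTE; in_token = 1; }
        else if (c == '\\')   { prev = ST_NORMAL; st = ST_ESCAPE; in_token = 1; }
        else { buf[blen++] = c; in_token = 1; }
    }
    if (st != ST_NORMAL) return -1;             /* incomplete input */
    if (in_token && ntok < max_tokens) {        /* final token */
        buf[blen] = '\0';
        strcpy(out[ntok++], buf);
    }
    return ntok;
}
```

Note how the -1 return maps directly onto the continuation-prompt behavior: the REPL can keep reading lines until the lexer reports a complete token stream.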

Minimal concrete example

Input:  echo "a b" | grep b
Tokens: WORD(echo), WORD(a b [quoted]), PIPE(|), WORD(grep), WORD(b)

Common misconceptions

  • “Quotes are removed during lexing” -> quote removal happens after expansion.
  • “Backslash always escapes” -> backslash meaning differs by state.
  • “Whitespace always splits words” -> not inside quotes.

Check-your-understanding questions

  1. Why does single-quoted text disable variable expansion?
  2. How should 2>file be tokenized?
  3. When is # treated as a comment in POSIX shell?

Check-your-understanding answers

  1. Single quotes make all characters literal by definition.
  2. As a redirection operator with fd=2 and target file.
  3. Only when # appears where a new token could start.

Real-world applications

  • Shell input parsing and scripting.
  • Config languages with quoting rules.
  • Command-line interpreters in tools.

Where you’ll apply it

In this project's lexer, and in every later shell stage (parsing, expansion, execution) that consumes its token stream.

References

  • POSIX Shell Command Language (quoting rules).
  • “Language Implementation Patterns” (lexing strategies).

Key insights Lexing is a state machine that preserves quoting intent, not just strings.

Summary A correct lexer handles quotes, escapes, operators, and comments with clear state transitions, enabling correct parsing and expansion later.

Homework/Exercises to practice the concept

  1. Implement a lexer that outputs token types and quoting flags.
  2. Feed it tricky inputs like echo "a b"'c' and compare outputs.
  3. Add detection for incomplete quotes and continuation prompts.

Solutions to the homework/exercises

  1. Use a state enum and emit tokens with metadata fields.
  2. Verify quoted segments remain in the same token with flags.
  3. If state is not normal at end-of-line, request more input.

Argument Vector Construction and PATH Lookup

Fundamentals Before a shell can execute a command, it must convert a line of text into an argument vector (argv) and locate the program on disk. This seems simple, but it is the glue between parsing and execution. The first word becomes the command name, and the remaining words become arguments passed to the program. For external commands without a slash, the shell must search each directory in the PATH environment variable, build candidate paths, and test whether they are executable. A correct implementation handles empty path elements, . in PATH, and permission errors. Even minimal shells depend on correct argv construction and path resolution.

Deep Dive into the concept Besides the program path, the execve() system call expects two critical inputs: argv, an array of strings where argv[0] is the program name, and envp, an array of environment variables. Building argv is easy only if you ignore quoting, escaping, and expansions; for a minimal executor, you may split on whitespace, but the code should still produce a valid, NULL-terminated array. In a full shell, argv construction happens after expansions, quote removal, and field splitting. The order matters: for example, quoted strings should not be split by whitespace, and globbing should expand to multiple argv entries. Even in a minimal shell, you should treat consecutive spaces as a single separator and preserve the token order exactly.

PATH lookup is equally subtle. If the command contains a /, the shell must treat it as a path and attempt to execute directly. If it does not, the shell searches the colon-separated list in PATH. Each element can be empty; an empty element means “current directory.” When you iterate over PATH, you must build dir + "/" + cmd carefully, handle trailing slashes, and check execute permissions with access(path, X_OK) or by attempting execve and checking errno. The shell should distinguish between “not found” and “not executable”: POSIX uses exit status 127 for missing commands and 126 for found but non-executable commands. For errors like ENOEXEC (text file without shebang), the shell may choose to run it with /bin/sh or return an error, depending on your design scope.
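The search loop described above can be sketched as follows. The function name is illustrative, and for brevity this version folds "found but not executable" into the 127 return; a fuller version would distinguish 126 from 127 as described in the text.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Manual PATH search sketch.  If cmd contains a slash it is used
   directly; otherwise each colon-separated PATH element is tried, with
   an empty element meaning the current directory.  Writes the resolved
   path into `resolved` and returns 0 on success, 127 otherwise. */
int path_lookup(const char *cmd, char *resolved, size_t cap) {
    if (strchr(cmd, '/')) {                       /* slash: no search */
        snprintf(resolved, cap, "%s", cmd);
        return access(resolved, X_OK) == 0 ? 0 : 127;
    }
    const char *path = getenv("PATH");
    if (path == NULL) path = "";
    for (const char *p = path; ; ) {
        const char *colon = strchr(p, ':');
        size_t dlen = colon ? (size_t)(colon - p) : strlen(p);
        if (dlen == 0)                            /* empty element = cwd */
            snprintf(resolved, cap, "./%s", cmd);
        else
            snprintf(resolved, cap, "%.*s/%s", (int)dlen, p, cmd);
        if (access(resolved, X_OK) == 0)
            return 0;
        if (colon == NULL) break;                 /* last element tried */
        p = colon + 1;
    }
    return 127;                                   /* command not found */
}
```

Recording the resolved path, as this sketch does, is exactly what makes later diagnostics and a lookup cache possible.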

Another detail is the difference between execvp() and manual PATH search. execvp() searches PATH for you, but you still need to handle error mapping and messaging. If you implement the search manually, you can produce more informative diagnostics, record the resolved path, and implement hashing to cache results. Shells like bash maintain a hash table of command lookups to avoid repeated directory scans, invalidating the cache when PATH changes. While your minimal executor may skip hashing, you should structure the code so it can be added later.

Finally, argument vector construction is not just string splitting; it is a data-structure problem. You must allocate storage for each token, store them in a contiguous char* array, and ensure the array is NULL-terminated. You also need to decide ownership and lifetimes: who frees the strings after exec or built-in handling? A consistent memory strategy will prevent leaks and double frees. Even a small shell benefits from a “command struct” that owns argv, the original line, and any metadata such as redirections.
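One possible ownership scheme is sketched below (names are illustrative): the command struct owns deep copies of its argv strings, so a single destroy call frees everything exactly once.

```c
#include <stdlib.h>
#include <string.h>

/* The Command owns every argv string and the array itself. */
struct Command {
    char **argv;   /* NULL-terminated, heap-allocated */
    int argc;
};

/* Copy a string into fresh heap storage owned by the caller. */
static char *dup_str(const char *s) {
    char *d = malloc(strlen(s) + 1);
    strcpy(d, s);
    return d;
}

struct Command *command_from_words(char *const words[], int n) {
    struct Command *c = malloc(sizeof *c);
    c->argv = malloc((n + 1) * sizeof *c->argv);
    for (int i = 0; i < n; i++)
        c->argv[i] = dup_str(words[i]);   /* command owns its copies */
    c->argv[n] = NULL;                    /* execve needs the NULL   */
    c->argc = n;
    return c;
}

void command_destroy(struct Command *c) {
    for (int i = 0; i < c->argc; i++)
        free(c->argv[i]);
    free(c->argv);
    free(c);
}
```

Because the struct owns its strings, built-in handling and exec failure paths can both call `command_destroy` without tracking who allocated what.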

How this fits into the projects This concept connects tokenization to execution and appears in every project that launches commands or resolves filenames.

Definitions & key terms

  • argv: Argument vector passed to execve().
  • PATH: Colon-separated list of directories used for command lookup.
  • Shebang: #! line that selects an interpreter for a script.
  • X_OK: Permission check for executability.

Mental model diagram

"ls -l /tmp" -> tokens -> argv[] -> PATH search -> /bin/ls -> execve

How it works (step-by-step)

  1. Tokenize line into words.
  2. Build argv array and append NULL terminator.
  3. If command contains /, attempt direct execve.
  4. Else iterate PATH entries, build candidate paths, test execute.
  5. On success, execve the resolved path; on failure, map errno.

Minimal concrete example

#include <unistd.h>

char *argv[] = {"ls", "-l", NULL};
execvp(argv[0], argv); // searches PATH automatically; returns only on error

Common misconceptions

  • “argv[0] doesn’t matter” -> many programs read argv[0] for mode.
  • “PATH search is simple” -> empty elements mean current directory.
  • “execvp handles errors” -> you must map errno to shell status.

Check-your-understanding questions

  1. When should the shell skip PATH search?
  2. What is the meaning of an empty PATH element?
  3. Why is argv required to be NULL-terminated?

Check-your-understanding answers

  1. When the command contains a / path separator.
  2. It refers to the current directory.
  3. execve uses NULL to know where the argument list ends.

Real-world applications

  • Any CLI launcher or shell.
  • Build systems and process managers.
  • Scripting environments that run external tools.

Where you’ll apply it

In the executor stages of this series, wherever a command line must become a running process with the correct arguments.

References

  • POSIX Shell Command Language (command search).
  • “Advanced Programming in the UNIX Environment” (exec family).

Key insights Correct argv construction and PATH lookup prevent confusing “command not found” failures.

Summary Tokenization and PATH search are the bridge from text input to executable code.

Homework/Exercises to practice the concept

  1. Implement a manual PATH search and print the resolved path.
  2. Test how empty PATH entries behave with . directories.
  3. Create a command with a slash and confirm PATH is ignored.

Solutions to the homework/exercises

  1. Split PATH on : and join with / and the command name.
  2. Insert empty elements and confirm lookup uses current directory.
  3. Use ./program to verify direct execution is attempted.

3. Project Specification

3.1 What You Will Build

A shell lexer that emits tokens with types and quoting metadata for later parsing.

Included:

  • Core feature set described above
  • Deterministic CLI behavior and exit codes

Excluded:

  • No AST parsing; no execution.

3.2 Functional Requirements

  1. Requirement 1: Tokenize words, operators, and redirections.
  2. Requirement 2: Track quoting state (normal, single, double, escape).
  3. Requirement 3: Preserve token boundaries for operators like |, &&, ||, >.
  4. Requirement 4: Detect incomplete input (unterminated quotes or escapes).
  5. Requirement 5: Expose a token stream with type and raw text.

3.3 Non-Functional Requirements

  • Performance: Interactive latency under 50ms for typical inputs; tokenization should scale linearly with input length.
  • Reliability: No crashes on malformed input; errors reported clearly with non-zero status.
  • Usability: Deterministic behavior and predictable, clearly positioned error messages.

3.4 Example Usage / Output

What you will see:

  1. A token stream that preserves words, operators, and quoted strings.
  2. Accurate position metadata for error reporting.
  3. Predictable handling of whitespace and comments.

Command Line Outcome Example:

$ echo 'echo "hello world" | grep -i hello > out.txt' | ./shell_lexer
Token[WORD]         echo
Token[DQUOTE]       hello world
Token[PIPE]         |
Token[WORD]         grep
Token[WORD]         -i
Token[WORD]         hello
Token[REDIRECT_OUT] >
Token[WORD]         out.txt

3.5 Data Formats / Schemas / Protocols

  • Token stream: list of {type, text, quoted_flags}.

3.6 Edge Cases

  • Unclosed quotes
  • Escaped newline
  • Redirection without spaces
  • Comment handling

3.7 Real World Outcome

This is the exact behavior you should be able to demonstrate.

3.7.1 How to Run (Copy/Paste)

  • make
  • ./shell_lexer < test_inputs.txt

3.7.2 Golden Path Demo (Deterministic)

$ echo 'echo "hello world" | grep -i hello > out.txt' | ./shell_lexer
Token[WORD]         echo
Token[DQUOTE]       hello world
Token[PIPE]         |
Token[WORD]         grep
Token[WORD]         -i
Token[WORD]         hello
Token[REDIRECT_OUT] >
Token[WORD]         out.txt

3.7.3 Failure Demo (Deterministic)

$ echo 'echo "unterminated' | ./shell_lexer
error: line 1: unterminated double quote (input incomplete)
$ echo $?
1

4. Solution Architecture

4.1 High-Level Design

[Input line] -> [Lexer State Machine] -> [Token Builder] -> [Token Stream]

4.2 Key Components

Component Responsibility Key Decisions
State Machine Tracks quoting and escape states Explicit states simplify correctness.
Token Builder Accumulates and emits tokens Avoids copying by using slices or buffers.
Operator Detector Recognizes multi-char operators Greedy matching of &&, ||, >>.

4.3 Data Structures (No Full Code)

struct Token { enum TokenType type; char *text; unsigned flags; };

4.4 Algorithm Overview

Key Algorithm: Stateful Lexing

  1. Iterate characters
  2. Update state
  3. Emit tokens on boundaries

Complexity Analysis:

  • Time: O(n), a single pass over the input
  • Space: O(n) for the token stream and buffers

5. Implementation Guide

5.1 Development Environment Setup

# install dependencies (if any)
# build
make

5.2 Project Structure

project-root/
├── src/
│   ├── main.c
│   ├── lexer.c
│   └── token.c
├── tests/
│   └── test_basic.sh
├── Makefile
└── README.md

5.3 The Core Question You’re Answering

How do you divide shell input into meaningful units without losing quoting semantics?

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. Finite state machines
  2. Shell quoting rules
  3. Operator precedence in lexing

5.5 Questions to Guide Your Design

  • How will tokens carry quoting information: per-character flags or separate quoted segments?
  • What should the lexer return for incomplete input (unterminated quote, trailing backslash)?
  • Where does position metadata (line, column) live, and who reports errors?

5.6 Thinking Exercise

The “Ambiguous Operator” Problem

Trace tokenization for:

echo a||b | grep ">>" > out

Questions:

  • Which > are operators vs literal text?
  • How do you ensure || is not split into | + |?

5.7 The Interview Questions They’ll Ask

  • Why is shell lexing context-sensitive when most language lexers are not?
  • How would you tokenize 2>file, and why is it not a single word?
  • How do you detect and report an unterminated quote?

5.8 Hints in Layers

Hint 1: Use a state enum

typedef enum { NORMAL, IN_SQUOTE, IN_DQUOTE, ESCAPE } State;

Hint 2: Greedy operator matching Match >>, ||, && before single-character operators.

Hint 3: Preserve raw text Keep the original lexeme so expansion can later respect quotes.

Hint 4: Attach position metadata Store line/column for better error messages.

5.9 Books That Will Help

Topic Book Chapter
Lexer patterns “Language Implementation Patterns” Ch. 2
State machines “Compilers: Principles and Practice” Ch. 3
Shell quoting POSIX Shell Command Language Quoting sections

5.10 Implementation Phases

Phase 1: Foundation (2-3 days)

Goals:

  • Define data structures and interfaces
  • Build a minimal end-to-end demo

Tasks:

  1. Implement the core data structures
  2. Build a tiny CLI or harness for manual tests

Checkpoint: A demo command runs end-to-end with clear logging.

Phase 2: Core Functionality (1 week)

Goals:

  • Implement full feature set
  • Validate with unit tests

Tasks:

  1. Implement core requirements
  2. Add error handling and edge cases

Checkpoint: All functional requirements pass basic tests.

Phase 3: Polish & Edge Cases (2-4 days)

Goals:

  • Harden for weird inputs
  • Improve UX and documentation

Tasks:

  1. Add edge-case tests
  2. Document design decisions

Checkpoint: Deterministic golden demo and clean error output.

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
Parsing depth Minimal vs full Incremental Start small, expand safely
Error policy Silent vs verbose Verbose Debuggability for learners

6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Unit Tests Test individual components Tokenizer, matcher, env builder
Integration Tests Test component interactions Full command lines
Edge Case Tests Handle boundary conditions Empty input, bad args

6.2 Critical Test Cases

  1. Golden Path: Run the canonical demo and verify output.
  2. Failure Path: Provide invalid input and confirm error status.
  3. Stress Path: Run repeated commands to detect leaks or state corruption.

6.3 Test Data

input:  echo "a b" | grep b
output: WORD(echo) WORD(a b [quoted]) PIPE(|) WORD(grep) WORD(b)

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
Split multi-char operators || behaves like two pipes Match &&, ||, >> before single chars
Dropped quote metadata Quoted words later expanded or split Record per-character quote flags
Comment inside quotes Quoted # truncates the token Start comments only where a token could start

7.2 Debugging Strategies

  • Dump tokens: Add a debug flag that prints each token's type, text, and flags.
  • Log state transitions: Print lexer state changes (normal/quote/escape) in debug mode.
  • Compare with dash: Run the same input in a reference shell and compare behavior.

7.3 Performance Traps

  • Avoid O(n^2) behavior in hot paths, such as repeated strcat when growing a token buffer.
  • Minimize allocations inside the REPL loop.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a --dump mode that prints tokens with line/column positions.
  • Add colored token-type output for easier debugging.

8.2 Intermediate Extensions

  • Add a simple profiling mode for command timing.
  • Implement a which built-in using PATH lookup.

8.3 Advanced Extensions

  • Add programmable completion or plugin system.
  • Add a scriptable test harness with golden outputs.

9. Real-World Connections

9.1 Industry Applications

  • Build systems: shells orchestrate compilation and test pipelines.
  • DevOps automation: scripts manage deployments and infrastructure.

9.2 Reference Shells

  • bash: The most common interactive shell.
  • dash: Minimal POSIX shell often used as /bin/sh.
  • zsh: Feature-rich interactive shell.

9.3 Interview Relevance

  • Process creation and lifecycle questions.
  • Parsing and system programming design trade-offs.

10. Resources

10.1 Essential Reading

  • “Language Implementation Patterns” by Terence Parr - focus on the chapters relevant to this project.
  • “Advanced Programming in the UNIX Environment” - process control and pipes.

10.2 Video Resources

  • Unix process model lectures (any OS course).
  • Compiler front-end videos for lexing/parsing projects.

10.3 Tools & Documentation

  • strace/dtruss: inspect syscalls.
  • man pages: fork, execve, waitpid, pipe, dup2.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain the core concept without notes.
  • I can trace a command through my subsystem.
  • I understand at least one key design trade-off.

11.2 Implementation

  • All functional requirements are met.
  • All critical tests pass.
  • Edge cases are handled cleanly.

11.3 Growth

  • I documented lessons learned.
  • I can explain this project in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Core feature works for the golden demo.
  • Errors are handled with non-zero status.
  • Code is readable and buildable.

Full Completion:

  • All functional requirements met.
  • Tests cover edge cases and failures.

Excellence (Going Above & Beyond):

  • Performance benchmarks and clear documentation.
  • Behavior compared against a reference shell.