Project 3: Shell Parser (AST Builder)

Build an AST parser that respects shell precedence and produces executable command trees.

Quick Reference

Attribute Value
Difficulty Level 3: Advanced (The Engineer)
Time Estimate 1-2 weeks
Main Programming Language C
Alternative Programming Languages Rust, OCaml, Haskell
Coolness Level Level 3: Genuinely Clever
Business Potential 1. The “Resume Gold” (Educational/Personal Brand)
Prerequisites Recursive descent parsing, token streams, shell operator precedence
Key Topics AST design, precedence, redirection attachment

1. Learning Objectives

By completing this project, you will:

  1. Explain and implement AST design in the context of a shell.
  2. Build a working shell parser (AST builder) that matches the project specification.
  3. Design tests that validate correctness and edge cases.
  4. Document design decisions, trade-offs, and limitations.

2. All Theory Needed (Per-Concept Breakdown)

Parsing, Precedence, and AST Construction

Fundamentals Parsing transforms a linear stream of tokens into a structured representation that reflects operator precedence and command grouping. Shell syntax is deceptively complex: pipelines bind tighter than logical operators, and lists separated by ; or & are lower precedence than && and ||. The parser must also handle compound commands like if, while, and subshells, and integrate redirections that can appear anywhere in a simple command. A well-designed Abstract Syntax Tree (AST) makes execution straightforward because each node type encodes a specific execution strategy.

Deep Dive into the concept Shell parsing is closer to a compiler front-end than to a simple expression parser. A robust approach is recursive descent with functions that mirror the grammar: parse_list handles sequences, parse_and_or handles && and ||, parse_pipeline handles |, and parse_command handles simple commands and compound constructs. This structure enforces precedence naturally: parse_pipeline is called inside parse_and_or, so pipelines bind tighter than logical operators. Associativity matters too; pipelines and logical operators are typically left-associative, meaning a | b | c becomes a pipeline node with three children rather than nested pipelines that reverse order.

Simple commands are particularly tricky because redirections can be interleaved with words: cmd arg1 > out arg2 < in is legal and should yield a command node with arguments [cmd, arg1, arg2] plus a list of redirections. The parser must therefore accept redirection tokens at any point in a simple command, collect them, and keep the argument order intact. If your parser instead treats redirection as terminating a command, you will mis-handle valid inputs.

Compound commands add complexity. if, while, for, and case introduce nested lists of commands, often with keywords that also appear as regular words in other contexts. The parser must decide when a token like do or then is a keyword versus just a word. Most shells restrict keyword recognition to specific grammar positions, which means the parser’s current nonterminal determines whether a token is treated as a keyword. This is a subtle but essential rule.

Error handling is a first-class concern. An interactive shell should distinguish incomplete input (missing fi, unmatched )) from invalid input. Incomplete input should trigger a continuation prompt; invalid input should report a syntax error and recover to the next command boundary. This requires the parser to track expected tokens and provide meaningful diagnostics. A good parser reports the unexpected token and the context (e.g., “unexpected | after &&”). This is not just UX; it helps you debug your own grammar.

The AST is more than a tree; it is the interface between parsing and execution. Each node type (SimpleCommand, Pipeline, AndOr, List, Subshell, If, While) should hold only what execution needs: arguments, redirections, children, and control flags. If the parser builds a clean AST, the executor can be a straightforward tree walk with predictable behavior. If the AST is messy or ambiguous, execution becomes a tangle of special cases.

How this fits into the project Parsing underpins all control flow and pipeline semantics. Without a correct AST, execution will be wrong even if your runtime is perfect.

Definitions & key terms

  • AST: Abstract Syntax Tree representing command structure.
  • Precedence: Rules that determine grouping of operators.
  • Associativity: Direction of grouping for operators of equal precedence.
  • Nonterminal: A grammar production like pipeline or and_or.

Mental model diagram

Tokens --> Parser --> AST

  a | b && c
       |
       v
     AndOr
     /    \
Pipeline   c
   /  \
  a    b

How it works (step-by-step)

  1. Consume tokens with a recursive descent parser.
  2. Build nodes for pipelines, logical operators, and lists.
  3. Attach redirections to simple commands as a list.
  4. Produce AST root for execution phase.
  5. On error, report unexpected token and recover if interactive.

Minimal concrete example

Input:  a | b && c
AST:    AndOr(Pipeline(a,b), c)

Common misconceptions

  • “Redirections only come at the end” -> they can appear anywhere.
  • “| and && have same precedence” -> pipeline binds tighter.
  • “Keywords are always keywords” -> context decides keyword meaning.

Check-your-understanding questions

  1. Why must a | b && c parse as (a | b) && c?
  2. Where should redirections be stored in the AST?
  3. How should the parser react to an unfinished if block?

Check-your-understanding answers

  1. Because pipeline has higher precedence than &&.
  2. Attached to the simple command node as a list.
  3. Mark input incomplete and request continuation.

Real-world applications

  • Shells and REPLs.
  • Configuration languages with command syntax.
  • DSLs that combine operators and commands.

Where you’ll apply it

  • In this project’s parser and in any later project that executes the AST.

References

  • POSIX Shell Command Language (grammar).
  • “Language Implementation Patterns” (recursive descent).

Key insights The parser shapes execution; a correct AST makes execution trivial.

Summary Parsing is the structural heart of a shell: it enforces precedence, groups commands, and defines control flow for the executor.

Homework/Exercises to practice the concept

  1. Write a tiny parser for pipelines and &&/|| and dump the AST.
  2. Add support for redirections in simple commands.
  3. Build a syntax error reporter that shows the unexpected token.

Solutions to the homework/exercises

  1. Use recursive descent with parse_pipeline inside parse_and_or.
  2. Allow redirection tokens anywhere in a simple command loop.
  3. Track the last token and expected set, print a clear message.

Lexing, Quoting, and Token Boundaries

Fundamentals Shell lexing is the process of turning raw input characters into tokens like words, operators, and redirections. Unlike many languages, shell lexing is context-sensitive: the meaning of a character depends on quoting state, escapes, and sometimes prior tokens. Spaces normally separate tokens, but inside single quotes they are literal, while inside double quotes they allow some expansions but still preserve whitespace. Backslashes can escape characters, and operators like |, &&, ||, ;, and > must be recognized as separate tokens unless quoted. A correct lexer is the gatekeeper for all later stages; if it merges or splits tokens incorrectly, the parser and executor will see a different command than the user intended.

Deep Dive into the concept Shell lexing is a state machine rather than a simple splitter. The lexer typically has at least four states: normal, single-quoted, double-quoted, and escape. In normal mode, whitespace ends a token, and operators start new tokens. In single quotes, everything is literal until the next single quote, meaning even backslashes and dollar signs are treated as ordinary characters. In double quotes, many characters are literal, but $, backticks, and backslash still have special meaning. The lexer must preserve enough information for later phases: for example, it should either emit “quoted” tokens or mark characters as quoted so that field splitting and globbing can respect them.

Redirection operators complicate lexing because they can be multi-character (>>, <<, >&) and can appear adjacent to words without spaces (2>file). A robust lexer treats 2> as a redirection operator with an explicit file descriptor, not as a word token. This requires the lexer to recognize numeric prefixes and operator sequences. The lexer also needs to handle comments: in POSIX shells, a # begins a comment only when it appears where a new token could start, not inside quotes. This is a subtle but important rule if you want POSIX-like behavior.

Incomplete input is another concern. If the user enters an unmatched quote or a trailing backslash, the lexer should report “incomplete” rather than “invalid” and request more input. This is a key UX feature of real shells. A lexer designed with explicit states can detect these conditions and tell the REPL to prompt with a continuation indicator (like > ) rather than failing. Handling these states early makes the parser’s job far easier, because the parser can assume the token stream is structurally complete.

The output of lexing is not just a list of strings; it is a list of typed tokens with metadata. A typical token structure might include a token type (WORD, PIPE, REDIR, AND_IF), the raw text, and a set of flags indicating which characters were quoted. This metadata determines later expansion phases. For example, if a token was inside single quotes, variable expansion must not occur; if a token is in double quotes, expansions occur but field splitting does not split on whitespace. Maintaining this metadata is the difference between a toy lexer and a usable one.

Finally, lexing must be deterministic and stable across platforms. Users expect the same input line to be tokenized the same way every time. That means your lexer should not depend on locale or encoding details unless you explicitly support them. Keeping the lexer purely ASCII with well-defined rules is a good starting point; you can expand to Unicode or locale-aware behavior later.

How this fits into the project Lexing is the front-end of your shell; it determines the correctness of parsing, expansion, and execution in every project that deals with user input.

Definitions & key terms

  • Token: A typed unit produced by the lexer.
  • Quoting state: The current rule set (normal, single, double, escape).
  • Operator token: Tokens like |, &&, ||, ;, >.
  • Comment: A # that causes the rest of the line to be ignored.

Mental model diagram

chars --> [state machine] --> tokens
               |
   states: normal, single-quoted,
           double-quoted, escape

How it works (step-by-step)

  1. Start in normal state with an empty token buffer.
  2. For each character, update state (enter/exit quotes, handle backslash).
  3. If whitespace in normal state, finalize the current token.
  4. If operator in normal state, emit operator token.
  5. At end-of-line, emit the final token or mark input incomplete.

Minimal concrete example

Input:  echo "a b" | grep b
Tokens: WORD(echo), WORD(a b [quoted]), PIPE(|), WORD(grep), WORD(b)

Common misconceptions

  • “Quotes are removed during lexing” -> quote removal happens after expansion.
  • “Backslash always escapes” -> backslash meaning differs by state.
  • “Whitespace always splits words” -> not inside quotes.

Check-your-understanding questions

  1. Why does single-quoted text disable variable expansion?
  2. How should 2>file be tokenized?
  3. When is # treated as a comment in POSIX shell?

Check-your-understanding answers

  1. Single quotes make all characters literal by definition.
  2. As a redirection operator with fd=2 and target file.
  3. Only when # appears where a new token could start.

Real-world applications

  • Shell input parsing and scripting.
  • Config languages with quoting rules.
  • Command-line interpreters in tools.

Where you’ll apply it

  • In the lexing stage that feeds this project’s parser.

References

  • POSIX Shell Command Language (quoting rules).
  • “Language Implementation Patterns” (lexing strategies).

Key insights Lexing is a state machine that preserves quoting intent, not just strings.

Summary A correct lexer handles quotes, escapes, operators, and comments with clear state transitions, enabling correct parsing and expansion later.

Homework/Exercises to practice the concept

  1. Implement a lexer that outputs token types and quoting flags.
  2. Feed it tricky inputs like echo "a b"'c' and compare outputs.
  3. Add detection for incomplete quotes and continuation prompts.

Solutions to the homework/exercises

  1. Use a state enum and emit tokens with metadata fields.
  2. Verify quoted segments remain in the same token with flags.
  3. If state is not normal at end-of-line, request more input.

3. Project Specification

3.1 What You Will Build

A parser that turns tokens into an AST with pipelines, logical operators, and simple commands.

Included:

  • Core feature set described above
  • Deterministic CLI behavior and exit codes

Excluded:

  • No execution engine; no job control.

3.2 Functional Requirements

  1. Requirement 1: Parse pipelines, &&, ||, and command lists.
  2. Requirement 2: Attach redirections to simple command nodes.
  3. Requirement 3: Detect and report syntax errors with context.
  4. Requirement 4: Support grouping constructs like subshell parentheses where applicable.
  5. Requirement 5: Provide an AST printer for debugging.

3.3 Non-Functional Requirements

  • Performance: Interactive latency under 50ms for typical inputs; parsing should scale linearly with input length.
  • Reliability: No crashes on malformed input; errors reported clearly with non-zero status.
  • Usability: Clear prompts, deterministic behavior, and predictable error messages.

3.4 Example Usage / Output

Command Line Outcome Example:

$ echo 'cd /tmp && ls | grep foo > out.txt' | ./shell_parser
AND_IF
  +-- SIMPLE: cd /tmp
  +-- PIPE
      +-- SIMPLE: ls
      +-- SIMPLE: grep foo
          +-- REDIRECT_OUT: out.txt

3.5 Data Formats / Schemas / Protocols

  • AST nodes with type tags and child pointers.

3.6 Edge Cases

  • Operator at line start
  • Missing command after pipe
  • Redirection without target

3.7 Real World Outcome

This is the exact behavior you should be able to demonstrate.

3.7.1 How to Run (Copy/Paste)

  • make
  • ./shell_parser < test_inputs.txt

3.7.2 Golden Path Demo (Deterministic)

$ echo 'cd /tmp && ls | grep foo > out.txt' | ./shell_parser
AND_IF
  +-- SIMPLE: cd /tmp
  +-- PIPE
      +-- SIMPLE: ls
      +-- SIMPLE: grep foo
          +-- REDIRECT_OUT: out.txt

3.7.3 Failure Demo (Deterministic)

$ echo 'ls |' | ./shell_parser
syntax error: unexpected end of input after '|'
$ echo $?
2

4. Solution Architecture

4.1 High-Level Design

[Input line] -> [Lexer] -> [Parser] -> [AST] -> [AST Printer]

4.2 Key Components

Component Responsibility Key Decisions
Parser Recursive descent by precedence level Clear grammar mapping.
AST Nodes Represent commands and control operators Minimal fields for execution.
Error Reporter Diagnostics and recovery Useful for interactive use.

4.3 Data Structures (No Full Code)

struct ASTNode { enum NodeType type; char **argv; struct Redir *redirs; struct ASTNode **children; size_t nchildren; };

4.4 Algorithm Overview

Key Algorithm: Recursive Descent

  1. parse_list -> parse_and_or -> parse_pipeline -> parse_command

Complexity Analysis:

  • Time: O(n) — each token is consumed exactly once
  • Space: O(n) for the AST, plus recursion depth proportional to nesting

5. Implementation Guide

5.1 Development Environment Setup

# install dependencies (if any)
# build
make

5.2 Project Structure

project-root/
├── src/
│   ├── main.c
│   ├── lexer.c
│   ├── parser.c
│   └── ast.c
├── tests/
│   └── test_basic.sh
├── Makefile
└── README.md

5.3 The Core Question You’re Answering

How do you turn a flat token stream into an executable command tree?

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. Recursive descent parsing
  2. Operator precedence
  3. AST design

5.5 Questions to Guide Your Design

  • How will each node type own and free its children?
  • Where do redirections live on a simple command node?
  • How does the parser signal incomplete versus invalid input?

5.6 Thinking Exercise

The “Precedence Trap”

Draw the AST for:

false && true || echo ok | wc -l

Questions:

  • Which operator binds first?
  • Which commands are in the pipeline?

5.7 The Interview Questions They’ll Ask

  • Why does a | b && c parse as (a | b) && c?
  • How would you attach redirections in a recursive descent parser?
  • How does a shell tell incomplete input from a syntax error?

5.8 Hints in Layers

Hint 1: Start with a grammar Write grammar rules for list, and_or, pipeline, command.

Hint 2: One function per rule Each function consumes tokens and returns an AST node.

Hint 3: Attach redirections as you parse Parse redirections in the simple command rule.

Hint 4: Add error recovery On error, skip tokens until newline or ;.

5.9 Books That Will Help

Topic Book Chapter
Parsing “Language Implementation Patterns” Ch. 3-4
AST design “Engineering a Compiler” Ch. 5
Shell grammar POSIX Shell Command Language Grammar sections

5.10 Implementation Phases

Phase 1: Foundation (2-3 days)

Goals:

  • Define data structures and interfaces
  • Build a minimal end-to-end demo

Tasks:

  1. Implement the core data structures
  2. Build a tiny CLI or harness for manual tests

Checkpoint: A demo command runs end-to-end with clear logging.

Phase 2: Core Functionality (1 week)

Goals:

  • Implement full feature set
  • Validate with unit tests

Tasks:

  1. Implement core requirements
  2. Add error handling and edge cases

Checkpoint: All functional requirements pass basic tests.

Phase 3: Polish & Edge Cases (2-4 days)

Goals:

  • Harden for weird inputs
  • Improve UX and documentation

Tasks:

  1. Add edge-case tests
  2. Document design decisions

Checkpoint: Deterministic golden demo and clean error output.

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
Parsing depth Minimal vs full Incremental Start small, expand safely
Error policy Silent vs verbose Verbose Debuggability for learners

6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Unit Tests Test individual components Lexer states, precedence nesting
Integration Tests Test component interactions Full command lines
Edge Case Tests Handle boundary conditions Empty input, bad args

6.2 Critical Test Cases

  1. Golden Path: Run the canonical demo and verify output.
  2. Failure Path: Provide invalid input and confirm error status.
  3. Stress Path: Run repeated commands to detect leaks or state corruption.

6.3 Test Data

input:  echo hello
output: SIMPLE: echo hello

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
Wrong precedence nesting a | b && c groups incorrectly Call parse_pipeline from inside parse_and_or
Keyword misrecognition then/do parsed as ordinary words Recognize keywords only in valid grammar positions
Dropped redirections cmd a > out b loses the second argument Accept redirections anywhere in a simple command

7.2 Debugging Strategies

  • Dump the AST: Print the tree after every parse in debug mode.
  • Log state transitions: Trace which grammar rule consumed each token.
  • Compare with dash: Run the same input through dash -n to check what a reference shell accepts.

7.3 Performance Traps

  • Avoid O(n^2) behavior in hot paths like token scanning and string appends.
  • Minimize allocations inside the parse loop; reuse token buffers where possible.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a help built-in with usage docs.
  • Add colored prompt themes.

8.2 Intermediate Extensions

  • Add a simple profiling mode for command timing.
  • Implement a which built-in using PATH lookup.

8.3 Advanced Extensions

  • Add programmable completion or plugin system.
  • Add a scriptable test harness with golden outputs.

9. Real-World Connections

9.1 Industry Applications

  • Build systems: shells orchestrate compilation and test pipelines.
  • DevOps automation: scripts manage deployments and infrastructure.

9.2 Reference Implementations

  • bash: The most common interactive shell.
  • dash: Minimal POSIX shell often used as /bin/sh.
  • zsh: Feature-rich interactive shell.

9.3 Interview Relevance

  • Process creation and lifecycle questions.
  • Parsing and system programming design trade-offs.

10. Resources

10.1 Essential Reading

  • “Language Implementation Patterns” by Terence Parr - focus on the chapters relevant to this project.
  • “Advanced Programming in the UNIX Environment” - process control and pipes.

10.2 Video Resources

  • Unix process model lectures (any OS course).
  • Compiler front-end videos for lexing/parsing projects.

10.3 Tools & Documentation

  • strace/dtruss: inspect syscalls.
  • man pages: fork, execve, waitpid, pipe, dup2.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain the core concept without notes.
  • I can trace a command through my subsystem.
  • I understand at least one key design trade-off.

11.2 Implementation

  • All functional requirements are met.
  • All critical tests pass.
  • Edge cases are handled cleanly.

11.3 Growth

  • I documented lessons learned.
  • I can explain this project in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Core feature works for the golden demo.
  • Errors are handled with non-zero status.
  • Code is readable and buildable.

Full Completion:

  • All functional requirements met.
  • Tests cover edge cases and failures.

Excellence (Going Above & Beyond):

  • Performance benchmarks and clear documentation.
  • Behavior compared against a reference shell.