Project 3: Shell Parser (AST Builder)

Build an AST parser that respects shell precedence and produces executable command trees.

Quick Reference

Attribute Value
Difficulty Level 3: Advanced (The Engineer)
Time Estimate 1-2 weeks
Main Programming Language C
Alternative Programming Languages Rust, OCaml, Haskell
Coolness Level Level 3: Genuinely Clever
Business Potential 1. The “Resume Gold” (Educational/Personal Brand)
Prerequisites Recursive descent parsing, token streams, shell operator precedence
Key Topics AST design, precedence, redirection attachment

1. Learning Objectives

By completing this project, you will:

  1. Explain and implement AST design in the context of a shell.
  2. Build a working shell parser (AST builder) that matches the project specification.
  3. Design tests that validate correctness and edge cases.
  4. Document design decisions, trade-offs, and limitations.

2. All Theory Needed (Per-Concept Breakdown)

Parsing, Precedence, and AST Construction

Fundamentals Parsing transforms a linear stream of tokens into a structured representation that reflects operator precedence and command grouping. Shell syntax is deceptively complex: pipelines bind tighter than logical operators, and lists separated by ; or & are lower precedence than && and ||. The parser must also handle compound commands like if, while, and subshells, and integrate redirections that can appear anywhere in a simple command. A well-designed Abstract Syntax Tree (AST) makes execution straightforward because each node type encodes a specific execution strategy.

Deep Dive into the concept Shell parsing is closer to a compiler front-end than to a simple expression parser. A robust approach is recursive descent with functions that mirror the grammar: parse_list handles sequences, parse_and_or handles && and ||, parse_pipeline handles |, and parse_command handles simple commands and compound constructs. This structure enforces precedence naturally: parse_pipeline is called inside parse_and_or, so pipelines bind tighter than logical operators. Associativity matters too; pipelines and logical operators are typically left-associative, meaning a | b | c becomes a pipeline node with three children rather than nested pipelines that reverse order.

Simple commands are particularly tricky because redirections can be interleaved with words: cmd arg1 > out arg2 < in is legal and should yield a command node with arguments [cmd, arg1, arg2] plus a list of redirections. The parser must therefore accept redirection tokens at any point in a simple command, collect them, and keep the argument order intact. If your parser instead treats redirection as terminating a command, you will mis-handle valid inputs.

Compound commands add complexity. if, while, for, and case introduce nested lists of commands, often with keywords that also appear as regular words in other contexts. The parser must decide when a token like do or then is a keyword versus just a word. Most shells restrict keyword recognition to specific grammar positions, which means the parser’s current nonterminal determines whether a token is treated as a keyword. This is a subtle but essential rule.

Error handling is a first-class concern. An interactive shell should distinguish incomplete input (missing fi, unmatched )) from invalid input. Incomplete input should trigger a continuation prompt; invalid input should report a syntax error and recover to the next command boundary. This requires the parser to track expected tokens and provide meaningful diagnostics. A good parser reports the unexpected token and the context (e.g., “unexpected | after &&”). This is not just UX; it helps you debug your own grammar.

The AST is more than a tree; it is the interface between parsing and execution. Each node type (SimpleCommand, Pipeline, AndOr, List, Subshell, If, While) should hold only what execution needs: arguments, redirections, children, and control flags. If the parser builds a clean AST, the executor can be a straightforward tree walk with predictable behavior. If the AST is messy or ambiguous, execution becomes a tangle of special cases.

How this fits into the project Parsing underpins all control flow and pipeline semantics. Without a correct AST, execution will be wrong even if your runtime is perfect.

Definitions & key terms

  • AST: Abstract Syntax Tree representing command structure.
  • Precedence: Rules that determine grouping of operators.
  • Associativity: Direction of grouping for operators of equal precedence.
  • Nonterminal: A grammar production like pipeline or and_or.

Mental model diagram

Tokens --> Parser --> AST

  a | b && c
       |
       v
     AndOr
     /    \
Pipeline   c
   /  \
  a    b

How it works (step-by-step)

  1. Consume tokens with a recursive descent parser.
  2. Build nodes for pipelines, logical operators, and lists.
  3. Attach redirections to simple commands as a list.
  4. Produce AST root for execution phase.
  5. On error, report unexpected token and recover if interactive.

Minimal concrete example

Input:  a | b && c
AST:    AndOr(Pipeline(a,b), c)

Common misconceptions

  • “Redirections only come at the end” -> they can appear anywhere.
  • “| and && have same precedence” -> pipeline binds tighter.
  • “Keywords are always keywords” -> context decides keyword meaning.

Check-your-understanding questions

  1. Why must a | b && c parse as (a | b) && c?
  2. Where should redirections be stored in the AST?
  3. How should the parser react to an unfinished if block?

Check-your-understanding answers

  1. Because pipeline has higher precedence than &&.
  2. Attached to the simple command node as a list.
  3. Mark input incomplete and request continuation.

Real-world applications

  • Shells and REPLs.
  • Configuration languages with command syntax.
  • DSLs that combine operators and commands.

Where you’ll apply it

  • In this project’s parser and in any later project that executes the AST.

References

  • POSIX Shell Command Language (grammar).
  • “Language Implementation Patterns” (recursive descent).

Key insights The parser shapes execution; a correct AST makes execution trivial.

Summary Parsing is the structural heart of a shell: it enforces precedence, groups commands, and defines control flow for the executor.

Homework/Exercises to practice the concept

  1. Write a tiny parser for pipelines and &&/|| and dump the AST.
  2. Add support for redirections in simple commands.
  3. Build a syntax error reporter that shows the unexpected token.

Solutions to the homework/exercises

  1. Use recursive descent with parse_pipeline inside parse_and_or.
  2. Allow redirection tokens anywhere in a simple command loop.
  3. Track the last token and expected set, print a clear message.

Lexing, Quoting, and Token Boundaries

Fundamentals Shell lexing is the process of turning raw input characters into tokens like words, operators, and redirections. Unlike many languages, shell lexing is context-sensitive: the meaning of a character depends on quoting state, escapes, and sometimes prior tokens. Spaces normally separate tokens, but inside single quotes they are literal, while inside double quotes they allow some expansions but still preserve whitespace. Backslashes can escape characters, and operators like |, &&, ||, ;, and > must be recognized as separate tokens unless quoted. A correct lexer is the gatekeeper for all later stages; if it merges or splits tokens incorrectly, the parser and executor will see a different command than the user intended.

Deep Dive into the concept Shell lexing is a state machine rather than a simple splitter. The lexer typically has at least four states: normal, single-quoted, double-quoted, and escape. In normal mode, whitespace ends a token, and operators start new tokens. In single quotes, everything is literal until the next single quote, meaning even backslashes and dollar signs are treated as ordinary characters. In double quotes, many characters are literal, but $, backticks, and backslash still have special meaning. The lexer must preserve enough information for later phases: for example, it should either emit “quoted” tokens or mark characters as quoted so that field splitting and globbing can respect them.

Redirection operators complicate lexing because they can be multi-character (>>, <<, >&) and can appear adjacent to words without spaces (2>file). A robust lexer treats 2> as a redirection operator with an explicit file descriptor, not as a word token. This requires the lexer to recognize numeric prefixes and operator sequences. The lexer also needs to handle comments: in POSIX shells, a # begins a comment only when it appears where a new token could start, not inside quotes. This is a subtle but important rule if you want POSIX-like behavior.

Incomplete input is another concern. If the user enters an unmatched quote or a trailing backslash, the lexer should report “incomplete” rather than “invalid” and request more input. This is a key UX feature of real shells. A lexer designed with explicit states can detect these conditions and tell the REPL to prompt with a continuation indicator (like > ) rather than failing. Handling these states early makes the parser’s job far easier, because the parser can assume the token stream is structurally complete.

The output of lexing is not just a list of strings; it is a list of typed tokens with metadata. A typical token structure might include a token type (WORD, PIPE, REDIR, AND_IF), the raw text, and a set of flags indicating which characters were quoted. This metadata determines later expansion phases. For example, if a token was inside single quotes, variable expansion must not occur; if a token is in double quotes, expansions occur but field splitting does not split on whitespace. Maintaining this metadata is the difference between a toy lexer and a usable one.

Finally, lexing must be deterministic and stable across platforms. Users expect the same input line to be tokenized the same way every time. That means your lexer should not depend on locale or encoding details unless you explicitly support them. Keeping the lexer purely ASCII with well-defined rules is a good starting point; you can expand to Unicode or locale-aware behavior later.

How this fits into the project Lexing is the front-end of your shell; it determines the correctness of parsing, expansion, and execution in every project that deals with user input.

Definitions & key terms

  • Token: A typed unit produced by the lexer.
  • Quoting state: The current rule set (normal, single, double, escape).
  • Operator token: Tokens like |, &&, ||, ;, >.
  • Comment: A # that causes the rest of the line to be ignored.

Mental model diagram

chars --> [state machine] --> tokens
               |
   states: normal, single-quoted,
           double-quoted, escape

How it works (step-by-step)

  1. Start in normal state with an empty token buffer.
  2. For each character, update state (enter/exit quotes, handle backslash).
  3. If whitespace in normal state, finalize the current token.
  4. If operator in normal state, emit operator token.
  5. At end-of-line, emit the final token or mark input incomplete.

Minimal concrete example

Input:  echo "a b" | grep b
Tokens: WORD(echo), WORD(a b [quoted]), PIPE(|), WORD(grep), WORD(b)

Common misconceptions

  • “Quotes are removed during lexing” -> quote removal happens after expansion.
  • “Backslash always escapes” -> backslash meaning differs by state.
  • “Whitespace always splits words” -> not inside quotes.

Check-your-understanding questions

  1. Why does single-quoted text disable variable expansion?
  2. How should 2>file be tokenized?
  3. When is # treated as a comment in POSIX shell?

Check-your-understanding answers

  1. Single quotes make all characters literal by definition.
  2. As a redirection operator with fd=2 and target file.
  3. Only when # appears where a new token could start.

Real-world applications

  • Shell input parsing and scripting.
  • Config languages with quoting rules.
  • Command-line interpreters in tools.

Where you’ll apply it

  • In the lexing stage that feeds this project’s parser.

References

  • POSIX Shell Command Language (quoting rules).
  • “Language Implementation Patterns” (lexing strategies).

Key insights Lexing is a state machine that preserves quoting intent, not just strings.

Summary A correct lexer handles quotes, escapes, operators, and comments with clear state transitions, enabling correct parsing and expansion later.

Homework/Exercises to practice the concept

  1. Implement a lexer that outputs token types and quoting flags.
  2. Feed it tricky inputs like echo "a b"'c' and compare outputs.
  3. Add detection for incomplete quotes and continuation prompts.

Solutions to the homework/exercises

  1. Use a state enum and emit tokens with metadata fields.
  2. Verify quoted segments remain in the same token with flags.
  3. If state is not normal at end-of-line, request more input.

3. Project Specification

3.1 What You Will Build

A parser that turns tokens into an AST with pipelines, logical operators, and simple commands.

Included:

  • Core feature set described above
  • Deterministic CLI behavior and exit codes

Excluded:

  • No execution engine; no job control.

3.2 Functional Requirements

  1. Requirement 1: Parse pipelines, &&, ||, and command lists.
  2. Requirement 2: Attach redirections to simple command nodes.
  3. Requirement 3: Detect and report syntax errors with context.
  4. Requirement 4: Support grouping constructs like subshell parentheses where applicable.
  5. Requirement 5: Provide an AST printer for debugging.

3.3 Non-Functional Requirements

  • Performance: Interactive latency under 50ms for typical inputs; parsing should scale linearly with input length.
  • Reliability: No crashes on malformed input; errors reported clearly with non-zero status.
  • Usability: Clear prompts, deterministic behavior, and predictable error messages.

3.4 Example Usage / Output

Command Line Outcome Example:

$ echo 'cd /tmp && ls | grep foo > out.txt' | ./shell_parser
AND_IF
  +-- SIMPLE: cd /tmp
  +-- PIPE
      +-- SIMPLE: ls
      +-- SIMPLE: grep foo
          +-- REDIRECT_OUT: out.txt

3.5 Data Formats / Schemas / Protocols

  • AST nodes with type tags and child pointers.

3.6 Edge Cases

  • Operator at line start
  • Missing command after pipe
  • Redirection without target

3.7 Real World Outcome

This is the exact behavior you should be able to demonstrate.

3.7.1 How to Run (Copy/Paste)

  • make
  • ./shell_parser < test_inputs.txt

3.7.2 Golden Path Demo (Deterministic)

$ echo 'cd /tmp && ls | grep foo > out.txt' | ./shell_parser
AND_IF
  +-- SIMPLE: cd /tmp
  +-- PIPE
      +-- SIMPLE: ls
      +-- SIMPLE: grep foo
          +-- REDIRECT_OUT: out.txt

3.7.3 Failure Demo (Deterministic)

$ echo 'ls |' | ./shell_parser
syntax error: unexpected end of input after '|'
$ echo $?
2

4. Solution Architecture

4.1 High-Level Design

[Input line] -> [Lexer] -> [Parser] -> [AST] -> [AST Printer]

4.2 Key Components

Component Responsibility Key Decisions
Parser Recursive descent by precedence level Clear grammar mapping.
AST Nodes Represent commands and control operators Minimal fields for execution.
Error Reporter Diagnostics and recovery Useful for interactive use.

4.3 Data Structures (No Full Code)

struct ASTNode { enum NodeType type; char **argv; struct Redir *redirs; struct ASTNode **children; size_t nchildren; };

4.4 Algorithm Overview

Key Algorithm: Recursive Descent

  1. parse_list -> parse_and_or -> parse_pipeline -> parse_command

Complexity Analysis:

  • Time: O(n) — each token is consumed exactly once
  • Space: O(n) for the AST, plus recursion depth proportional to nesting

5. Implementation Guide

5.1 Development Environment Setup

# install dependencies (if any)
# build
make

5.2 Project Structure

project-root/
├── src/
│   ├── main.c
│   ├── lexer.c
│   ├── parser.c
│   └── ast.c
├── tests/
│   └── test_basic.sh
├── Makefile
└── README.md

5.3 The Core Question You’re Answering

How do you turn a flat token stream into an executable command tree?

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. Recursive descent parsing
  2. Operator precedence
  3. AST design

5.5 Questions to Guide Your Design

  • How will each node type own and free its children?
  • Where do redirections live on a simple command node?
  • How does the parser signal incomplete versus invalid input?

5.6 Thinking Exercise

The “Precedence Trap”

Draw the AST for:

false && true || echo ok | wc -l

Questions:

  • Which operator binds first?
  • Which commands are in the pipeline?

5.7 The Interview Questions They’ll Ask

  • Why does a | b && c parse as (a | b) && c?
  • How would you attach redirections in a recursive descent parser?
  • How does a shell tell incomplete input from a syntax error?

5.8 Hints in Layers

Hint 1: Start with a grammar Write grammar rules for list, and_or, pipeline, command.

Hint 2: One function per rule Each function consumes tokens and returns an AST node.

Hint 3: Attach redirections as you parse Parse redirections in the simple command rule.

Hint 4: Add error recovery On error, skip tokens until newline or ;.

5.9 Books That Will Help

Topic Book Chapter
Parsing “Language Implementation Patterns” Ch. 3-4
AST design “Engineering a Compiler” Ch. 5
Shell grammar POSIX Shell Command Language Grammar sections

5.10 Implementation Phases

Phase 1: Foundation (2-3 days)

Goals:

  • Define data structures and interfaces
  • Build a minimal end-to-end demo

Tasks:

  1. Implement the core data structures
  2. Build a tiny CLI or harness for manual tests

Checkpoint: A demo command runs end-to-end with clear logging.

Phase 2: Core Functionality (1 week)

Goals:

  • Implement full feature set
  • Validate with unit tests

Tasks:

  1. Implement core requirements
  2. Add error handling and edge cases

Checkpoint: All functional requirements pass basic tests.

Phase 3: Polish & Edge Cases (2-4 days)

Goals:

  • Harden for weird inputs
  • Improve UX and documentation

Tasks:

  1. Add edge-case tests
  2. Document design decisions

Checkpoint: Deterministic golden demo and clean error output.

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
Parsing depth Minimal vs full Incremental Start small, expand safely
Error policy Silent vs verbose Verbose Debuggability for learners

6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Unit Tests Test individual components Lexer states, precedence nesting
Integration Tests Test component interactions Full command lines
Edge Case Tests Handle boundary conditions Empty input, bad args

6.2 Critical Test Cases

  1. Golden Path: Run the canonical demo and verify output.
  2. Failure Path: Provide invalid input and confirm error status.
  3. Stress Path: Run repeated commands to detect leaks or state corruption.

6.3 Test Data

input:  echo hello
output: SIMPLE: echo hello

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
Wrong precedence nesting a | b && c groups incorrectly Call parse_pipeline from inside parse_and_or
Keyword misrecognition then/do parsed as ordinary words Recognize keywords only in valid grammar positions
Dropped redirections cmd a > out b loses the second argument Accept redirections anywhere in a simple command

7.2 Debugging Strategies

  • Dump the AST: Print the tree after every parse in debug mode.
  • Log state transitions: Trace which grammar rule consumed each token.
  • Compare with dash: Run the same input through dash -n to check what a reference shell accepts.

7.3 Performance Traps

  • Avoid O(n^2) behavior in hot paths like token scanning and string appends.
  • Minimize allocations inside the parse loop; reuse token buffers where possible.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a help built-in with usage docs.
  • Add colored prompt themes.

8.2 Intermediate Extensions

  • Add a simple profiling mode for command timing.
  • Implement a which built-in using PATH lookup.

8.3 Advanced Extensions

  • Add programmable completion or plugin system.
  • Add a scriptable test harness with golden outputs.

9. Real-World Connections

9.1 Industry Applications

  • Build systems: shells orchestrate compilation and test pipelines.
  • DevOps automation: scripts manage deployments and infrastructure.

9.2 Reference Implementations

  • bash: The most common interactive shell.
  • dash: Minimal POSIX shell often used as /bin/sh.
  • zsh: Feature-rich interactive shell.

9.3 Interview Relevance

  • Process creation and lifecycle questions.
  • Parsing and system programming design trade-offs.

10. Resources

10.1 Essential Reading

  • “Language Implementation Patterns” by Terence Parr - focus on the chapters relevant to this project.
  • “Advanced Programming in the UNIX Environment” - process control and pipes.

10.2 Video Resources

  • Unix process model lectures (any OS course).
  • Compiler front-end videos for lexing/parsing projects.

10.3 Tools & Documentation

  • strace/dtruss: inspect syscalls.
  • man pages: fork, execve, waitpid, pipe, dup2.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain the core concept without notes.
  • I can trace a command through my subsystem.
  • I understand at least one key design trade-off.

11.2 Implementation

  • All functional requirements are met.
  • All critical tests pass.
  • Edge cases are handled cleanly.

11.3 Growth

  • I documented lessons learned.
  • I can explain this project in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Core feature works for the golden demo.
  • Errors are handled with non-zero status.
  • Code is readable and buildable.

Full Completion:

  • All functional requirements met.
  • Tests cover edge cases and failures.

Excellence (Going Above & Beyond):

  • Performance benchmarks and clear documentation.
  • Behavior compared against a reference shell.