Project 4: Build a Tree-sitter Grammar

Define a complete Tree-sitter grammar and query set for a custom language and integrate it with Neovim highlighting.

Quick Reference

Attribute Value
Difficulty Advanced
Time Estimate 5-8 days (see the phases in §5.10)
Main Programming Language JavaScript (Tree-sitter DSL)
Alternative Programming Languages None (Tree-sitter grammar DSL is JS)
Coolness Level Level 8 - You built a parser used by the editor
Business Potential Medium (language tooling foundation)
Prerequisites Grammar basics, parsing concepts, Neovim Treesitter
Key Topics CFGs, precedence, Tree-sitter DSL, queries

1. Learning Objectives

By completing this project, you will:

  1. Design a context-free grammar for a real language.
  2. Implement a Tree-sitter grammar with precedence and associativity.
  3. Write highlight and fold queries for Neovim.
  4. Handle error recovery so parsing stays usable while editing.
  5. Test grammars with tree-sitter CLI and corpus tests.
  6. Integrate the grammar into Neovim runtime files.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Context-Free Grammars (CFGs)

Description

A CFG defines a language by describing how tokens compose into higher-level constructs.

Definitions & Key Terms
  • Nonterminal: A syntactic category like expression or statement.
  • Terminal: A token like + or identifier.
  • Production: A rule like expr -> expr + term.
Mental Model Diagram (ASCII)
program
  -> statement*
statement
  -> assignment | function
expression
  -> term (('+'|'-') term)*

Grammar overview

How It Works (Step-by-Step)
  1. Identify core syntactic forms (statements, expressions).
  2. Define productions that describe valid compositions.
  3. Use recursion to represent nesting.
  4. Ensure grammar is unambiguous for common cases.
Minimal Concrete Example
expression -> term (("+"|"-") term)*
term -> number | identifier | "(" expression ")"

Expression grammar
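The same split between expression and term can be carried straight into the Tree-sitter DSL, where the nesting itself encodes precedence without any prec calls. A minimal sketch (rule names are illustrative, not part of the project spec):

// grammar.js sketch: precedence encoded by stratifying expression over term
module.exports = grammar({
  name: "xyl",
  rules: {
    expression: $ => seq($.term, repeat(seq(choice("+", "-"), $.term))),
    term: $ => choice($.number, $.identifier, seq("(", $.expression, ")")),
    number: _ => /[0-9]+/,
    identifier: _ => /[a-zA-Z_][a-zA-Z0-9_]*/,
  }
});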

Common Misconceptions
  • “CFGs define semantics” -> They define structure only.
  • “Left recursion is always bad” -> Tree-sitter’s LR-based parser handles left-recursive rules directly; it is hand-written recursive-descent parsers that cannot use them.
Check-Your-Understanding Questions
  1. Explain the difference between terminals and nonterminals.
  2. Predict what parse tree is produced for 1+2*3 with naive rules.
  3. Explain why ambiguity is a problem for incremental parsing.
Where You’ll Apply It
  • See §3.2 requirements and §5.10 Phase 1.
  • Also used in P05-lsp-server for symbol parsing.

2.2 Tree-sitter DSL and Precedence

Description

Tree-sitter uses a JavaScript DSL to describe grammar rules, precedence, and associativity.

Definitions & Key Terms
  • seq: Sequence of tokens or rules.
  • choice: Alternatives between rules.
  • prec: Precedence to resolve conflicts.
  • prec.left/right: Associativity controls.
Mental Model Diagram (ASCII)
prec.left(1, seq(expr, "+", expr))
prec.left(2, seq(expr, "*", expr))
How It Works (Step-by-Step)
  1. Start with minimal grammar rules.
  2. Add operator precedence with prec calls.
  3. Use field to label important subnodes.
  4. Run tree-sitter generate to discover conflicts.
Minimal Concrete Example
module.exports = grammar({
  name: "xyl",
  rules: {
    expression: $ => choice($.binary, $.number),
    binary: $ => prec.left(1, seq($.expression, "+", $.expression)),
    number: $ => /[0-9]+/,
  }
});
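The steps above also mention field; a sketch of how the binary rule from this example could label its children (the field names left, operator, and right are assumptions, chosen because later tooling and queries can match on them):

// sketch: a drop-in variant of the binary rule above, with labeled children
binary: $ => prec.left(1, seq(
  field("left", $.expression),
  field("operator", "+"),
  field("right", $.expression),
)),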
Common Misconceptions
  • “Precedence fixes everything” -> Some conflicts require grammar refactoring.
  • “Left recursion is always allowed” -> Left recursion itself works, but ambiguous rules still need prec or an explicit conflicts declaration, and no rule may match the empty string.
Check-Your-Understanding Questions
  1. Explain how prec.left changes parse trees.
  2. Predict what happens if two rules have the same precedence.
  3. Explain why field is helpful for later tooling.
Where You’ll Apply It
  • See §4.4 Algorithm Overview and §5.10 Phase 2.

2.3 Lexing, Tokens, and External Scanners

Description

Tree-sitter combines a built-in lexer with optional external scanners for complex tokens.

Definitions & Key Terms
  • Token: Smallest unit recognized by lexer.
  • External scanner: Custom lexer written in C for complex tokens.
  • Conflict: Overlapping token definitions.
Mental Model Diagram (ASCII)
Input stream -> Lexer -> Tokens -> Parser -> CST
How It Works (Step-by-Step)
  1. Define regex tokens for identifiers, numbers, strings.
  2. Use token and token.immediate for precise control.
  3. Add an external scanner when regex is insufficient.
Minimal Concrete Example
identifier: _ => /[a-zA-Z_][a-zA-Z0-9_]*/
string: _ => /"([^"\\]|\\.)*"/
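A sketch of how token and token.immediate could appear in grammar.js; the comment and property rules are illustrative, not part of the xyl spec. Tokens that a regex cannot express would instead be declared in an externals list and implemented in a C scanner (src/scanner.c).

// sketch: controlling tokenization from the grammar DSL
module.exports = grammar({
  name: "xyl",
  rules: {
    source_file: $ => repeat(choice($.comment, $.property)),
    // token() fuses the whole sequence into one token with no child nodes
    comment: _ => token(seq("#", /.*/)),
    // token.immediate() forbids whitespace between "." and the name
    property: _ => seq(".", token.immediate(/[a-zA-Z_][a-zA-Z0-9_]*/)),
  }
});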
Common Misconceptions
  • “Regex is always enough” -> Some tokens need custom logic.
  • “Lexing and parsing are separate” -> Tree-sitter’s lexer is context-aware: in each parse state it only looks for the tokens that are currently valid.
Check-Your-Understanding Questions
  1. Explain when you would need an external scanner.
  2. Predict what happens if two tokens overlap.
  3. Explain how token.immediate affects whitespace handling.
Where You’ll Apply It
  • See §3.5 data formats and §5.10 Phase 2.
  • Also used in P05-lsp-server for tokenization.

2.4 Error Recovery and Incremental Parsing

Description

Editors need parse trees that survive partial edits and errors.

Definitions & Key Terms
  • Error node: Node inserted when input does not match grammar.
  • Incremental parsing: Update only the affected subtree after edits.
  • Recovery rule: A deliberately permissive grammar rule (for example, an optional body) that keeps partial constructs parseable.
Mental Model Diagram (ASCII)
Before edit: [CST]
Edit region -> reparse -> [CST'] with error nodes

Incremental parse flow

How It Works (Step-by-Step)
  1. Define grammar that accepts partial constructs.
  2. Rely on Tree-sitter’s automatic ERROR and MISSING nodes instead of failing entirely.
  3. Ensure common mistakes still produce a tree.
  4. Test by editing files with syntax errors.
Minimal Concrete Example
statement -> assignment | expression
(when neither alternative matches, the parser inserts ERROR or MISSING nodes on its own rather than aborting)
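To keep trees usable you make rules permissive and let the built-in recovery do the rest. A sketch (the if_statement shape is an assumption about xyl, not part of the spec):

// sketch: a rule inside rules: { ... } that tolerates partial input
if_statement: $ => seq(
  "if",
  "(",
  optional($.expression),   // "if ()" still yields an if_statement node
  ")",
  optional($.block),        // a missing body does not break the statement
),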
Common Misconceptions
  • “Errors should fail parsing” -> Editors need trees even with errors.
  • “Incremental parsing is automatic” -> You must design for it.
Check-Your-Understanding Questions
  1. Explain why error nodes are useful in an editor.
  2. Predict the parse tree for an incomplete if (.
  3. Explain how incremental parsing improves performance.
Where You’ll Apply It
  • See §3.6 edge cases and §5.10 Phase 2.

2.5 Query Files and Highlight Captures

Description

Queries map syntax nodes to highlight groups for Neovim.

Definitions & Key Terms
  • Query: Pattern matching syntax nodes.
  • Capture: Named label for matched nodes.
  • Highlights.scm: Query file for highlight groups.
Mental Model Diagram (ASCII)
(node) @keyword
(identifier) @variable
How It Works (Step-by-Step)
  1. Inspect parse tree with tree-sitter parse.
  2. Write highlight queries to capture node types.
  3. Map captures to Neovim highlight groups.
  4. Verify with :InspectTree in Neovim.
Minimal Concrete Example
(function_declaration name: (identifier) @function)
"if" @keyword
Common Misconceptions
  • “Queries are optional” -> Without them, no highlighting.
  • “Captures are arbitrary” -> They map to known highlight groups.
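Returning to the example above, a sketch of what a fuller queries/xyl/highlights.scm could contain; the node and keyword names are assumptions, while the capture names are the conventional groups Neovim’s highlighter recognizes:

; queries/xyl/highlights.scm (sketch)
["if" "else" "function"] @keyword
(number) @number
(string) @string
(comment) @comment
(identifier) @variable
(function_declaration name: (identifier) @function)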
Check-Your-Understanding Questions
  1. Explain why captures map to highlight groups.
  2. Predict what happens if a capture name is misspelled.
  3. Explain how to highlight only keywords.
Where You’ll Apply It
  • See §3.5 data formats and §5.10 Phase 3.

2.6 Testing with tree-sitter CLI and Corpus Tests

Description

Grammar tests ensure stability and prevent regressions.

Definitions & Key Terms
  • Corpus test: Input/output examples checked by Tree-sitter.
  • tree-sitter test: Command to validate grammar outputs.
Mental Model Diagram (ASCII)
Input file -> expected CST -> compare -> pass/fail
How It Works (Step-by-Step)
  1. Create corpus/*.txt with input and expected CST.
  2. Run tree-sitter test.
  3. Fix grammar until tests pass.
Minimal Concrete Example
===
example
===
let x = 1
---
(program (assignment ...))

Example source and parse output

Common Misconceptions
  • “Manual inspection is enough” -> Tests catch regressions.
  • “Corpus tests are optional” -> They are critical for long-term stability.
Check-Your-Understanding Questions
  1. Explain why corpus tests are important for grammar evolution.
  2. Predict what happens if the CST changes after a rule tweak.
  3. Explain how to update corpus tests safely.
Where You’ll Apply It
  • See §6 Testing Strategy and §5.10 Phase 2.

3. Project Specification

3.1 What You Will Build

A Tree-sitter grammar for a custom language “xyl” with:

  • Statements, expressions, and functions
  • Strings, numbers, identifiers
  • Highlight and fold queries

Included: core grammar, queries, basic error recovery. Excluded: full language tooling or LSP features.

3.2 Functional Requirements

  1. Grammar parses valid xyl files.
  2. Highlight queries map keywords, strings, and functions.
  3. Folding queries identify blocks.
  4. Error recovery keeps CST usable with syntax errors.

3.3 Non-Functional Requirements

  • Performance: Incremental parsing for large files.
  • Reliability: No unresolved conflicts.
  • Usability: Clear highlight groups and folds.

3.4 Example Usage / Output

$ tree-sitter parse examples/sample.xyl
(program (function_declaration name: (identifier) ...))

3.5 Data Formats / Schemas / Protocols

  • grammar.js for syntax rules.
  • queries/xyl/highlights.scm for highlights.
  • queries/xyl/folds.scm for folds (see the sketch below).
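A fold query can be very small. A sketch assuming the grammar exposes block and function_declaration nodes; nvim-treesitter folds any node captured as @fold:

; queries/xyl/folds.scm (sketch; node names are assumptions)
[
  (block)
  (function_declaration)
] @fold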

3.6 Edge Cases

  • Unterminated strings.
  • Nested expressions with ambiguous operators.
  • Incomplete block at EOF.

3.7 Real World Outcome

You can open .xyl files in Neovim with proper syntax highlighting.

3.7.1 How to Run (Copy/Paste)

npm install
tree-sitter generate
nvim examples/sample.xyl

For highlighting to appear, the generated parser must also be built and registered with Neovim, and the queries/ directory must be on the runtimepath (see §5.10 Phase 3).

3.7.2 Golden Path Demo (Deterministic)

  • Use examples/sample.xyl with fixed content.
  • Run tree-sitter parse and compare to expected output.

Expected:

  • Parse tree matches corpus/sample.txt exactly.

3.7.3 Failure Demo

$ tree-sitter generate
Error: Unresolved conflict for symbol "expression"
Exit code: 1

Exit codes:

  • 0 success
  • 1 grammar conflict
  • 2 parse failure

4. Solution Architecture

4.1 High-Level Design

[grammar.js] -> [tree-sitter generate] -> [parser]
[queries] -> [Neovim highlight engine]

Tree-sitter generation pipeline

4.2 Key Components

Component | Responsibility            | Key Decisions
Grammar   | Rules + precedence        | Minimal conflicts
Lexer     | Tokens and keywords       | Regex vs external scanner
Queries   | Highlight + fold captures | Stable capture names
Tests     | Corpus verification       | Coverage of syntax

4.3 Data Structures (No Full Code)

CST nodes: program, function_declaration, block, expression

4.4 Algorithm Overview

Key Algorithm: Expression Precedence

  1. Define base expressions (numbers, identifiers).
  2. Define binary expressions with prec.left.
  3. Ensure higher-precedence operators have higher prec values (see the sketch below).
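A common idiom is to gather the levels in one constant so the ordering is visible at a glance. A sketch assuming xyl has +, -, *, and / (the PREC names are illustrative):

// sketch: one precedence table shared by all binary rules
const PREC = { additive: 1, multiplicative: 2 };

module.exports = grammar({
  name: "xyl",
  rules: {
    expression: $ => choice($.binary_expression, $.number),
    binary_expression: $ => choice(
      prec.left(PREC.multiplicative, seq($.expression, choice("*", "/"), $.expression)),
      prec.left(PREC.additive, seq($.expression, choice("+", "-"), $.expression)),
    ),
    number: _ => /[0-9]+/,
  }
});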

Complexity Analysis:

  • Time: O(n) per parse pass
  • Space: O(n) for CST

5. Implementation Guide

5.1 Development Environment Setup

node --version
npm --version
tree-sitter --version   # if the CLI is missing: npm install -g tree-sitter-cli

5.2 Project Structure

/tree-sitter-xyl/
├── grammar.js
├── package.json
├── src/                  (generated by tree-sitter generate)
├── queries/
│   └── xyl/
│       ├── highlights.scm
│       └── folds.scm
├── corpus/
│   └── sample.txt
└── examples/
    └── sample.xyl

5.3 The Core Question You’re Answering

“How do we describe a language precisely enough that the editor can parse it in real time?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. CFG basics and ambiguity
  2. Tree-sitter precedence rules
  3. Query capture names

5.5 Questions to Guide Your Design

  1. What are the minimal syntax constructs for your language?
  2. Where is ambiguity likely (binary ops, call vs grouping)?
  3. What highlight groups matter most for readability?

5.6 Thinking Exercise

Design a tiny task-list language and list the node types you would need:

  • task
  • tag
  • due_date

5.7 The Interview Questions They’ll Ask

  1. Why is incremental parsing valuable in editors?
  2. How do you resolve grammar conflicts in Tree-sitter?
  3. What is the difference between CST and AST?

5.8 Hints in Layers

Hint 1: Start minimal. Define only program, statement, and expression.

Hint 2: Add precedence carefully. Use prec.left for operators.

Hint 3: Write corpus tests early. They will guide your grammar updates.

Hint 4: Queries last. Stabilize the grammar before writing highlight queries.

5.9 Books That Will Help

Topic        | Book                               | Chapter
Parsing      | Engineering a Compiler             | Ch. 2-4
Syntax trees | Compilers: Principles and Practice | Ch. 4
DSL design   | Language Implementation Patterns   | Ch. 1-2

5.10 Implementation Phases

Phase 1: Grammar Skeleton (2-3 days)

Goals:

  • Basic CFG
  • Sample parsing

Tasks:

  1. Define tokens and identifiers.
  2. Add core statement forms.
  3. Generate parser and parse sample input.

Checkpoint: tree-sitter parse works on sample files.

Phase 2: Precedence and Error Recovery (2-3 days)

Goals:

  • Resolve conflicts
  • Add error nodes

Tasks:

  1. Add precedence rules for expressions.
  2. Introduce error recovery rules.
  3. Update corpus tests.

Checkpoint: No unresolved conflicts.

Phase 3: Queries and Integration (1-2 days)

Goals:

  • Highlights and folds

Tasks:

  1. Write highlight queries.
  2. Add fold queries.
  3. Load the grammar in the Neovim runtime (see the sketch below).

Checkpoint: Neovim highlights .xyl correctly.
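One way to register the locally built parser is through nvim-treesitter (this follows the master-branch APIs; the newer main branch differs). This is a sketch: the repository path is an assumption, and the queries/ directory must also end up on Neovim's runtimepath (for example by loading this repo as a plugin or copying queries/xyl/*.scm into your config).

-- sketch (Lua): tell nvim-treesitter about the local grammar, then run :TSInstall xyl
local parser_config = require("nvim-treesitter.parsers").get_parser_configs()
parser_config.xyl = {
  install_info = {
    url = "~/projects/tree-sitter-xyl", -- local path or git URL of this repo
    files = { "src/parser.c" },
  },
  filetype = "xyl",
}

-- map the .xyl extension to the xyl filetype so the right queries are used
vim.filetype.add({ extension = { xyl = "xyl" } })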

5.11 Key Implementation Decisions

Decision             | Options                      | Recommendation        | Rationale
Ambiguity resolution | precedence, grammar refactor | precedence + refactor | clearer parse trees
Error handling       | strict, permissive           | permissive            | better editor UX
Query scope          | minimal, rich                | minimal first         | stabilize grammar

6. Testing Strategy

6.1 Test Categories

Category          | Purpose               | Examples
Grammar Tests     | Validate parse output | corpus tests
Integration Tests | Neovim highlighting   | open files
Edge Case Tests   | Broken syntax         | unterminated strings

6.2 Critical Test Cases

  1. Nested expressions with mixed precedence.
  2. Unterminated string errors still produce CST.
  3. Fold queries match blocks.

6.3 Test Data

corpus/
  sample.txt
  errors.txt
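A sketch of an entry that could live in corpus/errors.txt. The expected tree is illustrative only; the exact node names and error placement depend on your grammar, so run tree-sitter test once the grammar stabilizes and copy the real output.

==================
unterminated string
==================

let s = "abc

---

(program
  (assignment
    name: (identifier)
    (ERROR)))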

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall            | Symptom            | Solution
Grammar conflict   | generate fails     | Add precedence or refactor
No highlighting    | Queries not loaded | Check runtimepath
Overlapping tokens | Wrong parse        | Adjust regexes

7.2 Debugging Strategies

  • Use tree-sitter parse frequently.
  • Inspect parse trees with :InspectTree in Neovim.
  • Add corpus tests for every new feature.

7.3 Performance Traps

  • Excessive use of choice with many alternatives.
  • Highly ambiguous rules that force the GLR parser to explore many alternative parses.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add comments and highlight them.
  • Add basic string interpolation support.

8.2 Intermediate Extensions

  • Add external scanner for complex tokens.
  • Add language injection queries.

8.3 Advanced Extensions

  • Add error recovery tuned for common mistakes.
  • Generate an AST for use by an LSP.

9. Real-World Connections

9.1 Industry Applications

  • Syntax highlighting engines: Tree-sitter is used in Neovim, Helix.
  • IDE tooling: Grammars enable folding, navigation, and linting.

9.2 Open Source Examples

  • tree-sitter-javascript: Example of a complex grammar.
  • nvim-treesitter: Neovim integration layer.

9.3 Interview Relevance

  • Compiler fundamentals: CFG, parsing, ambiguity.
  • Language tooling: highlighting and editor integration.

10. Resources

10.1 Essential Reading

  • Tree-sitter documentation (grammar and queries)
  • Engineering a Compiler - Ch. 2-4

10.2 Video Resources

  • Tree-sitter grammar walkthroughs (search: “tree-sitter grammar tutorial”)

10.3 Tools & Documentation

  • tree-sitter generate
  • tree-sitter test
  • Neovim :help treesitter

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain CFGs and precedence rules.
  • I can write highlight queries for a CST.
  • I can explain incremental parsing.

11.2 Implementation

  • Grammar generates without conflicts.
  • Highlights and folds work in Neovim.
  • Errors do not break the CST.

11.3 Growth

  • I can add a new language feature end-to-end.
  • I can explain this project in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Grammar parses sample files.
  • Highlight queries work in Neovim.
  • tree-sitter test passes.

Full Completion:

  • All minimum criteria plus:
  • Error recovery handles incomplete code.
  • Fold queries work correctly.

Excellence (Going Above & Beyond):

  • External scanner for complex tokens.
  • Language injection queries.

13. Determinism and Reproducibility Notes

  • Use fixed examples/sample.xyl and corpus/sample.txt for golden demos.
  • Do not include timestamps in test output.
  • Failure demo uses a known conflict to produce a stable error.