Project 4: Build a Tree-sitter Grammar

Define a complete Tree-sitter grammar and query set for a custom language and integrate it with Neovim highlighting.

Quick Reference

Attribute Value
Difficulty Advanced
Time Estimate 5-8 days (see the phases in §5.10)
Main Programming Language JavaScript (Tree-sitter DSL)
Alternative Programming Languages None (Tree-sitter grammar DSL is JS)
Coolness Level Level 8 - You built a parser used by the editor
Business Potential Medium (language tooling foundation)
Prerequisites Grammar basics, parsing concepts, Neovim Treesitter
Key Topics CFGs, precedence, Tree-sitter DSL, queries

1. Learning Objectives

By completing this project, you will:

  1. Design a context-free grammar for a real language.
  2. Implement a Tree-sitter grammar with precedence and associativity.
  3. Write highlight and fold queries for Neovim.
  4. Handle error recovery so parsing stays usable while editing.
  5. Test grammars with tree-sitter CLI and corpus tests.
  6. Integrate the grammar into Neovim runtime files.

2. All Theory Needed (Per-Concept Breakdown)

2.1 Context-Free Grammars (CFGs)

Description

A CFG defines a language by describing how tokens compose into higher-level constructs.

Definitions & Key Terms
  • Nonterminal: A syntactic category like expression or statement.
  • Terminal: A token like + or identifier.
  • Production: A rule like expr -> expr + term.
Mental Model Diagram (ASCII)
program
  -> statement*
statement
  -> assignment | function
expression
  -> term (('+'|'-') term)*

Grammar overview

How It Works (Step-by-Step)
  1. Identify core syntactic forms (statements, expressions).
  2. Define productions that describe valid compositions.
  3. Use recursion to represent nesting.
  4. Ensure grammar is unambiguous for common cases.
Minimal Concrete Example
expression -> term (("+"|"-") term)*
term -> number | identifier | "(" expression ")"

Expression grammar
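The same split between expression and term can be carried straight into the Tree-sitter DSL, where the nesting itself encodes precedence without any prec calls. A minimal sketch (rule names are illustrative, not part of the project spec):

// grammar.js sketch: precedence encoded by stratifying expression over term
module.exports = grammar({
  name: "xyl",
  rules: {
    expression: $ => seq($.term, repeat(seq(choice("+", "-"), $.term))),
    term: $ => choice($.number, $.identifier, seq("(", $.expression, ")")),
    number: _ => /[0-9]+/,
    identifier: _ => /[a-zA-Z_][a-zA-Z0-9_]*/,
  }
});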

Common Misconceptions
  • “CFGs define semantics” -> They define structure only.
  • “Left recursion is always bad” -> Tree-sitter’s LR-based parser handles left-recursive rules directly; it is hand-written recursive-descent parsers that cannot use them.
Check-Your-Understanding Questions
  1. Explain the difference between terminals and nonterminals.
  2. Predict what parse tree is produced for 1+2*3 with naive rules.
  3. Explain why ambiguity is a problem for incremental parsing.
Where You’ll Apply It
  • See §3.2 requirements and §5.10 Phase 1.
  • Also used in P05-lsp-server for symbol parsing.

2.2 Tree-sitter DSL and Precedence

Description

Tree-sitter uses a JavaScript DSL to describe grammar rules, precedence, and associativity.

Definitions & Key Terms
  • seq: Sequence of tokens or rules.
  • choice: Alternatives between rules.
  • prec: Precedence to resolve conflicts.
  • prec.left/right: Associativity controls.
Mental Model Diagram (ASCII)
prec.left(1, seq(expr, "+", expr))
prec.left(2, seq(expr, "*", expr))
How It Works (Step-by-Step)
  1. Start with minimal grammar rules.
  2. Add operator precedence with prec calls.
  3. Use field to label important subnodes.
  4. Run tree-sitter generate to discover conflicts.
Minimal Concrete Example
module.exports = grammar({
  name: "xyl",
  rules: {
    expression: $ => choice($.binary, $.number),
    binary: $ => prec.left(1, seq($.expression, "+", $.expression)),
    number: $ => /[0-9]+/,
  }
});
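The steps above also mention field; a sketch of how the binary rule from this example could label its children (the field names left, operator, and right are assumptions, chosen because later tooling and queries can match on them):

// sketch: a drop-in variant of the binary rule above, with labeled children
binary: $ => prec.left(1, seq(
  field("left", $.expression),
  field("operator", "+"),
  field("right", $.expression),
)),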
Common Misconceptions
  • “Precedence fixes everything” -> Some conflicts require grammar refactoring.
  • “Left recursion is always allowed” -> Left recursion itself works, but ambiguous rules still need prec or an explicit conflicts declaration, and no rule may match the empty string.
Check-Your-Understanding Questions
  1. Explain how prec.left changes parse trees.
  2. Predict what happens if two rules have the same precedence.
  3. Explain why field is helpful for later tooling.
Where You’ll Apply It
  • See §4.4 Algorithm Overview and §5.10 Phase 2.

2.3 Lexing, Tokens, and External Scanners

Description

Tree-sitter combines a built-in lexer with optional external scanners for complex tokens.

Definitions & Key Terms
  • Token: Smallest unit recognized by lexer.
  • External scanner: Custom lexer written in C for complex tokens.
  • Conflict: Overlapping token definitions.
Mental Model Diagram (ASCII)
Input stream -> Lexer -> Tokens -> Parser -> CST
How It Works (Step-by-Step)
  1. Define regex tokens for identifiers, numbers, strings.
  2. Use token and token.immediate for precise control.
  3. Add an external scanner when regex is insufficient.
Minimal Concrete Example
identifier: _ => /[a-zA-Z_][a-zA-Z0-9_]*/
string: _ => /"([^"\\]|\\.)*"/
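A sketch of how token and token.immediate could appear in grammar.js; the comment and property rules are illustrative, not part of the xyl spec. Tokens that a regex cannot express would instead be declared in an externals list and implemented in a C scanner (src/scanner.c).

// sketch: controlling tokenization from the grammar DSL
module.exports = grammar({
  name: "xyl",
  rules: {
    source_file: $ => repeat(choice($.comment, $.property)),
    // token() fuses the whole sequence into one token with no child nodes
    comment: _ => token(seq("#", /.*/)),
    // token.immediate() forbids whitespace between "." and the name
    property: _ => seq(".", token.immediate(/[a-zA-Z_][a-zA-Z0-9_]*/)),
  }
});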
Common Misconceptions
  • “Regex is always enough” -> Some tokens need custom logic.
  • “Lexing and parsing are separate” -> Tree-sitter’s lexer is context-aware: in each parse state it only looks for the tokens that are currently valid.
Check-Your-Understanding Questions
  1. Explain when you would need an external scanner.
  2. Predict what happens if two tokens overlap.
  3. Explain how token.immediate affects whitespace handling.
Where You’ll Apply It
  • See §3.5 data formats and §5.10 Phase 2.
  • Also used in P05-lsp-server for tokenization.

2.4 Error Recovery and Incremental Parsing

Description

Editors need parse trees that survive partial edits and errors.

Definitions & Key Terms
  • Error node: Node inserted when input does not match grammar.
  • Incremental parsing: Update only the affected subtree after edits.
  • Recovery rule: A deliberately permissive grammar rule (for example, an optional body) that keeps partial constructs parseable.
Mental Model Diagram (ASCII)
Before edit: [CST]
Edit region -> reparse -> [CST'] with error nodes

Incremental parse flow

How It Works (Step-by-Step)
  1. Define grammar that accepts partial constructs.
  2. Rely on Tree-sitter’s automatic ERROR and MISSING nodes instead of failing entirely.
  3. Ensure common mistakes still produce a tree.
  4. Test by editing files with syntax errors.
Minimal Concrete Example
statement -> assignment | expression
(when neither alternative matches, the parser inserts ERROR or MISSING nodes on its own rather than aborting)
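To keep trees usable you make rules permissive and let the built-in recovery do the rest. A sketch (the if_statement shape is an assumption about xyl, not part of the spec):

// sketch: a rule inside rules: { ... } that tolerates partial input
if_statement: $ => seq(
  "if",
  "(",
  optional($.expression),   // "if ()" still yields an if_statement node
  ")",
  optional($.block),        // a missing body does not break the statement
),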
Common Misconceptions
  • “Errors should fail parsing” -> Editors need trees even with errors.
  • “Incremental parsing is automatic” -> You must design for it.
Check-Your-Understanding Questions
  1. Explain why error nodes are useful in an editor.
  2. Predict the parse tree for an incomplete if (.
  3. Explain how incremental parsing improves performance.
Where You’ll Apply It
  • See §3.6 edge cases and §5.10 Phase 2.

2.5 Query Files and Highlight Captures

Description

Queries map syntax nodes to highlight groups for Neovim.

Definitions & Key Terms
  • Query: Pattern matching syntax nodes.
  • Capture: Named label for matched nodes.
  • Highlights.scm: Query file for highlight groups.
Mental Model Diagram (ASCII)
(node) @keyword
(identifier) @variable
How It Works (Step-by-Step)
  1. Inspect parse tree with tree-sitter parse.
  2. Write highlight queries to capture node types.
  3. Map captures to Neovim highlight groups.
  4. Verify with :InspectTree in Neovim.
Minimal Concrete Example
(function_declaration name: (identifier) @function)
"if" @keyword
Common Misconceptions
  • “Queries are optional” -> Without them, no highlighting.
  • “Captures are arbitrary” -> They map to known highlight groups.
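Returning to the example above, a sketch of what a fuller queries/xyl/highlights.scm could contain; the node and keyword names are assumptions, while the capture names are the conventional groups Neovim’s highlighter recognizes:

; queries/xyl/highlights.scm (sketch)
["if" "else" "function"] @keyword
(number) @number
(string) @string
(comment) @comment
(identifier) @variable
(function_declaration name: (identifier) @function)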
Check-Your-Understanding Questions
  1. Explain why captures map to highlight groups.
  2. Predict what happens if a capture name is misspelled.
  3. Explain how to highlight only keywords.
Where You’ll Apply It
  • See §3.5 data formats and §5.10 Phase 3.

2.6 Testing with tree-sitter CLI and Corpus Tests

Description

Grammar tests ensure stability and prevent regressions.

Definitions & Key Terms
  • Corpus test: Input/output examples checked by Tree-sitter.
  • tree-sitter test: Command to validate grammar outputs.
Mental Model Diagram (ASCII)
Input file -> expected CST -> compare -> pass/fail
How It Works (Step-by-Step)
  1. Create corpus/*.txt with input and expected CST.
  2. Run tree-sitter test.
  3. Fix grammar until tests pass.
Minimal Concrete Example
===
example
===
let x = 1
---
(program (assignment ...))

Example source and parse output

Common Misconceptions
  • “Manual inspection is enough” -> Tests catch regressions.
  • “Corpus tests are optional” -> They are critical for long-term stability.
Check-Your-Understanding Questions
  1. Explain why corpus tests are important for grammar evolution.
  2. Predict what happens if the CST changes after a rule tweak.
  3. Explain how to update corpus tests safely.
Where You’ll Apply It
  • See §6 Testing Strategy and §5.10 Phase 2.

3. Project Specification

3.1 What You Will Build

A Tree-sitter grammar for a custom language “xyl” with:

  • Statements, expressions, and functions
  • Strings, numbers, identifiers
  • Highlight and fold queries

Included: core grammar, queries, basic error recovery. Excluded: full language tooling or LSP features.

3.2 Functional Requirements

  1. Grammar parses valid xyl files.
  2. Highlight queries map keywords, strings, and functions.
  3. Folding queries identify blocks.
  4. Error recovery keeps CST usable with syntax errors.

3.3 Non-Functional Requirements

  • Performance: Incremental parsing for large files.
  • Reliability: No unresolved conflicts.
  • Usability: Clear highlight groups and folds.

3.4 Example Usage / Output

$ tree-sitter parse examples/sample.xyl
(program (function_declaration name: (identifier) ...))

3.5 Data Formats / Schemas / Protocols

  • grammar.js for syntax rules.
  • queries/xyl/highlights.scm for highlights.
  • queries/xyl/folds.scm for folds (see the sketch below).
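A fold query can be very small. A sketch assuming the grammar exposes block and function_declaration nodes; nvim-treesitter folds any node captured as @fold:

; queries/xyl/folds.scm (sketch; node names are assumptions)
[
  (block)
  (function_declaration)
] @fold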

3.6 Edge Cases

  • Unterminated strings.
  • Nested expressions with ambiguous operators.
  • Incomplete block at EOF.

3.7 Real World Outcome

You can open .xyl files in Neovim with proper syntax highlighting.

3.7.1 How to Run (Copy/Paste)

npm install
tree-sitter generate
nvim examples/sample.xyl

For highlighting to appear, the generated parser must also be built and registered with Neovim, and the queries/ directory must be on the runtimepath (see §5.10 Phase 3).

3.7.2 Golden Path Demo (Deterministic)

  • Use examples/sample.xyl with fixed content.
  • Run tree-sitter parse and compare to expected output.

Expected:

  • Parse tree matches corpus/sample.txt exactly.

3.7.3 Failure Demo

$ tree-sitter generate
Error: Unresolved conflict for symbol "expression"
Exit code: 1

Exit codes:

  • 0 success
  • 1 grammar conflict
  • 2 parse failure

4. Solution Architecture

4.1 High-Level Design

[grammar.js] -> [tree-sitter generate] -> [parser]
[queries] -> [Neovim highlight engine]

Tree-sitter generation pipeline

4.2 Key Components

Component | Responsibility            | Key Decisions
Grammar   | Rules + precedence        | Minimal conflicts
Lexer     | Tokens and keywords       | Regex vs external scanner
Queries   | Highlight + fold captures | Stable capture names
Tests     | Corpus verification       | Coverage of syntax

4.3 Data Structures (No Full Code)

CST nodes: program, function_declaration, block, expression

4.4 Algorithm Overview

Key Algorithm: Expression Precedence

  1. Define base expressions (numbers, identifiers).
  2. Define binary expressions with prec.left.
  3. Ensure higher-precedence operators have higher prec values (see the sketch below).
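A common idiom is to gather the levels in one constant so the ordering is visible at a glance. A sketch assuming xyl has +, -, *, and / (the PREC names are illustrative):

// sketch: one precedence table shared by all binary rules
const PREC = { additive: 1, multiplicative: 2 };

module.exports = grammar({
  name: "xyl",
  rules: {
    expression: $ => choice($.binary_expression, $.number),
    binary_expression: $ => choice(
      prec.left(PREC.multiplicative, seq($.expression, choice("*", "/"), $.expression)),
      prec.left(PREC.additive, seq($.expression, choice("+", "-"), $.expression)),
    ),
    number: _ => /[0-9]+/,
  }
});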

Complexity Analysis:

  • Time: O(n) per parse pass
  • Space: O(n) for CST

5. Implementation Guide

5.1 Development Environment Setup

node --version
npm --version
tree-sitter --version   # if the CLI is missing: npm install -g tree-sitter-cli

5.2 Project Structure

/tree-sitter-xyl/
├── grammar.js
├── package.json
├── src/                  (generated by tree-sitter generate)
├── queries/
│   └── xyl/
│       ├── highlights.scm
│       └── folds.scm
├── corpus/
│   └── sample.txt
└── examples/
    └── sample.xyl

5.3 The Core Question You’re Answering

“How do we describe a language precisely enough that the editor can parse it in real time?”

5.4 Concepts You Must Understand First

Stop and research these before coding:

  1. CFG basics and ambiguity
  2. Tree-sitter precedence rules
  3. Query capture names

5.5 Questions to Guide Your Design

  1. What are the minimal syntax constructs for your language?
  2. Where is ambiguity likely (binary ops, call vs grouping)?
  3. What highlight groups matter most for readability?

5.6 Thinking Exercise

Design a tiny task-list language and list the node types you would need:

  • task
  • tag
  • due_date

5.7 The Interview Questions They’ll Ask

  1. Why is incremental parsing valuable in editors?
  2. How do you resolve grammar conflicts in Tree-sitter?
  3. What is the difference between CST and AST?

5.8 Hints in Layers

Hint 1: Start minimal. Define only program, statement, and expression.

Hint 2: Add precedence carefully. Use prec.left for operators.

Hint 3: Write corpus tests early. They will guide your grammar updates.

Hint 4: Queries last. Stabilize the grammar before writing highlight queries.

5.9 Books That Will Help

Topic        | Book                               | Chapter
Parsing      | Engineering a Compiler             | Ch. 2-4
Syntax trees | Compilers: Principles and Practice | Ch. 4
DSL design   | Language Implementation Patterns   | Ch. 1-2

5.10 Implementation Phases

Phase 1: Grammar Skeleton (2-3 days)

Goals:

  • Basic CFG
  • Sample parsing

Tasks:

  1. Define tokens and identifiers.
  2. Add core statement forms.
  3. Generate parser and parse sample input.

Checkpoint: tree-sitter parse works on sample files.

Phase 2: Precedence and Error Recovery (2-3 days)

Goals:

  • Resolve conflicts
  • Add error nodes

Tasks:

  1. Add precedence rules for expressions.
  2. Introduce error recovery rules.
  3. Update corpus tests.

Checkpoint: No unresolved conflicts.

Phase 3: Queries and Integration (1-2 days)

Goals:

  • Highlights and folds

Tasks:

  1. Write highlight queries.
  2. Add fold queries.
  3. Load the grammar in the Neovim runtime (see the sketch below).

Checkpoint: Neovim highlights .xyl correctly.
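One way to register the locally built parser is through nvim-treesitter (this follows the master-branch APIs; the newer main branch differs). This is a sketch: the repository path is an assumption, and the queries/ directory must also end up on Neovim's runtimepath (for example by loading this repo as a plugin or copying queries/xyl/*.scm into your config).

-- sketch (Lua): tell nvim-treesitter about the local grammar, then run :TSInstall xyl
local parser_config = require("nvim-treesitter.parsers").get_parser_configs()
parser_config.xyl = {
  install_info = {
    url = "~/projects/tree-sitter-xyl", -- local path or git URL of this repo
    files = { "src/parser.c" },
  },
  filetype = "xyl",
}

-- map the .xyl extension to the xyl filetype so the right queries are used
vim.filetype.add({ extension = { xyl = "xyl" } })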

5.11 Key Implementation Decisions

Decision             | Options                      | Recommendation        | Rationale
Ambiguity resolution | precedence, grammar refactor | precedence + refactor | clearer parse trees
Error handling       | strict, permissive           | permissive            | better editor UX
Query scope          | minimal, rich                | minimal first         | stabilize grammar

6. Testing Strategy

6.1 Test Categories

Category          | Purpose               | Examples
Grammar Tests     | Validate parse output | corpus tests
Integration Tests | Neovim highlighting   | open files
Edge Case Tests   | Broken syntax         | unterminated strings

6.2 Critical Test Cases

  1. Nested expressions with mixed precedence.
  2. Unterminated string errors still produce CST.
  3. Fold queries match blocks.

6.3 Test Data

corpus/
  sample.txt
  errors.txt
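A sketch of an entry that could live in corpus/errors.txt. The expected tree is illustrative only; the exact node names and error placement depend on your grammar, so run tree-sitter test once the grammar stabilizes and copy the real output.

==================
unterminated string
==================

let s = "abc

---

(program
  (assignment
    name: (identifier)
    (ERROR)))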

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall            | Symptom            | Solution
Grammar conflict   | generate fails     | Add precedence or refactor
No highlighting    | Queries not loaded | Check runtimepath
Overlapping tokens | Wrong parse        | Adjust regexes

7.2 Debugging Strategies

  • Use tree-sitter parse frequently.
  • Inspect parse trees with :InspectTree in Neovim.
  • Add corpus tests for every new feature.

7.3 Performance Traps

  • Excessive use of choice with many alternatives.
  • Highly ambiguous rules that force the GLR parser to explore many alternative parses.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add comments and highlight them.
  • Add basic string interpolation support.

8.2 Intermediate Extensions

  • Add external scanner for complex tokens.
  • Add language injection queries.

8.3 Advanced Extensions

  • Add error recovery tuned for common mistakes.
  • Generate an AST for use by an LSP.

9. Real-World Connections

9.1 Industry Applications

  • Syntax highlighting engines: Tree-sitter is used in Neovim, Helix.
  • IDE tooling: Grammars enable folding, navigation, and linting.

9.2 Open Source Examples

  • tree-sitter-javascript: Example of a complex grammar.
  • nvim-treesitter: Neovim integration layer.

9.3 Interview Relevance

  • Compiler fundamentals: CFG, parsing, ambiguity.
  • Language tooling: highlighting and editor integration.

10. Resources

10.1 Essential Reading

  • Tree-sitter documentation (grammar and queries)
  • Engineering a Compiler - Ch. 2-4

10.2 Video Resources

  • Tree-sitter grammar walkthroughs (search: “tree-sitter grammar tutorial”)

10.3 Tools & Documentation

  • tree-sitter generate
  • tree-sitter test
  • Neovim :help treesitter

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain CFGs and precedence rules.
  • I can write highlight queries for a CST.
  • I can explain incremental parsing.

11.2 Implementation

  • Grammar generates without conflicts.
  • Highlights and folds work in Neovim.
  • Errors do not break the CST.

11.3 Growth

  • I can add a new language feature end-to-end.
  • I can explain this project in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Grammar parses sample files.
  • Highlight queries work in Neovim.
  • tree-sitter test passes.

Full Completion:

  • All minimum criteria plus:
  • Error recovery handles incomplete code.
  • Fold queries work correctly.

Excellence (Going Above & Beyond):

  • External scanner for complex tokens.
  • Language injection queries.

13. Determinism and Reproducibility Notes

  • Use fixed examples/sample.xyl and corpus/sample.txt for golden demos.
  • Do not include timestamps in test output.
  • Failure demo uses a known conflict to produce a stable error.