Project 3: Filter Expression Language

Build an expression DSL with precedence-aware parsing and AST evaluation for dynamic filtering rules.

Quick Reference

Attribute Value
Difficulty Level 2: Intermediate
Time Estimate 1-2 weeks
Main Programming Language Python
Alternative Programming Languages TypeScript, Rust, Go
Coolness Level Level 3: Genuinely Clever
Business Potential 4. The “Open Core” Infrastructure
Prerequisites Project 2 parsing fundamentals, recursion, tree traversal
Key Topics Operator precedence, AST construction, interpreter semantics, extensibility

1. Learning Objectives

By completing this project, you will:

  1. Parse boolean/filter expressions into a precise AST.
  2. Implement precedence and associativity rules correctly.
  3. Evaluate expressions against structured records deterministically.
  4. Add operators without breaking grammar stability.
  5. Build a reusable expression engine for rule systems in Project 4 and Project 7.

2. All Theory Needed (Per-Concept Breakdown)

Operator Precedence and Expression Grammar Design

Fundamentals Expression languages look simple but fail quickly without clear precedence rules. AND, OR, comparison operators, and parentheses must map to unambiguous parse trees. Precedence defines which operators bind first; associativity defines how equal-precedence operators group. Without formal rules, two implementations can produce different ASTs for the same string, which breaks trust and testability. A DSL becomes useful only when authors can predict behavior. This project teaches grammar design choices that make expression evaluation deterministic and extensible.

Deep Dive into the concept Start by declaring an explicit precedence ladder. Example, high to low: parentheses, unary operators, comparisons (==, !=, <, <=, >, >=), containment (IN, CONTAINS), logical AND, logical OR. Associativity is usually left-to-right for binary operators, with exceptions for certain unary/binding forms. Once precedence is defined, choose a parser strategy: layered recursive descent or a Pratt parser.

Layered recursive descent creates one function per precedence level (parse_or, parse_and, parse_comparison, parse_primary). It is easy to read and debug but grows verbose as operators increase. Pratt parsing uses binding powers and parselets; it scales elegantly for extensible operator sets and DSL evolution. For this project, either works; if you plan many operators, Pratt is usually cleaner.
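As a concrete illustration, here is a minimal layered recursive-descent sketch with one function per precedence level, as described above. The token-list input and the tuple-based node shapes are illustrative assumptions, not a prescribed design.

```python
def parse(tokens):
    """Parse a flat token list into a nested-tuple AST (hypothetical shapes)."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_primary():
        if peek() == "(":
            advance()
            node = parse_or()
            if advance() != ")":
                raise SyntaxError("expected ')' to close group")
            return node
        return ("ident", advance())

    def parse_comparison():
        left = parse_primary()
        if peek() in ("==", "!=", "<", "<=", ">", ">="):
            op = advance()
            return ("cmp", op, left, parse_primary())
        return left

    def parse_and():
        # Left-associative: loop, folding each operand into the left subtree.
        left = parse_comparison()
        while peek() == "AND":
            advance()
            left = ("and", left, parse_comparison())
        return left

    def parse_or():
        left = parse_and()
        while peek() == "OR":
            advance()
            left = ("or", left, parse_and())
        return left

    return parse_or()
```

Because parse_or calls down into parse_and, the AND level always binds tighter, so a OR b AND c groups as a OR (b AND c) without any special-casing.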

Grammar design must consider token ambiguity. IN may conflict with identifiers if keyword matching is case-insensitive. Decide your keyword policy early (reserved keywords vs contextual keywords). Reserved keywords simplify parser logic and diagnostics but reduce field-name freedom. Contextual keywords preserve flexibility but raise parser complexity.

Parentheses handling should be explicit in both the grammar and the diagnostics. A missing closing parenthesis is a common authoring error; the diagnostic should point to the opening token and the expected closure. Good error UX includes the nearest unexpected token and a likely fix: “Expected ) before AND”.

Precedence bugs are subtle because many tests still pass. Add intentionally ambiguous fixtures to prove behavior:

  • a OR b AND c should parse as a OR (b AND c).
  • (a OR b) AND c should differ.
  • x == 1 OR y == 2 AND z == 3 should keep AND binding tighter.

Extensibility means operator additions should be local changes. If adding NOT, you should update one precedence layer/parselet and evaluation mapping, not rewrite parser core. This is a strong architecture signal.

Consider data model constraints while designing grammar. If fields can be nested (user.role), tokenizer and parser must support dotted identifiers. If literals include strings/numbers/booleans/lists, grammar must define literal syntax clearly. Avoid “magic” implicit conversions unless necessary.

A mature expression DSL also defines truthiness and null behavior. Example policy:

  • comparisons with a missing field -> false or error?
  • CONTAINS on a non-list -> semantic error.
  • null equality semantics -> explicit rules.

Document these policies before coding to avoid evaluator drift.

In real systems, expression grammar is often customer-facing. Deterministic behavior and clear docs lower support load significantly. This is why precedence is not an academic detail; it is product reliability.

How this fits into the projects

  • Core to this project’s parser and evaluator.
  • Reused in Project 4 rule conditions.
  • Critical for Project 7 conditions and conflict checks.

Definitions & key terms

  • Precedence: ranking that decides operator binding order.
  • Associativity: grouping direction for same-precedence operators.
  • Pratt parser: expression parser based on token binding powers.
  • Parselet: handler for token in Pratt parsing.
  • Ambiguous expression: expression with multiple possible parse trees unless precedence is defined.

Mental model diagram

Expression text
   |
   v
Tokenizer --> Parser (precedence-aware) --> AST
                               |
                           unambiguous tree shape

How it works

  1. Tokenize identifiers, literals, operators, parentheses.
  2. Parse with precedence-aware functions/parselets.
  3. Build AST nodes preserving source spans.
  4. Validate operator/operand compatibility.
  5. Pass AST to evaluator.
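Step 1 above can be sketched with a single regular expression; the token names, keyword set, and reserved-keyword policy here are assumptions for illustration.

```python
import re

TOKEN_RE = re.compile(r"""
    (?P<NUMBER>\d+(?:\.\d+)?)
  | (?P<STRING>"[^"]*")
  | (?P<OP>==|!=|>=|<=|>|<)
  | (?P<LPAREN>\()
  | (?P<RPAREN>\))
  | (?P<WORD>[A-Za-z_][A-Za-z0-9_.]*)   # identifiers, incl. dotted paths
  | (?P<WS>\s+)
""", re.VERBOSE)

KEYWORDS = {"AND", "OR", "NOT", "IN", "CONTAINS"}

def tokenize(text):
    """Return (kind, value, start_offset) triples; offsets become spans."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at offset {pos}")
        kind, value, pos = m.lastgroup, m.group(), m.end()
        if kind == "WS":
            continue
        if kind == "WORD" and value in KEYWORDS:
            kind = value  # reserved-keyword policy: promote WORD to keyword
        tokens.append((kind, value, m.start()))
    return tokens
```

Keeping the start offset on every token (step 3's source spans) is what later makes "point to the opening token" diagnostics possible.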

Minimal concrete example

Input: age >= 21 AND status == "active" OR role == "admin"
AST: OR(
      AND(GTE(age,21), EQ(status,"active")),
      EQ(role,"admin")
    )

Common misconceptions

  • “Parentheses are optional if precedence is obvious.” -> obvious to the author, not always to the reader.
  • “AND and OR can be the same precedence.” -> usually wrong for user expectations.
  • “Adding operators is trivial.” -> can break grammar and evaluator guarantees.

Check-your-understanding questions

  1. Why does a OR b AND c not parse left-to-right?
  2. What changes when you add unary NOT?
  3. Predict AST for (a OR b) AND c.

Check-your-understanding answers

  1. Because AND has higher precedence than OR, so b AND c groups first.
  2. Add unary precedence level/parselet and evaluator branch.
  3. AND(OR(a,b), c).

Real-world applications

  • Search filtering in admin dashboards.
  • Alerting/monitoring condition languages.
  • Feature flag targeting rules.

Where you’ll apply it

  • §3.2 functional requirements 1-4.
  • §5.10 phase 2 parser completion.
  • Also used in Project 4 and Project 7.

References

  • Terence Parr, Language Implementation Patterns, expression parsing.
  • Robert Nystrom, Crafting Interpreters, parsing chapters.
  • Pratt parsing article: https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html

Key insights Expression DSL correctness is mostly a grammar policy problem before it is an evaluation problem.

Summary Clear precedence and associativity rules produce predictable ASTs and make DSL behavior trustworthy.

Homework/Exercises to practice the concept

  1. Write precedence table for all planned operators.
  2. Parse three ambiguous expressions by hand.
  3. Add unary NOT to your grammar draft.

Solutions to the homework/exercises

  1. Keep table explicit and ordered high->low.
  2. Compare ASTs with and without parentheses.
  3. Place NOT above comparison, below primary grouping.

AST Evaluation and Semantic Safety

Fundamentals An AST evaluator executes language semantics independent of surface syntax. It walks tree nodes and computes boolean outcomes against a context record. Evaluator quality depends on deterministic semantics: type handling, missing fields, short-circuit rules, and operator contracts must be explicit. Without clear contracts, the same filter can behave differently across data shapes. For production use, evaluator errors should report the specific node and reason. This project builds the interpreter pattern that later scales into rule engines.

Deep Dive into the concept Design the evaluator as a pure function eval(node, context) -> Value|Error. Purity simplifies testing and keeps behavior deterministic. Each AST node type maps to one evaluation rule. For binary nodes, evaluate the operands and apply the operator's semantics. For logical operators, use short-circuit evaluation: AND stops on a false left operand; OR stops on a true left operand. This improves performance and avoids unnecessary errors from right-branch evaluation when the outcome is already decided.
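A minimal sketch of such a pure evaluator, assuming tuple-based node shapes ("lit"/"field"/"and"/"or"/"cmp") and a dict context; these shapes are illustrative, not prescribed.

```python
import operator

CMP_OPS = {"==": operator.eq, "!=": operator.ne,
           ">": operator.gt, ">=": operator.ge,
           "<": operator.lt, "<=": operator.le}

def eval_node(node, context):
    """Pure evaluation: result depends only on the node and the context."""
    kind = node[0]
    if kind == "lit":
        return node[1]
    if kind == "field":
        return context[node[1]]          # strict policy: KeyError on missing field
    if kind == "and":
        # Short-circuit: the right branch never runs when the left is False.
        return eval_node(node[1], context) and eval_node(node[2], context)
    if kind == "or":
        return eval_node(node[1], context) or eval_node(node[2], context)
    if kind == "cmp":
        _, op, left, right = node
        return CMP_OPS[op](eval_node(left, context),
                           eval_node(right, context))
    raise ValueError(f"unknown node kind: {kind}")
```

Note that short-circuiting is observable here: an AND whose left side is false returns false even if the right side would raise on a missing field.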

Type policy is central. You can allow limited coercion ("21" to 21) or require strict typing. Strict typing is safer and easier to debug; coercion is user-friendly but can hide data quality issues. If coercion is allowed, make rules explicit and symmetrical.

Field resolution should support nested paths (user.role). On missing path segments, choose a policy: return a null sentinel, evaluate comparisons to false, or raise a hard error. For operational DSLs, hard errors are often preferable during authoring, while runtime filtering may prefer false to avoid failures on partial data. Choose one and document it.

Containment operators (IN, CONTAINS) require container semantics. x IN [a,b] checks membership; tags CONTAINS "premium" checks that left operand is collection/string per policy. Invalid operand types should return semantic errors with node span.

Performance considerations: repeated evaluation over many records benefits from precomputed field accessors and operator dispatch tables. But correctness first. Optimize only after deterministic test suite exists.

Diagnostics should include expression segment and reason. Example: “SemanticError at price > "high": numeric comparison expected, got string.” This directly guides fixes.
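Producing such a message from a node span might look like this, assuming spans are (start, end) character offsets into the source text; the exact format is an assumption.

```python
def format_semantic_error(source, span, message, hint):
    """Render a diagnostic that quotes the offending expression segment."""
    start, end = span
    segment = source[start:end]
    return (f'SemanticError at {segment}: {message}\n'
            f'[hint] {hint}')
```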

Security concerns: if expression strings are user-supplied, never compile/evaluate via host eval. Always interpret your own AST. Limit operator/function set to approved safe operations.

Testing strategy: for each operator, create truth table fixtures. Include missing-field cases, nulls, mixed types, and short-circuit behavior. Add metamorphic tests: equivalent expressions should yield same result set.
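The short-circuit cases in particular deserve explicit fixtures. A sketch using Python's own and/or as a stand-in evaluator, with an exception simulating a semantic error in the right branch:

```python
def semantic_error():
    raise ValueError("simulated semantic error")

def and_fixtures():
    # false AND <error> -> false: short-circuit skips the right branch.
    assert (False and semantic_error()) is False
    # true AND <error> -> the error must surface.
    try:
        True and semantic_error()
    except ValueError:
        return "error surfaced"
    return "error swallowed"
```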

This evaluator discipline is essential for business rules in Project 4 and inference systems in Project 7.

How this fits into the projects

  • Direct evaluation engine in this project.
  • Reused for rule condition execution in Project 4.
  • Core semantic executor in Project 7.

Definitions & key terms

  • Interpreter: executes AST semantics directly.
  • Short-circuit: stop evaluating once result is determined.
  • Semantic error: structurally valid expression with invalid meaning/type.
  • Context: record/map evaluated against expression fields.
  • Purity: function behavior depends only on inputs.

Mental model diagram

AST node
  |
  v
eval(node, context)
  |-- logical node -> short-circuit branches
  |-- comparison node -> typed operator
  |-- field node -> resolve path
  '-- literal node -> value

How it works

  1. Dispatch by node type.
  2. Resolve operands recursively.
  3. Apply type checks and operator semantics.
  4. Return bool/value or semantic error with span.
  5. Aggregate results across records.

Minimal concrete example

Expression: tags CONTAINS "premium" AND age >= 21
Record: {tags:["premium"], age:22}
Result: true

Common misconceptions

  • “Parser correctness guarantees evaluator correctness.” -> semantics still need independent tests.
  • “Missing fields should silently be false always.” -> may hide data pipeline bugs.
  • “Short-circuit is just optimization.” -> it changes observable error behavior too.

Check-your-understanding questions

  1. Why avoid host-language eval for DSL execution?
  2. What is the effect of short-circuit on error propagation?
  3. How should evaluator handle age > "ten"?

Check-your-understanding answers

  1. Security and deterministic control over supported semantics.
  2. Right branch may never execute, so some errors are intentionally suppressed.
  3. Return typed semantic error with node span.

Real-world applications

  • Marketing audience segmentation.
  • Fraud rule pre-filters.
  • Access policy conditions.

Where you’ll apply it

  • §3.2 requirements 3-5.
  • §6.2 critical test cases.
  • Also used in Project 7.

References

  • Nystrom, Crafting Interpreters, tree-walk interpreter chapters.
  • Fowler rules-engine notes: https://martinfowler.com/bliki/RulesEngine.html

Key insights A safe evaluator is a deterministic semantic engine, not a string execution shortcut.

Summary AST evaluation requires explicit type, field-resolution, and short-circuit policies to stay reliable.

Homework/Exercises to practice the concept

  1. Define missing-field policy and justify it.
  2. Write truth tables for AND/OR with semantic errors.
  3. Add one custom operator and semantic contract.

Solutions to the homework/exercises

  1. Recommended default: semantic error in strict mode.
  2. false AND error -> false under short-circuit, true AND error -> error.
  3. Example STARTS_WITH(field, string) with string-only operands.

3. Project Specification

3.1 What You Will Build

An expression language engine that parses and evaluates filter conditions over JSON-like records.

Included:

  • comparisons, boolean operators, parentheses.
  • string/number/boolean literals and dotted field access.
  • evaluation engine with deterministic semantics.

Excluded:

  • user-defined functions (extension only).
  • SQL compilation backend.

3.2 Functional Requirements

  1. Tokenize expression text.
  2. Parse to precedence-correct AST.
  3. Evaluate AST against records.
  4. Support == != > >= < <= AND OR IN CONTAINS.
  5. Return actionable parse/semantic errors.

3.3 Non-Functional Requirements

  • Performance: evaluate 100k records with simple expression under 1s baseline.
  • Reliability: deterministic parse tree and evaluation result.
  • Usability: error spans point to exact token/segment.

3.4 Example Usage / Output

expr: status == "active" AND (age >= 21 OR role == "admin")
record: {status:"active", age:19, role:"admin"}
result: true

3.5 Data Formats / Schemas / Protocols

AST Node variants:
- Literal(value)
- Field(path)
- Binary(op,left,right)
- Group(inner)
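One possible Python encoding of these variants as frozen dataclasses; the span field for diagnostics is an addition beyond the list above and is assumed, not required.

```python
from dataclasses import dataclass
from typing import Any

@dataclass(frozen=True)
class Literal:
    value: Any
    span: tuple = (0, 0)  # (start, end) source offsets for diagnostics

@dataclass(frozen=True)
class Field:
    path: str
    span: tuple = (0, 0)

@dataclass(frozen=True)
class Binary:
    op: str
    left: Any
    right: Any
    span: tuple = (0, 0)

@dataclass(frozen=True)
class Group:
    inner: Any
    span: tuple = (0, 0)
```

Frozen nodes keep the AST immutable, which makes snapshot testing and deterministic hashing (as in the parsed_ast_hash output in §3.7.3) straightforward.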

3.6 Edge Cases

  • chained comparisons without boolean operator.
  • mismatched parentheses.
  • unknown operator token.
  • missing fields.
  • invalid type comparisons.

3.7 Real World Outcome

A backend can persist filter strings authored by users and evaluate them consistently against live records.

3.7.1 How to Run (Copy/Paste)

cd project_based_ideas/COMPILERS_RUNTIMES/DOMAIN_SPECIFIC_LANGUAGES_DSL_PROJECTS
make p03-test
./bin/p03-filter-eval --expr-file fixtures/p03_golden.expr --data fixtures/p03_data.json

3.7.2 Golden Path Demo (Deterministic)

Expression and dataset fixtures yield exact expected record IDs.

3.7.3 If CLI: exact terminal transcript

$ ./bin/p03-filter-eval --expr 'status == "active" AND age >= 21' --data fixtures/users.json
[ok] parsed_ast_hash=7b2a91ef
[ok] matched_ids=[u1,u4,u7]
exit=0

$ ./bin/p03-filter-eval --expr 'status == "active" AND (age > )' --data fixtures/users.json
[error] ParseError 1:33 expected literal after operator '>'
[hint] provide a number, string, or field reference
exit=2

4. Solution Architecture

4.1 High-Level Design

Expression text -> Lexer -> Parser -> AST -> Evaluator -> bool per record
                                   |
                               diagnostics

4.2 Key Components

Component Responsibility Key Decisions
Lexer produce expression tokens keyword and operator policy
Parser precedence-correct AST Pratt vs layered recursive descent
Evaluator execute semantics on records strict typing + short-circuit
Diagnostics error reporting span + hint format

4.3 Data Structures (No Full Code)

record EvalResult { ok: bool, value: bool, error?: SemanticError }
record SemanticError { node_span, message, hint }

4.4 Algorithm Overview

Key Algorithm: Pratt-style Expression Parse

  1. Parse prefix token into left expression.
  2. While next token binding power > current power, parse infix.
  3. Build binary node and continue.
  4. Return completed subtree.
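The four steps above can be sketched compactly; the binding-power table and the atoms-only prefix step are illustrative assumptions.

```python
BINDING_POWER = {"OR": 1, "AND": 2,
                 "==": 3, "!=": 3, "<": 3, "<=": 3, ">": 3, ">=": 3}

def parse(tokens):
    def expr(pos, min_bp):
        left = ("atom", tokens[pos])          # step 1: prefix (atoms only here)
        pos += 1
        # step 2: consume infix operators that bind tighter than min_bp
        while pos < len(tokens) and BINDING_POWER.get(tokens[pos], 0) > min_bp:
            op = tokens[pos]
            right, pos = expr(pos + 1, BINDING_POWER[op])
            left = (op, left, right)          # step 3: build binary node
        return left, pos                      # step 4: return completed subtree

    node, _ = expr(0, 0)
    return node
```

Using strict > against min_bp makes equal-precedence operators left-associative; adding an operator is one new entry in the table plus an evaluator branch.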

Complexity Analysis

  • Parse time: O(n) tokens.
  • Eval time: O(m * t), where m is the number of records and t the AST nodes visited per record.

5. Implementation Guide

5.1 Development Environment Setup

mkdir -p bin fixtures tests

5.2 Project Structure

p03-filter-language/
├── src/
│   ├── lexer.*
│   ├── parser.*
│   ├── ast.*
│   ├── evaluator.*
│   └── diagnostics.*
├── fixtures/
└── tests/

5.3 The Core Question You’re Answering

“How do I turn human-readable boolean logic into an unambiguous tree and evaluate it safely?”

5.4 Concepts You Must Understand First

  1. Operator precedence and associativity.
  2. AST structure and tree traversal.
  3. Type-safe evaluator design.
  4. Error spans and hinting.

5.5 Questions to Guide Your Design

  1. Which parser strategy best supports future operators?
  2. Should missing field be false or error?
  3. How do you represent dotted path resolution?
  4. How do you test short-circuit behavior explicitly?

5.6 Thinking Exercise

Manually parse: a OR b AND c OR d

Then evaluate against contexts where a=true, b=false, c=error, d=false and explain short-circuit outcomes.

5.7 The Interview Questions They’ll Ask

  1. Pratt parser vs recursive descent differences?
  2. Why are precedence bugs hard to detect?
  3. How to ensure evaluator safety for untrusted expressions?
  4. What data structures improve evaluation throughput?
  5. How would you add functions to the language?

5.8 Hints in Layers

Hint 1: Implement parser for comparisons first.

Hint 2: Add boolean operators with precedence tests.

Hint 3: Build AST pretty-printer before evaluator.

Hint 4: Add semantic strict mode switch (strict=true/false).

5.9 Books That Will Help

Topic Book Chapter
Expression parsing Language Implementation Patterns expression chapters
Interpreters Crafting Interpreters AST/evaluator chapters
DSL semantics Domain Specific Languages execution semantics

5.10 Implementation Phases

Phase 1: Foundation (2-4 hours)

  • lexer + literals + comparisons.
  • parse/evaluate simple field == value.

Checkpoint: minimal expression fixture passes.

Phase 2: Core Functionality (6-10 hours)

  • boolean operators and precedence.
  • dotted fields and containment ops.

Checkpoint: ambiguous precedence fixtures pass.

Phase 3: Polish & Edge Cases (4-8 hours)

  • semantic diagnostics and strict mode.
  • benchmark and deterministic AST snapshots.

Checkpoint: golden + failure suites stable in CI.

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
Parser strategy layered RD / Pratt Pratt scalable operator growth
Missing field policy false / null / error strict error default avoids silent data bugs
Evaluation mode interpret only / compile bytecode interpret first fastest path to correctness

6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Parser tests AST shape correctness precedence and grouping
Evaluator tests semantic correctness type checks, short-circuit
Integration tests end-to-end filtering expression + dataset fixtures

6.2 Critical Test Cases

  1. a OR b AND c parse shape assertion.
  2. strict-mode missing field error.
  3. CONTAINS semantics on list vs scalar.
  4. short-circuit avoids right-side error evaluation.

6.3 Test Data

fixtures/p03_golden.expr
fixtures/p03_precedence.expr
fixtures/p03_bad_syntax.expr
fixtures/p03_dataset.json

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
precedence not encoded wrong matches explicit precedence tests
coercion hidden surprising comparisons strict typing with explicit conversion ops
parser/evaluator coupled hard to extend clean AST boundary

7.2 Debugging Strategies

  • AST pretty-print every parse test.
  • evaluation trace mode that prints node results.

7.3 Performance Traps

Repeated path resolution can be expensive; cache field accessor paths per AST node.
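A sketch of that caching idea: split each dotted path once, compile a closure, and reuse it across records. Names are assumptions, and returning None for a missing segment conflates "absent" with a stored null, which is a deliberate simplification here.

```python
def compile_accessor(path):
    segments = path.split(".")  # pay the split cost once, not per record
    def access(record):
        current = record
        for seg in segments:
            current = current.get(seg) if isinstance(current, dict) else None
            if current is None:
                return None  # simplification: missing and null look the same
        return current
    return access

accessor_cache = {}

def get_accessor(path):
    """Return a cached accessor so 100k-record scans reuse one closure."""
    if path not in accessor_cache:
        accessor_cache[path] = compile_accessor(path)
    return accessor_cache[path]
```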


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add unary NOT.
  • Add case-insensitive string comparisons.

8.2 Intermediate Extensions

  • Add function calls (len(tags) > 2).
  • compile AST to bytecode-like instruction list.

8.3 Advanced Extensions

  • expression optimizer (constant folding).
  • partial evaluation with index prefilters.

9. Real-World Connections

9.1 Industry Applications

  • CRM segmentation rules.
  • Alert policy filters.
  • Access-control decision expressions.
  • CEL (Common Expression Language): https://cel.dev/
  • JMESPath: https://jmespath.org/

9.2 Interview Relevance

  • expression parser implementation.
  • interpreter safety and determinism.

10. Resources

10.1 Essential Reading

  • Nystrom, Crafting Interpreters.
  • Parr, Language Implementation Patterns.

10.2 Video Resources

  • Pratt parser walkthroughs.
  • AST interpreter architecture talks.

10.3 Tools & Documentation

  • grammar playground tooling.
  • snapshot test frameworks.

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain precedence and associativity clearly.
  • I can trace AST evaluation on paper.
  • I can justify missing-field and type policies.

11.2 Implementation

  • precedence tests pass.
  • semantic diagnostics include spans and hints.
  • deterministic integration fixtures pass.

11.3 Growth

  • I can add one new operator safely.
  • I can discuss parser strategy tradeoffs.
  • I can map this project to business-rule systems.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • parse and evaluate comparisons with AND/OR.
  • parentheses support and two failure-path diagnostics.

Full Completion:

  • all minimum plus containment operators, strict semantic checks, deterministic fixtures.

Excellence (Going Above & Beyond):

  • function/operator extensibility architecture with benchmarks.

13. Additional Content Rules (Applied)

13.1 Determinism

Use fixed dataset and expected matched IDs snapshot.

13.2 Outcome Completeness

Provide both successful evaluation and parse failure demos with explicit exit codes.

13.3 Cross-Linking

The evaluator and parser are prerequisites for Project 4 and Project 7.

13.4 No Placeholder Text

All sections include concrete decisions and outputs.