Project 2: Configuration File Parser (External DSL - Simple)
Build an external DSL parser for an INI/TOML-like config language with typed values, sections, arrays, and actionable diagnostics.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | Weekend (8-14 hours) |
| Main Programming Language | C |
| Alternative Programming Languages | Rust, Go, Python |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 3. The “Service & Support” Model |
| Prerequisites | Strings, enums, data structures, basic recursion |
| Key Topics | Lexing, recursive descent parsing, typed literals, parse errors |
1. Learning Objectives
By completing this project, you will:
- Convert raw config text into tokens with full line/column tracking.
- Parse tokens into a deterministic typed configuration model.
- Separate lexical, syntactic, and semantic errors cleanly.
- Implement deterministic error messages with snippet context.
- Build foundation skills for expression parsers in Project 3.
2. All Theory Needed (Per-Concept Breakdown)
Lexical Analysis for Human-Written Config Files
Fundamentals
Lexing is the first stage in an external DSL pipeline. It turns a character stream into tokens that the parser can reason about. For config languages, lexing seems simple until real files appear: comments, empty lines, escaped quotes, arrays, trailing spaces, and mixed newline conventions. Good lexers preserve source location for every token because later diagnostics depend on it. A token stream should remove irrelevant trivia (most whitespace) while preserving structural boundaries (newlines when the grammar depends on them). In config DSLs, readability and error quality matter more than raw performance. The lexer should therefore optimize for clarity, deterministic behavior, and robust edge-case handling.
Deep Dive into the concept
A practical lexer for config DSLs is often a single-pass finite-state machine with explicit modes. Typical modes: default, string literal, comment, and maybe number scanning. In default mode, each character is classified into one of: delimiter ([ ] = ,), identifier start, digit, quote, comment marker, whitespace, newline, or invalid. Delimiters become single tokens. Identifiers and numbers consume runs. String mode consumes until unescaped quote and tracks escapes. Comment mode consumes until newline, then returns to default.
Location tracking must be first-class. Maintain line, column, and offset counters. Each emitted token stores start and end position. When consuming multi-character tokens, capture start before advancing. Column resets on newline. This enables diagnostics like: “Unexpected token ] at line 14, col 9” plus caret display.
Token design should preserve semantic intent. Recommended token classes: SECTION_OPEN, SECTION_CLOSE, IDENT, EQUALS, STRING, NUMBER, BOOLEAN, COMMA, ARRAY_OPEN, ARRAY_CLOSE, NEWLINE, EOF. You can classify booleans in lexing (true/false) or defer to parser/semantic stage. Early classification simplifies parser branches but reduces flexibility if keywords evolve.
Whitespace policy is critical. Many config grammars ignore spaces around = and commas, but newline boundaries still matter for statement termination. Decide whether newlines are explicit tokens. If yes, parser can enforce one assignment per line and produce better errors around accidental token bleeding. If no, parser complexity rises for line-based semantics.
Error handling in lexing should be specific. Distinguish unterminated string, invalid escape, invalid numeric literal, and unknown character. Each has different fix guidance. Example: unterminated string should point at line start of string, not file end. Invalid escape should point to the escape sequence itself.
Determinism matters even in lexing. Given identical bytes, token sequence must be identical regardless of platform newline differences. Normalize \r\n and \r to \n in the scanner input path. This prevents test drift across OS environments.
Testing lexers should include “golden token streams”: deterministic fixtures where expected token list is stored and diffed. Add adversarial fixtures: empty file, huge line, comments only, malformed section header, nested array-like junk. Keep lexing and parsing tests separate so failures localize quickly.
This lexer mindset scales to template engines and full language tooling. In Project 6, you will reuse mode-based lexing for text/code switching. The discipline learned here pays forward.
How this fits into the projects
- Used directly in this project to tokenize config source.
- Reused in Project 3 for operator/token recognition.
- Reused in Project 6 with lexer modes.
Definitions & key terms
- Token: typed unit produced by scanner.
- Lexeme: raw text slice represented by a token.
- FSM: finite-state machine controlling scanner transitions.
- Trivia: whitespace/comments not used for semantics.
- Source span: start/end location metadata.
Mental model diagram
Raw text --> Scanner FSM --> Token stream (+line/col spans) --> Parser
(the scanner switches modes; the emitted token stream is deterministic)
How it works
- Normalize newlines.
- Iterate char-by-char with mode state.
- Emit tokens with source spans.
- Skip trivia according to policy.
- Emit EOF token.
- Fail on lexical violations with pointed diagnostics.
Minimal concrete example
Input:
[server]
port = 8080
Tokens:
SECTION_OPEN IDENT("server") SECTION_CLOSE NEWLINE
IDENT("port") EQUALS NUMBER(8080) NEWLINE EOF
Common misconceptions
- “Whitespace is always irrelevant.” -> Not if lines define statement boundaries.
- “Parser can handle bad characters later.” -> Unknown chars should fail in lexer.
- “Line numbers are optional.” -> They are mandatory for usable diagnostics.
Check-your-understanding questions
- Why normalize newline conventions before scanning?
- What error should be emitted for `name = "abc`?
- Predict the tokenization of `debug = true`.
Check-your-understanding answers
- To keep token spans deterministic across OSes.
- Unterminated string literal with start location.
- IDENT(debug) EQUALS BOOLEAN(true).
Real-world applications
- Service config loaders.
- Deployment manifest validators.
- Feature-flag rule file scanners.
Where you’ll apply it
- §3.2 requirements 1 and 2.
- §5.10 Phase 1 scanner milestone.
- Also used in Project 6.
References
- Robert Nystrom, Crafting Interpreters, Ch. 4.
- Cooper & Torczon, Engineering a Compiler, lexer foundations.
- TOML spec: https://toml.io/en/
Key insights Good lexers are small, explicit state machines with precise location metadata.
Summary Lexical rigor determines whether downstream parser logic stays simple and diagnostics stay useful.
Homework/Exercises to practice the concept
- Define token classes for your grammar.
- Design error format with line/column caret.
- Create 5 malformed lexer fixtures.
Solutions to the homework/exercises
- Include delimiter, literal, identifier, newline, EOF.
- Use `line:col`, the snippet line, and a caret marker.
- Include invalid char, bad escape, unterminated string, malformed number, and junk after section close.
Recursive Descent Parsing and Typed Value Semantics
Fundamentals
Parsing turns token sequences into structured meaning according to grammar rules. Recursive descent parsing is a straightforward method where each grammar rule maps to a function. For config DSLs, this approach is ideal because grammar depth is modest and readability is high. Parsing alone is not enough; typed semantics matter. A string "8080" and number 8080 must stay distinct. Similarly, arrays should enforce element typing policy and preserve ordering. Separating parse stage (structure) and semantic stage (type validation/coercion) keeps logic maintainable and error messages precise.
Deep Dive into the concept
Start by expressing the grammar in EBNF:
- file := statement*
- statement := section_header | assignment
- section_header := '[' ident ']'
- assignment := ident '=' value
- value := string | number | boolean | array
- array := '[' value (',' value)* ']'
Recursive descent maps each nonterminal to a parse function. The parser consumes current token and advances with expectation helpers. expect(token_type) is core: if token mismatches, emit syntax error with expected/actual info and span.
State management for sections matters. Config assignment semantics usually depend on current section. Maintain parser context current_section. On section header parse, update context. On assignment parse, store key under current section or root scope.
Typed values require clear representation. Use tagged unions (sum types) like Value::String, Value::Int, Value::Bool, Value::Array. Arrays can be heterogeneous or homogeneous; choose explicitly. Homogeneous arrays simplify downstream usage and validation but are stricter. Heterogeneous arrays are flexible but harder to reason about. For this project, either is acceptable if documented.
Error recovery strategy affects user experience. Strict parsers abort on first syntax error; forgiving parsers synchronize and continue to report additional issues. For beginner-intermediate scope, implement at least basic synchronization: on assignment failure, skip tokens until newline or section boundary. This lets users fix multiple mistakes per run.
Duplicate keys and section conflicts are semantic concerns. Decide policy: fail on duplicates (recommended for safety), allow override with warning, or keep last value silently (not recommended). Explicit policy prevents configuration drift and deployment surprises.
Deterministic parse output includes deterministic key ordering strategy for serialized output. Even if map structures are unordered, maintain insertion-ordered output for stable round-trip and diff quality.
Testing parser quality requires both positive and negative fixtures. Positive fixtures should include nested arrays and mixed sections. Negative fixtures should target one failure type each (missing ], missing =, invalid boolean literal, duplicate key). Add round-trip tests: parse -> normalize -> serialize -> parse should produce equivalent model.
This parser architecture is the bridge to expression languages. In Project 3, grammar complexity rises due to precedence rules, but the same parse discipline and error interfaces apply.
How this fits into the projects
- Central in this project for syntax and typed config model.
- Prepares directly for Project 3.
- Semantic policy mindset reused in Project 4.
Definitions & key terms
- Recursive descent: parsing strategy mapping grammar rules to call graph.
- Tagged union: value representation carrying variant + payload.
- Synchronization: parser recovery after error.
- Semantic validation: checks beyond grammar correctness.
- Round-trip equivalence: parsing, serializing, then re-parsing yields the same structure.
Mental model diagram
Tokens --> Parse rules --> AST/Model --> Semantic checks --> Typed config object
(parse rules emit syntax errors; semantic checks emit duplicate/type errors)
How it works
- Initialize parser at token index 0.
- Parse statements until EOF.
- For each statement, dispatch by lookahead token.
- Build typed value nodes.
- Apply semantic checks (duplicates, type policy).
- Emit normalized config model.
Minimal concrete example
Config:
[db]
max_connections = 10
Model:
section "db": { "max_connections": Int(10) }
Common misconceptions
- “AST is optional for small configs.” -> structured models simplify validation and serialization.
- “Type coercion should be automatic and broad.” -> over-coercion hides mistakes.
- “First error is enough.” -> multi-error reporting improves fix velocity.
Check-your-understanding questions
- Why keep parse and semantic validation separate?
- What should happen on duplicate key in same section?
- Predict parser behavior when an array has a trailing comma but trailing commas are disallowed.
Check-your-understanding answers
- It isolates syntax concerns from domain policy and improves maintainability.
- Recommended: fail with explicit duplicate key diagnostic.
- It should report syntax error at comma/closing bracket boundary.
Real-world applications
- Envoy/nginx-like config tooling.
- Infrastructure manifest validation.
- CI pipeline config loaders.
Where you’ll apply it
- §3.2 requirements 2-5.
- §6.2 critical tests.
- Also used in Project 3.
References
- Nystrom, Crafting Interpreters, Ch. 6.
- Terence Parr, Language Implementation Patterns.
- TOML specification grammar sections.
Key insights Parser simplicity comes from explicit grammar and strict data model decisions.
Summary Recursive descent provides transparent control over grammar, diagnostics, and semantic checks for external DSLs.
Homework/Exercises to practice the concept
- Write EBNF for section + assignment + array grammar.
- Decide duplicate-key policy and justify it.
- Design a sync strategy after assignment errors.
Solutions to the homework/exercises
- Use nonterminals from this section and validate with sample inputs.
- Fail-fast duplicate policy is safest for operations.
- Skip to newline or next section header token.
3. Project Specification
3.1 What You Will Build
A standalone parser tool/library that reads config files, validates syntax and semantics, and emits a typed normalized representation.
Included:
- section headers and key-value assignments.
- scalar values (string, number, boolean) and arrays.
- deterministic diagnostics with source spans.
Excluded:
- full TOML compatibility.
- datetime/inline table support.
- schema language (optional extension only).
3.2 Functional Requirements
- Tokenize source with line/column tracking.
- Parse statements into typed config model.
- Support sections, assignments, arrays, comments.
- Detect malformed syntax and duplicate keys.
- Provide normalized output representation.
- Round-trip fixtures must stay semantically equivalent.
3.3 Non-Functional Requirements
- Performance: parse 1 MB config under 150 ms on reference machine.
- Reliability: deterministic errors and normalized output across platforms.
- Usability: diagnostics include snippet and expected token context.
3.4 Example Usage / Output
$ ./bin/p02-config-parse fixtures/app.conf
[ok] Parsed file: fixtures/app.conf
[ok] Sections: server, database
[ok] server.port = Int(8080)
[ok] server.debug = Bool(true)
exit=0
3.5 Data Formats / Schemas / Protocols
ConfigModel {
sections: map<string, Section>
}
Section {
entries: ordered_map<string, Value>
}
Value = String | Int | Float | Bool | Array<Value>
3.6 Edge Cases
- section header missing closing bracket.
- assignment without value.
- unterminated string.
- duplicate keys in same section.
- mixed array types if homogeneous policy chosen.
3.7 Real World Outcome
You can confidently load human-authored configs, reject malformed files with specific guidance, and expose typed values to other systems.
3.7.1 How to Run (Copy/Paste)
cd project_based_ideas/COMPILERS_RUNTIMES/DOMAIN_SPECIFIC_LANGUAGES_DSL_PROJECTS
make p02-test
./bin/p02-config-parse fixtures/p02_golden.conf
3.7.2 Golden Path Demo (Deterministic)
Fixed fixture p02_golden.conf yields a stable normalized JSON-equivalent dump.
3.7.3 If CLI: exact terminal transcript
$ ./bin/p02-config-parse fixtures/p02_golden.conf
[ok] token_count=43
[ok] section_count=2
[ok] key_count=7
[ok] normalized_hash=5a33e196
exit=0
$ ./bin/p02-config-parse fixtures/p02_bad_unterminated.conf
[error] LexError at 6:14: unterminated string literal
[hint] close string with a matching quote (") before end-of-line
exit=2
4. Solution Architecture
4.1 High-Level Design
Source bytes --> Lexer --> Token stream --> Parser --> Config model --> Normalizer
(the lexer emits lexical errors; the parser emits syntax and semantic errors)
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Scanner | Produce tokens + spans | explicit FSM modes |
| Parser | Build model from tokens | recursive descent with lookahead |
| Semantic checker | duplicate/type policy | strict defaults |
| Normalizer | deterministic output | insertion order preservation |
4.3 Data Structures (No Full Code)
enum TokenType { SECTION_OPEN, SECTION_CLOSE, IDENT, EQUALS, STRING, NUMBER, BOOLEAN, COMMA, ARRAY_OPEN, ARRAY_CLOSE, NEWLINE, EOF }
record Token { type, lexeme, line, col }
record ParseError { kind, expected, found, span, hint }
4.4 Algorithm Overview
Key Algorithm: Statement Parser Loop
- Read next token.
- If section header token, parse section and set context.
- Else parse assignment into current context.
- On error, synchronize to newline/section start.
- Continue until EOF.
Complexity Analysis
- Time: O(n) tokens.
- Space: O(n) for token buffer + model.
5. Implementation Guide
5.1 Development Environment Setup
mkdir -p bin fixtures tests
5.2 Project Structure
p02-config-parser/
├── src/
│ ├── lexer.*
│ ├── parser.*
│ ├── model.*
│ └── diagnostics.*
├── fixtures/
├── tests/
└── README.md
5.3 The Core Question You’re Answering
“How do I transform human-readable configuration text into a deterministic typed model with useful errors?”
5.4 Concepts You Must Understand First
- Tokenization with source spans.
- Recursive descent parse control flow.
- Type representation via tagged unions.
- Error recovery and synchronization.
5.5 Questions to Guide Your Design
- Should booleans be tokenized or interpreted in parser?
- Where should duplicate key checks live?
- How will you keep output deterministic?
- Which errors should halt immediately vs recover?
5.6 Thinking Exercise
Given:
[server]
port = 8080
debug = tru
Draw token stream and identify where lexical vs semantic failure should trigger.
5.7 The Interview Questions They’ll Ask
- Difference between lexer and parser responsibilities?
- Why use recursive descent for small DSLs?
- How do you design parse errors users can fix quickly?
- Strict vs permissive parsing tradeoffs?
- How does this project prepare for expression grammars?
5.8 Hints in Layers
Hint 1: Build token dump command first.
Hint 2: Parse only section + string values, then add types.
Hint 3: Add synchronization before adding multi-error reporting.
Hint 4: Keep one fixture per error type.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Lexing | Crafting Interpreters | Ch. 4 |
| Parsing | Crafting Interpreters | Ch. 6 |
| Error handling | Language Implementation Patterns | parser diagnostics chapters |
5.10 Implementation Phases
Phase 1: Foundation (3-4 hours)
- Lexer with spans and token dump.
- Parser for sections + string assignments.
Checkpoint: parse minimal fixture successfully.
Phase 2: Core Functionality (4-6 hours)
- typed values and arrays.
- duplicate-key policy and semantic checks.
Checkpoint: all golden fixtures pass.
Phase 3: Polish & Edge Cases (2-4 hours)
- recovery-based error reporting.
- deterministic normalization and round-trip tests.
Checkpoint: failure fixtures emit expected diagnostics exactly.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Newline handling | tokenized / ignored | tokenized | cleaner statement boundaries |
| Duplicate key policy | override / warn / fail | fail | safer operational behavior |
| Array typing | heterogeneous / homogeneous | configurable with strict default | practical but safe |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Lexer tests | token correctness | comments, escapes, numbers |
| Parser tests | grammar correctness | sections, arrays, assignments |
| Semantic tests | policy correctness | duplicate keys, type rules |
6.2 Critical Test Cases
- Happy path config with all value types.
- Unterminated string lexical error.
- Missing section bracket syntax error.
- Duplicate key semantic error.
- Round-trip equivalence fixture.
6.3 Test Data
fixtures/p02_golden.conf
fixtures/p02_bad_unterminated.conf
fixtures/p02_bad_duplicate_key.conf
fixtures/p02_roundtrip.conf
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| lexer without spans | useless parser errors | attach line/col to every token |
| parser doing coercion everywhere | tangled logic | keep parse and semantic phases separate |
| inconsistent newline handling | platform-only failures | normalize newlines pre-scan |
7.2 Debugging Strategies
- Log tokens before parse for failing fixtures.
- Print parser rule entry/exit with token index for hard failures.
7.3 Performance Traps
Repeated substring allocations in scanner loops can dominate runtime; slice by offsets and materialize lazily when possible.
8. Extensions & Challenges
8.1 Beginner Extensions
- add inline comments after values.
- add `null` literal support.
8.2 Intermediate Extensions
- schema validator for required keys and types.
- include/import directive with cycle detection.
8.3 Advanced Extensions
- incremental parser for live editor feedback.
- migration formatter preserving comments and order.
9. Real-World Connections
9.1 Industry Applications
- Service runtime configs.
- Infrastructure-as-code tooling.
9.2 Related Open Source Projects
- TOML: https://toml.io/en/ - config format spec.
- serde + toml-rs: https://github.com/toml-rs/toml - production parser ecosystem.
9.3 Interview Relevance
- compiler front-end fundamentals.
- parsing architecture and diagnostics.
10. Resources
10.1 Essential Reading
- Nystrom, Crafting Interpreters (scanning + parsing).
- Parr, Language Implementation Patterns.
10.2 Video Resources
- Parser construction walkthroughs (recursive descent).
- Talks on config language design tradeoffs.
10.3 Tools & Documentation
- `jq` for normalized output diffing.
- official TOML grammar for comparison.
10.4 Related Projects in This Series
- Project 3: expression parser (reuses this tokenization and parse discipline).
- Project 6: template engine (reuses mode-based lexing).
11. Self-Assessment Checklist
11.1 Understanding
- I can explain each front-end phase (lexing, parsing, semantics).
- I can justify my duplicate-key policy.
- I can trace parser control flow for one fixture.
11.2 Implementation
- parser handles supported grammar.
- deterministic diagnostics pass snapshot tests.
- round-trip fixture passes.
11.3 Growth
- I identified one grammar extension and impact.
- I can explain recovery strategy tradeoffs.
- I can map this to expression DSL design.
12. Submission / Completion Criteria
Minimum Viable Completion:
- scanner + parser for sections and assignments.
- typed values and one failure-path diagnostic.
Full Completion:
- all minimum plus arrays, semantic checks, deterministic tests.
Excellence (Going Above & Beyond):
- error recovery with multiple diagnostics per run.
- schema validation extension and benchmark notes.
13. Additional Content Rules (Applied)
13.1 Determinism
Use fixed fixtures and stable normalization order; assert output hash.
13.2 Outcome Completeness
Document both success and failure transcripts with explicit exit codes.
13.3 Cross-Linking
Concepts here feed directly into Project 3 and Project 6.
13.4 No Placeholder Text
All sections are concrete and directly actionable.