Project 2: Configuration File Parser (External DSL - Simple)
Build an external DSL parser for an INI/TOML-like config language with typed values, sections, arrays, and actionable diagnostics.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate |
| Time Estimate | Weekend (8-14 hours) |
| Main Programming Language | C |
| Alternative Programming Languages | Rust, Go, Python |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 3. The “Service & Support” Model |
| Prerequisites | Strings, enums, data structures, basic recursion |
| Key Topics | Lexing, recursive descent parsing, typed literals, parse errors |
1. Learning Objectives
By completing this project, you will:
- Convert raw config text into tokens with full line/column tracking.
- Parse tokens into a deterministic typed configuration model.
- Separate lexical, syntactic, and semantic errors cleanly.
- Implement deterministic error messages with snippet context.
- Build foundation skills for expression parsers in Project 3.
2. All Theory Needed (Per-Concept Breakdown)
Lexical Analysis for Human-Written Config Files
Fundamentals
Lexing is the first stage in an external DSL pipeline. It turns a character stream into tokens that the parser can reason about. For config languages, lexing seems simple until real files appear: comments, empty lines, escaped quotes, arrays, trailing spaces, and mixed newline conventions. Good lexers preserve source location for every token because later diagnostics depend on it. A token stream should remove irrelevant trivia (most whitespace) while preserving structural boundaries (newlines when the grammar depends on them). In config DSLs, readability and error quality matter more than raw performance. The lexer should therefore optimize for clarity, deterministic behavior, and robust edge-case handling.
Deep Dive into the concept
A practical lexer for config DSLs is often a single-pass finite-state machine with explicit modes. Typical modes: default, string literal, comment, and maybe number scanning. In default mode, each character is classified into one of: delimiter ([ ] = ,), identifier start, digit, quote, comment marker, whitespace, newline, or invalid. Delimiters become single tokens. Identifiers and numbers consume runs. String mode consumes until unescaped quote and tracks escapes. Comment mode consumes until newline, then returns to default.
Location tracking must be first-class. Maintain line, column, and offset counters. Each emitted token stores start and end position. When consuming multi-character tokens, capture start before advancing. Column resets on newline. This enables diagnostics like: “Unexpected token ] at line 14, col 9” plus caret display.
Token design should preserve semantic intent. Recommended token classes: SECTION_OPEN, SECTION_CLOSE, IDENT, EQUALS, STRING, NUMBER, BOOLEAN, COMMA, ARRAY_OPEN, ARRAY_CLOSE, NEWLINE, EOF. You can classify booleans in lexing (true/false) or defer to parser/semantic stage. Early classification simplifies parser branches but reduces flexibility if keywords evolve.
Whitespace policy is critical. Many config grammars ignore spaces around = and commas, but newline boundaries still matter for statement termination. Decide whether newlines are explicit tokens. If yes, parser can enforce one assignment per line and produce better errors around accidental token bleeding. If no, parser complexity rises for line-based semantics.
Error handling in lexing should be specific. Distinguish unterminated string, invalid escape, invalid numeric literal, and unknown character. Each has different fix guidance. Example: unterminated string should point at line start of string, not file end. Invalid escape should point to the escape sequence itself.
Determinism matters even in lexing. Given identical bytes, token sequence must be identical regardless of platform newline differences. Normalize \r\n and \r to \n in the scanner input path. This prevents test drift across OS environments.
Testing lexers should include “golden token streams”: deterministic fixtures where expected token list is stored and diffed. Add adversarial fixtures: empty file, huge line, comments only, malformed section header, nested array-like junk. Keep lexing and parsing tests separate so failures localize quickly.
This lexer mindset scales to template engines and full language tooling. In Project 6, you will reuse mode-based lexing for text/code switching. The discipline learned here pays forward.
How this fits into the projects
- Used directly in this project to tokenize config source.
- Reused in Project 3 for operator/token recognition.
- Reused in Project 6 with lexer modes.
Definitions & key terms
- Token: typed unit produced by scanner.
- Lexeme: raw text slice represented by a token.
- FSM: finite-state machine controlling scanner transitions.
- Trivia: whitespace/comments not used for semantics.
- Source span: start/end location metadata.
Mental model diagram
Raw text --> Scanner FSM --> Token stream (+line/col spans) --> Parser
(the scanner switches modes; the emitted token stream is deterministic)
How it works
- Normalize newlines.
- Iterate char-by-char with mode state.
- Emit tokens with source spans.
- Skip trivia according to policy.
- Emit EOF token.
- Fail on lexical violations with pointed diagnostics.
Minimal concrete example
Input:
[server]
port = 8080
Tokens:
SECTION_OPEN IDENT("server") SECTION_CLOSE NEWLINE
IDENT("port") EQUALS NUMBER(8080) NEWLINE EOF
Common misconceptions
- “Whitespace is always irrelevant.” -> Not if lines define statement boundaries.
- “Parser can handle bad characters later.” -> Unknown chars should fail in lexer.
- “Line numbers are optional.” -> They are mandatory for usable diagnostics.
Check-your-understanding questions
- Why normalize newline conventions before scanning?
- What error should be emitted for `name = "abc`?
- Predict the tokenization of `debug = true`.
Check-your-understanding answers
- To keep token spans deterministic across OSes.
- Unterminated string literal with start location.
- IDENT(debug) EQUALS BOOLEAN(true).
Real-world applications
- Service config loaders.
- Deployment manifest validators.
- Feature-flag rule file scanners.
Where you’ll apply it
- §3.2 requirements 1 and 2.
- §5.10 Phase 1 scanner milestone.
- Also used in Project 6.
References
- Robert Nystrom, Crafting Interpreters, Ch. 4.
- Cooper & Torczon, Engineering a Compiler, lexer foundations.
- TOML spec: https://toml.io/en/
Key insights Good lexers are small, explicit state machines with precise location metadata.
Summary Lexical rigor determines whether downstream parser logic stays simple and diagnostics stay useful.
Homework/Exercises to practice the concept
- Define token classes for your grammar.
- Design error format with line/column caret.
- Create 5 malformed lexer fixtures.
Solutions to the homework/exercises
- Include delimiter, literal, identifier, newline, EOF.
- Use `line:col`, the snippet line, and a caret marker.
- Include invalid char, bad escape, unterminated string, malformed number, and junk after section close.
Recursive Descent Parsing and Typed Value Semantics
Fundamentals
Parsing turns token sequences into structured meaning according to grammar rules. Recursive descent parsing is a straightforward method where each grammar rule maps to a function. For config DSLs, this approach is ideal because grammar depth is modest and readability is high. Parsing alone is not enough; typed semantics matter. A string "8080" and number 8080 must stay distinct. Similarly, arrays should enforce element typing policy and preserve ordering. Separating parse stage (structure) and semantic stage (type validation/coercion) keeps logic maintainable and error messages precise.
Deep Dive into the concept
Start by expressing the grammar in EBNF:
- file := statement*
- statement := section_header | assignment
- section_header := '[' ident ']'
- assignment := ident '=' value
- value := string | number | boolean | array
- array := '[' value (',' value)* ']'
Recursive descent maps each nonterminal to a parse function. The parser consumes current token and advances with expectation helpers. expect(token_type) is core: if token mismatches, emit syntax error with expected/actual info and span.
State management for sections matters. Config assignment semantics usually depend on current section. Maintain parser context current_section. On section header parse, update context. On assignment parse, store key under current section or root scope.
Typed values require clear representation. Use tagged unions (sum types) like Value::String, Value::Int, Value::Bool, Value::Array. Arrays can be heterogeneous or homogeneous; choose explicitly. Homogeneous arrays simplify downstream usage and validation but are stricter. Heterogeneous arrays are flexible but harder to reason about. For this project, either is acceptable if documented.
Error recovery strategy affects user experience. Strict parsers abort on first syntax error; forgiving parsers synchronize and continue to report additional issues. For beginner-intermediate scope, implement at least basic synchronization: on assignment failure, skip tokens until newline or section boundary. This lets users fix multiple mistakes per run.
Duplicate keys and section conflicts are semantic concerns. Decide policy: fail on duplicates (recommended for safety), allow override with warning, or keep last value silently (not recommended). Explicit policy prevents configuration drift and deployment surprises.
Deterministic parse output includes deterministic key ordering strategy for serialized output. Even if map structures are unordered, maintain insertion-ordered output for stable round-trip and diff quality.
Testing parser quality requires both positive and negative fixtures. Positive fixtures should include nested arrays and mixed sections. Negative fixtures should target one failure type each (missing ], missing =, invalid boolean literal, duplicate key). Add round-trip tests: parse -> normalize -> serialize -> parse should produce equivalent model.
This parser architecture is the bridge to expression languages. In Project 3, grammar complexity rises due to precedence rules, but the same parse discipline and error interfaces apply.
How this fits into the projects
- Central in this project for syntax and typed config model.
- Prepares directly for Project 3.
- Semantic policy mindset reused in Project 4.
Definitions & key terms
- Recursive descent: parsing strategy mapping grammar rules to call graph.
- Tagged union: value representation carrying variant + payload.
- Synchronization: parser recovery after error.
- Semantic validation: checks beyond grammar correctness.
- Round-trip equivalence: parsing, serializing, then re-parsing yields the same structure.
Mental model diagram
Tokens --> Parse rules --> AST/Model --> Semantic checks --> Typed config object
(parse rules emit syntax errors; semantic checks emit duplicate/type errors)
How it works
- Initialize parser at token index 0.
- Parse statements until EOF.
- For each statement, dispatch by lookahead token.
- Build typed value nodes.
- Apply semantic checks (duplicates, type policy).
- Emit normalized config model.
Minimal concrete example
Config:
[db]
max_connections = 10
Model:
section "db": { "max_connections": Int(10) }
Common misconceptions
- “AST is optional for small configs.” -> structured models simplify validation and serialization.
- “Type coercion should be automatic and broad.” -> over-coercion hides mistakes.
- “First error is enough.” -> multi-error reporting improves fix velocity.
Check-your-understanding questions
- Why keep parse and semantic validation separate?
- What should happen on duplicate key in same section?
- Predict parser behavior when an array has a trailing comma but trailing commas are disallowed.
Check-your-understanding answers
- It isolates syntax concerns from domain policy and improves maintainability.
- Recommended: fail with explicit duplicate key diagnostic.
- It should report syntax error at comma/closing bracket boundary.
Real-world applications
- Envoy/nginx-like config tooling.
- Infrastructure manifest validation.
- CI pipeline config loaders.
Where you’ll apply it
- §3.2 requirements 2-5.
- §6.2 critical tests.
- Also used in Project 3.
References
- Nystrom, Crafting Interpreters, Ch. 6.
- Terence Parr, Language Implementation Patterns.
- TOML specification grammar sections.
Key insights Parser simplicity comes from explicit grammar and strict data model decisions.
Summary Recursive descent provides transparent control over grammar, diagnostics, and semantic checks for external DSLs.
Homework/Exercises to practice the concept
- Write EBNF for section + assignment + array grammar.
- Decide duplicate-key policy and justify it.
- Design a sync strategy after assignment errors.
Solutions to the homework/exercises
- Use nonterminals from this section and validate with sample inputs.
- Fail-fast duplicate policy is safest for operations.
- Skip to newline or next section header token.
3. Project Specification
3.1 What You Will Build
A standalone parser tool/library that reads config files, validates syntax and semantics, and emits a typed normalized representation.
Included:
- section headers and key-value assignments.
- scalar values (string, number, boolean) and arrays.
- deterministic diagnostics with source spans.
Excluded:
- full TOML compatibility.
- datetime/inline table support.
- schema language (optional extension only).
3.2 Functional Requirements
- Tokenize source with line/column tracking.
- Parse statements into typed config model.
- Support sections, assignments, arrays, comments.
- Detect malformed syntax and duplicate keys.
- Provide normalized output representation.
- Round-trip fixtures must stay semantically equivalent.
3.3 Non-Functional Requirements
- Performance: parse 1 MB config under 150 ms on reference machine.
- Reliability: deterministic errors and normalized output across platforms.
- Usability: diagnostics include snippet and expected token context.
3.4 Example Usage / Output
$ ./bin/p02-config-parse fixtures/app.conf
[ok] Parsed file: fixtures/app.conf
[ok] Sections: server, database
[ok] server.port = Int(8080)
[ok] server.debug = Bool(true)
exit=0
3.5 Data Formats / Schemas / Protocols
ConfigModel {
sections: map<string, Section>
}
Section {
entries: ordered_map<string, Value>
}
Value = String | Int | Float | Bool | Array<Value>
3.6 Edge Cases
- section header missing closing bracket.
- assignment without value.
- unterminated string.
- duplicate keys in same section.
- mixed array types if homogeneous policy chosen.
3.7 Real World Outcome
You can confidently load human-authored configs, reject malformed files with specific guidance, and expose typed values to other systems.
3.7.1 How to Run (Copy/Paste)
cd project_based_ideas/COMPILERS_RUNTIMES/DOMAIN_SPECIFIC_LANGUAGES_DSL_PROJECTS
make p02-test
./bin/p02-config-parse fixtures/p02_golden.conf
3.7.2 Golden Path Demo (Deterministic)
Fixed fixture p02_golden.conf yields a stable normalized JSON-equivalent dump.
3.7.3 If CLI: exact terminal transcript
$ ./bin/p02-config-parse fixtures/p02_golden.conf
[ok] token_count=43
[ok] section_count=2
[ok] key_count=7
[ok] normalized_hash=5a33e196
exit=0
$ ./bin/p02-config-parse fixtures/p02_bad_unterminated.conf
[error] LexError at 6:14: unterminated string literal
[hint] close string with a matching quote (") before end-of-line
exit=2
4. Solution Architecture
4.1 High-Level Design
Source bytes --> Lexer --> Token stream --> Parser --> Config model --> Normalizer
(the lexer emits lexical errors; the parser emits syntax and semantic errors)
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Scanner | Produce tokens + spans | explicit FSM modes |
| Parser | Build model from tokens | recursive descent with lookahead |
| Semantic checker | duplicate/type policy | strict defaults |
| Normalizer | deterministic output | insertion order preservation |
4.3 Data Structures (No Full Code)
enum TokenType { SECTION_OPEN, SECTION_CLOSE, IDENT, EQUALS, STRING, NUMBER, BOOLEAN, COMMA, ARRAY_OPEN, ARRAY_CLOSE, NEWLINE, EOF }
record Token { type, lexeme, line, col }
record ParseError { kind, expected, found, span, hint }
4.4 Algorithm Overview
Key Algorithm: Statement Parser Loop
- Read next token.
- If section header token, parse section and set context.
- Else parse assignment into current context.
- On error, synchronize to newline/section start.
- Continue until EOF.
Complexity Analysis
- Time: O(n) tokens.
- Space: O(n) for token buffer + model.
5. Implementation Guide
5.1 Development Environment Setup
mkdir -p bin fixtures tests
5.2 Project Structure
p02-config-parser/
├── src/
│ ├── lexer.*
│ ├── parser.*
│ ├── model.*
│ └── diagnostics.*
├── fixtures/
├── tests/
└── README.md
5.3 The Core Question You’re Answering
“How do I transform human-readable configuration text into a deterministic typed model with useful errors?”
5.4 Concepts You Must Understand First
- Tokenization with source spans.
- Recursive descent parse control flow.
- Type representation via tagged unions.
- Error recovery and synchronization.
5.5 Questions to Guide Your Design
- Should booleans be tokenized or interpreted in parser?
- Where should duplicate key checks live?
- How will you keep output deterministic?
- Which errors should halt immediately vs recover?
5.6 Thinking Exercise
Given:
[server]
port = 8080
debug = tru
Draw token stream and identify where lexical vs semantic failure should trigger.
5.7 The Interview Questions They’ll Ask
- Difference between lexer and parser responsibilities?
- Why use recursive descent for small DSLs?
- How do you design parse errors users can fix quickly?
- Strict vs permissive parsing tradeoffs?
- How does this project prepare for expression grammars?
5.8 Hints in Layers
Hint 1: Build token dump command first.
Hint 2: Parse only section + string values, then add types.
Hint 3: Add synchronization before adding multi-error reporting.
Hint 4: Keep one fixture per error type.
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Lexing | Crafting Interpreters | Ch. 4 |
| Parsing | Crafting Interpreters | Ch. 6 |
| Error handling | Language Implementation Patterns | parser diagnostics chapters |
5.10 Implementation Phases
Phase 1: Foundation (3-4 hours)
- Lexer with spans and token dump.
- Parser for sections + string assignments.
Checkpoint: parse minimal fixture successfully.
Phase 2: Core Functionality (4-6 hours)
- typed values and arrays.
- duplicate-key policy and semantic checks.
Checkpoint: all golden fixtures pass.
Phase 3: Polish & Edge Cases (2-4 hours)
- recovery-based error reporting.
- deterministic normalization and round-trip tests.
Checkpoint: failure fixtures emit expected diagnostics exactly.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Newline handling | tokenized / ignored | tokenized | cleaner statement boundaries |
| Duplicate key policy | override / warn / fail | fail | safer operational behavior |
| Array typing | heterogeneous / homogeneous | configurable with strict default | practical but safe |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Lexer tests | token correctness | comments, escapes, numbers |
| Parser tests | grammar correctness | sections, arrays, assignments |
| Semantic tests | policy correctness | duplicate keys, type rules |
6.2 Critical Test Cases
- Happy path config with all value types.
- Unterminated string lexical error.
- Missing section bracket syntax error.
- Duplicate key semantic error.
- Round-trip equivalence fixture.
6.3 Test Data
fixtures/p02_golden.conf
fixtures/p02_bad_unterminated.conf
fixtures/p02_bad_duplicate_key.conf
fixtures/p02_roundtrip.conf
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| lexer without spans | useless parser errors | attach line/col to every token |
| parser doing coercion everywhere | tangled logic | keep parse and semantic phases separate |
| inconsistent newline handling | platform-only failures | normalize newlines pre-scan |
7.2 Debugging Strategies
- Log tokens before parse for failing fixtures.
- Print parser rule entry/exit with token index for hard failures.
7.3 Performance Traps
Repeated substring allocations in scanner loops can dominate runtime; slice by offsets and materialize lazily when possible.
8. Extensions & Challenges
8.1 Beginner Extensions
- add inline comments after values.
- add `null` literal support.
8.2 Intermediate Extensions
- schema validator for required keys and types.
- include/import directive with cycle detection.
8.3 Advanced Extensions
- incremental parser for live editor feedback.
- migration formatter preserving comments and order.
9. Real-World Connections
9.1 Industry Applications
- Service runtime configs.
- Infrastructure-as-code tooling.
9.2 Related Open Source Projects
- TOML: https://toml.io/en/ - config format spec.
- serde + toml-rs: https://github.com/toml-rs/toml - production parser ecosystem.
9.3 Interview Relevance
- compiler front-end fundamentals.
- parsing architecture and diagnostics.
10. Resources
10.1 Essential Reading
- Nystrom, Crafting Interpreters (scanning + parsing).
- Parr, Language Implementation Patterns.
10.2 Video Resources
- Parser construction walkthroughs (recursive descent).
- Talks on config language design tradeoffs.
10.3 Tools & Documentation
- `jq` for normalized output diffing.
- official TOML grammar for comparison.
10.4 Related Projects in This Series
- Project 3: expression parser (reuses this tokenization and parse discipline).
- Project 6: template engine (reuses mode-based lexing).
11. Self-Assessment Checklist
11.1 Understanding
- I can explain each front-end phase (lexing, parsing, semantics).
- I can justify my duplicate-key policy.
- I can trace parser control flow for one fixture.
11.2 Implementation
- parser handles supported grammar.
- deterministic diagnostics pass snapshot tests.
- round-trip fixture passes.
11.3 Growth
- I identified one grammar extension and impact.
- I can explain recovery strategy tradeoffs.
- I can map this to expression DSL design.
12. Submission / Completion Criteria
Minimum Viable Completion:
- scanner + parser for sections and assignments.
- typed values and one failure-path diagnostic.
Full Completion:
- all minimum plus arrays, semantic checks, deterministic tests.
Excellence (Going Above & Beyond):
- error recovery with multiple diagnostics per run.
- schema validation extension and benchmark notes.
13. Additional Content Rules (Applied)
13.1 Determinism
Use fixed fixtures and stable normalization order; assert output hash.
13.2 Outcome Completeness
Document both success and failure transcripts with explicit exit codes.
13.3 Cross-Linking
Concepts here feed directly into Project 3 and Project 6.
13.4 No Placeholder Text
All sections are concrete and directly actionable.