Project 8: Prompt DSL + Linter
Lint reports, style gates, and policy-rule enforcement for prompt files in CI.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 5-10 days (capstone: 3-5 weeks) |
| Main Programming Language | TypeScript |
| Alternative Programming Languages | Python, Rust |
| Coolness Level | Level 4: Platform Builder |
| Business Potential | 4. Developer Tooling |
| Knowledge Area | Prompt Tooling |
| Software or Tool | DSL parser + static checks |
| Main Book | Language Implementation Patterns (Parr) |
| Concept Clusters | Tool Calling and MCP Interoperability; Prompt Contracts and Output Typing |
1. Learning Objectives
By completing this project, you will:
- Design a formal grammar (BNF/EBNF) for a prompt DSL that captures metadata, sections, output schema references, and policy annotations.
- Implement a lexer and recursive-descent parser that converts `.prompt` files into a typed abstract syntax tree (AST).
- Build a rule engine using the visitor pattern over ASTs that classifies findings by severity (error, warning, info) and category (style, safety, maintainability).
- Produce SARIF-formatted lint reports consumable by GitHub Code Scanning, VS Code, and CI pipelines.
- Integrate prompt linting as a merge-blocking gate in a CI pipeline with version-tracked rule sets.
- Design audit trails that link every lint finding to a rule version, prompt file version, and pipeline run ID.
2. All Theory Needed (Per-Concept Breakdown)
Grammar Design for Prompt DSLs
Fundamentals A grammar is a set of formal rules that defines which strings belong to a language. For a prompt DSL, the grammar specifies the legal structure of prompt files: what sections exist, how metadata is declared, where output schema references appear, and which tokens are meaningful versus decorative. Without a grammar, prompt files are just free-form text that no tool can reliably parse, validate, or transform. Grammar design is the first act of this project because every downstream capability (linting, refactoring, diff-analysis) depends on having a deterministic parse tree. You are not building a general-purpose programming language; you are building a structured document format with enough formality that machines can reason about prompt content before it ever reaches an LLM.
Deep Dive into the concept Designing a grammar for prompt files requires deciding on three layers: lexical structure (tokens), syntactic structure (parse rules), and semantic constraints (what combinations are meaningful).
At the lexical level, you define token types. A prompt DSL typically needs: FRONTMATTER_DELIMITER (the --- markers), KEY, COLON, VALUE, SECTION_HEADER (like ## system or ## user), OUTPUT_SCHEMA_REF (a reference like @schema:OrderResponse), POLICY_ANNOTATION (like @policy:no-pii), TEXT_BLOCK (free-form prompt content), and COMMENT. The lexer scans the source file character by character and emits a stream of these tokens. A critical design choice is whether your lexer is context-free or context-sensitive. Frontmatter parsing often requires a modal lexer: between the first and second --- delimiters, the lexer emits KEY/COLON/VALUE tokens; outside that region, it emits section headers and text blocks.
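To make the modal distinction concrete, here is a minimal TypeScript sketch of a line-oriented lexer. The token names and the `Token` shape follow the list above, but the exact API is an illustrative assumption, not a fixed design:

```typescript
type TokenType =
  | "FRONT_DELIM" | "KEY" | "COLON" | "VALUE"
  | "SECTION_HEADER" | "TEXT_LINE" | "EOF";

interface Token { type: TokenType; value: string; line: number; }

// Modal lexer sketch: between the first and second `---` the lexer is in
// frontmatter mode and splits lines into KEY/COLON/VALUE tokens; outside
// that region it emits SECTION_HEADER or TEXT_LINE tokens.
function lex(source: string): Token[] {
  const tokens: Token[] = [];
  let mode: "IN_FRONTMATTER" | "IN_BODY" = "IN_BODY";
  const lines = source.split("\n");
  lines.forEach((text, i) => {
    const line = i + 1;
    if (text.trim() === "---") {
      tokens.push({ type: "FRONT_DELIM", value: "---", line });
      mode = mode === "IN_FRONTMATTER" ? "IN_BODY" : "IN_FRONTMATTER";
    } else if (text.trim() === "") {
      // blank lines are not significant in this simplified sketch
    } else if (mode === "IN_FRONTMATTER") {
      const m = text.match(/^(\w+):\s*(.*)$/);
      if (!m) throw new Error(`lex error at line ${line}: expected key: value`);
      tokens.push({ type: "KEY", value: m[1], line });
      tokens.push({ type: "COLON", value: ":", line });
      tokens.push({ type: "VALUE", value: m[2], line });
    } else if (text.startsWith("## ")) {
      tokens.push({ type: "SECTION_HEADER", value: text.slice(3).trim(), line });
    } else {
      tokens.push({ type: "TEXT_LINE", value: text, line });
    }
  });
  tokens.push({ type: "EOF", value: "", line: lines.length });
  return tokens;
}
```

Note how the same characters produce different tokens depending on the mode: `model: gpt-4` is KEY/COLON/VALUE inside frontmatter but would be a plain TEXT_LINE in a section body.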
At the syntactic level, you write production rules. These are typically expressed in BNF or EBNF notation. Each rule describes how tokens combine into larger structures. For example, a prompt file is a sequence: optional frontmatter block, followed by one or more sections, where each section has a header and a body. The parser reads the token stream and builds an Abstract Syntax Tree (AST) where each node represents a structural element. The AST is the central data structure for all downstream analysis.
Two main parser implementation strategies apply here. Recursive descent is the simplest: you write one function per grammar rule, and each function consumes tokens and returns an AST node. This approach is easy to debug and produces excellent error messages because you always know which rule you are in when a parse error occurs. PEG (Parsing Expression Grammar) parsers are an alternative that handles ambiguity through ordered choice: the first matching alternative wins. PEG parsers are often generated from a grammar specification file, which means less hand-written code but sometimes harder-to-control error messages.
Semantic constraints go beyond syntax. A file might parse correctly but violate rules like “every prompt file must have an output_schema key in frontmatter” or “the system section must appear before the user section.” These constraints are not part of the grammar itself; they are checked by walking the AST after parsing. This separation matters: syntax errors are reported by the parser, semantic errors are reported by analysis passes.
AST node types for a prompt DSL typically include: PromptFile (root), FrontmatterBlock, MetadataEntry (key-value pair), Section (header + body), TextBlock, SchemaReference, PolicyAnnotation, IncludeDirective (if the DSL supports composition), and Comment. Each node carries source location metadata (file path, line number, column number) so that lint findings can point to exact positions.
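One possible TypeScript shape for these nodes is sketched below. The node names follow the list above; the exact fields are a design choice, not a mandated API:

```typescript
// Every node carries a source location so lint findings can point at
// exact positions (file:line:column).
interface SourceLocation { file: string; line: number; column: number; }

interface MetadataEntry { kind: "MetadataEntry"; key: string; value: string; loc: SourceLocation; }
interface FrontmatterBlock { kind: "FrontmatterBlock"; entries: MetadataEntry[]; loc: SourceLocation; }
interface TextBlock { kind: "TextBlock"; text: string; loc: SourceLocation; }
interface SchemaReference { kind: "SchemaReference"; name: string; loc: SourceLocation; }
interface PolicyAnnotation { kind: "PolicyAnnotation"; policy: string; loc: SourceLocation; }

type SectionChild = TextBlock | SchemaReference | PolicyAnnotation;
interface Section { kind: "Section"; name: string; body: SectionChild[]; loc: SourceLocation; }

// Root node: optional frontmatter followed by one or more sections.
interface PromptFile {
  kind: "PromptFile";
  frontmatter: FrontmatterBlock | null;
  sections: Section[];
  loc: SourceLocation;
}
```

The `kind` discriminant lets rule code narrow node types with a plain `switch`, which is idiomatic TypeScript for visitor-style traversal.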
Error recovery is a design challenge. If the parser encounters an unexpected token, should it abort or try to recover and continue parsing? For a linter, recovery is important because you want to report as many issues as possible in a single run, not just the first syntax error. A common recovery strategy is “panic mode”: skip tokens until you find a synchronization point (like the next section header), then resume parsing from there.
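A hedged sketch of both ideas together, recursive descent plus panic-mode recovery, might look like this in TypeScript (the token and node shapes are simplified assumptions; the stream is assumed to end with an EOF token):

```typescript
interface Tok { type: string; value: string; line: number; }
interface SectionNode { name: string; body: string[]; line: number; }
interface ParseError { message: string; line: number; }

// Recursive-descent parser sketch: one function per grammar rule. On an
// unexpected token it records an error, then skips forward to the next
// SECTION_HEADER (the synchronization point) and resumes parsing.
function parseSections(tokens: Tok[]): { sections: SectionNode[]; errors: ParseError[] } {
  const sections: SectionNode[] = [];
  const errors: ParseError[] = [];
  let pos = 0;
  const peek = () => tokens[pos];

  const parseSection = (): SectionNode => {
    const header = tokens[pos++]; // caller guarantees this is SECTION_HEADER
    const body: string[] = [];
    while (peek().type === "TEXT_LINE") body.push(tokens[pos++].value);
    return { name: header.value, body, line: header.line };
  };

  while (peek().type !== "EOF") {
    if (peek().type === "SECTION_HEADER") {
      sections.push(parseSection());
    } else {
      // Panic mode: report the problem, then skip to the next section
      // boundary so later sections are still parsed and linted.
      errors.push({ message: `unexpected ${peek().type}`, line: peek().line });
      while (peek().type !== "SECTION_HEADER" && peek().type !== "EOF") pos++;
    }
  }
  return { sections, errors };
}
```

The key property for a linter is visible in the return type: parsing yields both a (partial) tree and a list of errors, rather than aborting at the first problem.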
How this fits into the project Grammar design is the foundation of Project 8. The grammar specification defines what the parser accepts, which determines what the AST looks like, which in turn determines what lint rules can inspect. Every other component in the project depends on the quality and completeness of the grammar.
Definitions & key terms
- BNF/EBNF: Backus-Naur Form / Extended BNF. Notation systems for writing context-free grammars. EBNF adds repetition (`*`, `+`) and optional (`?`) operators.
- Token: The smallest meaningful unit produced by the lexer (e.g., a keyword, a delimiter, a text block).
- AST (Abstract Syntax Tree): A tree representation of parsed source where each node corresponds to a grammar construct. Unlike a parse tree, the AST omits syntactically redundant tokens.
- Recursive descent parser: A top-down parser built from mutually recursive functions, one per grammar rule.
- PEG (Parsing Expression Grammar): A parsing formalism where alternatives are tried in order; the first match wins.
- Source location: File path, line, and column attached to every AST node for diagnostics.
Mental model diagram (ASCII)
Source .prompt File
|
v
+------------------+
| Lexer | character stream -> token stream
+------------------+
|
v
[FRONT_DELIM] [KEY:model] [COLON] [VALUE:gpt-4]
[FRONT_DELIM] [SECTION:system] [TEXT_BLOCK:...]
[SCHEMA_REF:@schema:Order] [SECTION:user] ...
|
v
+------------------+
| Parser | token stream -> AST
| (recursive |
| descent / PEG) |
+------------------+
|
v
PromptFile (root)
├── FrontmatterBlock
│ ├── MetadataEntry(model, "gpt-4")
│ ├── MetadataEntry(version, "2.1")
│ └── MetadataEntry(output_schema, "Order")
├── Section(system)
│ └── TextBlock("You are a helpful...")
├── Section(user)
│ └── TextBlock("{{user_query}}")
└── PolicyAnnotation(@policy:no-pii)
|
v
+------------------+
| Analysis Passes | AST -> Findings
+------------------+
How it works (step-by-step, with invariants and failure modes)
- The lexer reads the `.prompt` file and emits a token stream. Invariant: every byte in the source is accounted for by exactly one token. Failure mode: an unrecognized character (e.g., a stray control character) causes a lexer error with position.
- The parser consumes the token stream using grammar rules and builds the AST. Invariant: the AST structure matches the grammar production rules. Failure mode: an unexpected token triggers an error listing the expected alternatives and the actual token.
- If the parser encounters an error, it enters panic-mode recovery: skip forward to the next section delimiter and resume parsing. Invariant: recovery never silently drops a section boundary.
- After parsing, the AST is returned with all source location metadata attached. Invariant: every AST node has a valid file:line:column reference.
- Downstream analysis passes receive the AST and report their own errors separately from parse errors.
Minimal concrete example
EBNF grammar sketch for .prompt files:
prompt_file = [ frontmatter ] , section+ ;
frontmatter = "---" , metadata_line+ , "---" ;
metadata_line = KEY , ":" , VALUE , NEWLINE ;
section = section_header , body ;
section_header = "##" , SECTION_NAME , NEWLINE ;
body = ( text_line | schema_ref | policy_ann | include )* ;
schema_ref = "@schema:" , IDENTIFIER ;
policy_ann = "@policy:" , IDENTIFIER ;
include = "@include:" , FILE_PATH ;
Example .prompt file:
---
model: gpt-4
version: 2.1
output_schema: OrderResponse
---
## system
You are an order processing assistant.
@policy:no-pii
@schema:OrderResponse
## user
{{user_query}}
Common misconceptions
- “A grammar is overkill for prompt files; regex is enough.” Regex cannot handle nested structures, error recovery, or produce an AST. Even simple prompt files have enough structure (frontmatter, multiple sections, annotations) to justify a proper grammar.
- “The grammar must handle every possible prompt format.” Start minimal. Support the constructs your lint rules need. Extend the grammar when new rules require new AST node types.
- “Parse errors and lint errors are the same thing.” Parse errors mean the file does not conform to the grammar. Lint errors mean the file parses correctly but violates a policy or style rule. Conflating them confuses users and tools.
Check-your-understanding questions
- Why does the lexer need a modal state when parsing frontmatter versus section bodies?
- What information must every AST node carry for lint rules to produce useful diagnostics?
- When would you choose recursive descent over a PEG parser generator for this project?
Check-your-understanding answers
- Inside frontmatter (between `---` delimiters), the lexer must recognize KEY/COLON/VALUE patterns. Outside frontmatter, those same characters are just text content. Without modal state, the lexer would misclassify section body text as metadata.
- Every AST node must carry the source file path, start line number, start column, and end position. Without this, lint findings cannot point users to the exact location of the violation.
- Choose recursive descent when you need fine-grained control over error messages and recovery behavior, which is critical for a developer-facing linter. Choose a PEG generator when the grammar is complex and you want to iterate on rules quickly from a specification file.
Real-world applications
- Terraform, ESLint, and Prettier all start with a grammar and parser that produces an AST for analysis and transformation.
- GitHub Actions workflow files use a YAML grammar with additional semantic constraints, analogous to a prompt DSL with frontmatter.
- Shopify’s Liquid template language uses a grammar to separate template directives from content blocks.
Where you’ll apply it
- Phase 1 of this project: define the grammar, build the lexer and parser, and verify that fixture files parse into the expected AST shape.
- The AST produced here is consumed by the rule engine (Concept 2) and the CI reporter (Concept 3).
References
- “Language Implementation Patterns” by Terence Parr - Chapters on LL(k) parsing and tree construction
- “Compilers: Principles, Techniques, and Tools” by Aho, Lam, Sethi, Ullman - Chapter 4 on syntax analysis
- “Crafting Interpreters” by Robert Nystrom - Chapters on scanning and parsing (freely available online)
- PEG.js / Peggy documentation for JavaScript PEG parser generators
Key insights The grammar is the constitution of your DSL: every lint rule, every refactoring tool, and every CI check derives its authority from what the grammar declares legal.
Summary Grammar design for prompt DSLs means defining formal lexical and syntactic rules that transform free-form prompt text into a structured AST. The AST is the foundation for all analysis. Choosing the right parser strategy (recursive descent for control, PEG for specification speed) and designing robust error recovery determines whether the linter is useful in practice or abandoned after the first frustrating error message.
Homework/Exercises to practice the concept
- Write an EBNF grammar for a prompt file format that supports: frontmatter with at least 5 metadata keys, system/user/assistant sections, `@schema` references, and `@policy` annotations.
- Implement a hand-written lexer (in pseudocode) that handles the modal frontmatter vs body distinction and emits at least 8 distinct token types.
- Draw the expected AST (as a tree diagram) for a prompt file with 2 metadata entries, a system section containing a policy annotation, and a user section containing a schema reference.
Solutions to the homework/exercises
- The EBNF should have a clear `prompt_file` root production, a `frontmatter` production gated by `---` delimiters, and `section` productions that allow interleaved text, schema refs, and policy annotations. The grammar should explicitly handle newlines as significant whitespace (section boundaries) rather than ignoring them.
- The lexer pseudocode should track a `state` variable (IN_FRONTMATTER vs IN_BODY) and switch when encountering the second `---`. Token types should include at minimum: FRONT_DELIM, KEY, COLON, VALUE, SECTION_HEADER, TEXT_LINE, SCHEMA_REF, POLICY_ANN, and EOF.
- The AST tree should show PromptFile as root, with FrontmatterBlock and Section children. Each MetadataEntry node under FrontmatterBlock should have key/value fields. Each Section node should have a name field and children for TextBlock, SchemaReference, or PolicyAnnotation nodes. Every node should note file:line:col.
Static Policy Analysis for Prompts
Fundamentals Static analysis means examining source artifacts without executing them to find bugs, policy violations, and style issues. In traditional software engineering, linters like ESLint, Pylint, and Clippy work this way: they parse source code into an AST, walk the tree, and flag patterns that match known problems. Static policy analysis for prompts applies the same principle to prompt files. Instead of checking for unused variables or null pointer risks, you check for missing output schemas, unsafe tool-call patterns, prompt injection vulnerabilities, and style inconsistencies. The key advantage is that these checks run at authoring time or in CI, before the prompt ever reaches an LLM, which makes them fast, deterministic, and cheap.
Deep Dive into the concept The architecture of a prompt lint engine has three parts: a rule registry, a tree visitor, and a finding reporter.
The rule registry is a catalog of all lint rules. Each rule has an ID (like P008_NO_TOOL_CALL_WITHOUT_POLICY), a severity level (error, warning, info), a category (security, style, maintainability), and a visitor function that inspects AST nodes. Rules are loaded at startup and can be configured per-project through a lint config file that enables, disables, or adjusts severity of individual rules.
The tree visitor is the execution mechanism. The engine walks the AST produced by the parser (from Concept 1) in depth-first order. At each node, it invokes all registered rule visitors that care about that node type. For example, a rule checking “every prompt file must declare an output_schema in frontmatter” registers interest in the FrontmatterBlock node type. When the visitor reaches a FrontmatterBlock, it checks whether any MetadataEntry child has key output_schema. If not, it emits a finding. The visitor pattern decouples rule logic from traversal logic, making it easy to add new rules without modifying the engine.
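A minimal TypeScript sketch of this decoupling follows; the node and rule shapes are illustrative assumptions, and the example rule mirrors the `output_schema` check described above:

```typescript
interface AstNode { kind: string; children?: AstNode[]; [k: string]: unknown; }
interface Finding { ruleId: string; severity: "error" | "warning" | "info"; message: string; }

interface Rule {
  id: string;
  severity: "error" | "warning" | "info";
  nodeKind: string;                      // which AST node type this rule inspects
  check: (node: AstNode) => Finding[];   // pure function: never mutates the tree
}

// Single depth-first walk; at each node, dispatch to every rule registered
// for that node kind. Adding a rule requires no change to the traversal.
function runRules(root: AstNode, rules: Rule[]): Finding[] {
  const byKind = new Map<string, Rule[]>();
  for (const r of rules) byKind.set(r.nodeKind, [...(byKind.get(r.nodeKind) ?? []), r]);

  const findings: Finding[] = [];
  const walk = (node: AstNode) => {
    for (const rule of byKind.get(node.kind) ?? []) findings.push(...rule.check(node));
    for (const child of node.children ?? []) walk(child);
  };
  walk(root);
  return findings;
}

// Example rule: frontmatter must declare an output_schema key.
const requireSchema: Rule = {
  id: "P008_REQUIRE_OUTPUT_SCHEMA",
  severity: "error",
  nodeKind: "FrontmatterBlock",
  check: (node) => {
    const keys = (node.children ?? []).map((c) => c.key as string);
    return keys.includes("output_schema")
      ? []
      : [{ ruleId: "P008_REQUIRE_OUTPUT_SCHEMA", severity: "error",
           message: "Prompt file missing required 'output_schema' metadata." }];
  },
};
```

Registering a second rule is just another `Rule` object in the array passed to `runRules`; the walk itself never changes.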
Security lint rules deserve special treatment. These rules detect patterns that could lead to prompt injection, data leakage, or unauthorized tool execution. Examples: a prompt section that interpolates user input without sanitization markers, a tool-call declaration without an associated policy annotation, or a system section that lacks a safety preamble. Security rules should always be severity “error” and should be non-suppressible in CI mode. This prevents teams from silencing critical findings with inline disable comments.
Style rules enforce consistency across a prompt library. Examples: section ordering (system before user before assistant), metadata key naming conventions (snake_case), maximum prompt length per section, and required comment blocks explaining the prompt’s purpose. Style rules are typically severity “warning” and can be configured per team.
Maintainability rules catch structural issues that make prompts hard to evolve. Examples: prompts with no version metadata, prompts referencing schemas by path instead of by registered name (making refactoring fragile), sections exceeding a complexity threshold (measured by token count or nesting depth), and circular include chains.
The finding reporter takes the list of findings and formats them. The most important output format for CI integration is SARIF (Static Analysis Results Interchange Format), a JSON standard that GitHub Code Scanning, VS Code, and other tools can ingest. A SARIF finding includes: rule ID, severity, message, file path, start line, start column, end line, end column, and optional help URI. Producing SARIF output means your prompt linter integrates with the same tooling that handles ESLint, CodeQL, and Semgrep findings.
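As a sketch of the reporter side, the function below wraps findings in a minimal SARIF 2.1.0 envelope: one run, one tool, one result per finding. The `LintFinding` input shape is an assumption; a production report would also carry rule metadata and end positions:

```typescript
interface LintFinding {
  ruleId: string;
  level: "error" | "warning" | "note";  // SARIF levels; an "info" finding maps to "note"
  message: string;
  file: string;
  startLine: number;
  startColumn: number;
}

// Minimal SARIF 2.1.0 envelope: version, one run with a tool driver, and
// one result per finding with a physicalLocation region.
function toSarif(findings: LintFinding[], toolName: string, toolVersion: string) {
  return {
    version: "2.1.0",
    runs: [{
      tool: { driver: { name: toolName, version: toolVersion } },
      results: findings.map((f) => ({
        ruleId: f.ruleId,
        level: f.level,
        message: { text: f.message },
        locations: [{
          physicalLocation: {
            artifactLocation: { uri: f.file },
            region: { startLine: f.startLine, startColumn: f.startColumn },
          },
        }],
      })),
    }],
  };
}
```

Serializing the return value with `JSON.stringify` yields a document that CI systems can ingest as a SARIF artifact.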
Rule severity classification is a design decision with real consequences. If too many rules are “error,” developers will resist adoption. If too few are “error,” security issues will slip through. A good starting taxonomy: security rules are always error; maintainability rules are error if they cause CI/CD problems (like circular includes) and warning otherwise; style rules are warning or info.
How this fits into the project This concept drives Phase 2 of Project 8. Once the parser produces an AST (Concept 1), the rule engine walks that AST and produces findings. The quality and specificity of your rules determine whether the linter is actually useful to prompt authors.
Definitions & key terms
- Lint rule: A named, versioned check that inspects AST nodes and emits findings when a pattern matches.
- Visitor pattern: A design pattern where an object (the visitor) traverses a data structure (the AST) and performs operations at each node without modifying the node classes.
- Finding / Diagnostic: A single lint result containing rule ID, severity, message, and source location.
- SARIF (Static Analysis Results Interchange Format): A JSON-based standard for expressing static analysis results, supported by GitHub, VS Code, and major CI platforms.
- Rule severity: Classification of a finding’s importance: error (blocks merge), warning (should fix), info (advisory).
- Suppression: An inline annotation that silences a specific rule at a specific location. Security rules should be non-suppressible.
Mental model diagram (ASCII)
Parsed AST
|
v
+------------------+
| Rule Registry |
| ┌─────────────┐ |
| │ P001: style │ |
| │ P002: style │ |
| │ P008: secur │ |
| │ P012: maint │ |
| │ ... │ |
| └─────────────┘ |
+------------------+
|
v
+------------------+
| AST Visitor |
| (depth-first |
| walk, invoke |
| matching rules |
| at each node) |
+------------------+
|
v
+------------------+
| Findings List |
| ┌──────────────┐ |
| │ rule: P008 │ |
| │ sev: error │ |
| │ file:ln:col │ |
| │ message: ... │ |
| └──────────────┘ |
+------------------+
|
v
+-----------+-----------+
| |
v v
+----------------+ +-------------------+
| Console Report | | SARIF JSON Report |
| (human-readable| | (CI/IDE ingestion)|
| with colors) | | |
+----------------+ +-------------------+
How it works (step-by-step, with invariants and failure modes)
- Load the rule registry from the lint config file. Invariant: every rule has a unique ID, a severity, and a visitor function. Failure mode: duplicate rule ID causes a startup error.
- Receive the AST from the parser. Invariant: the AST has source locations on every node. Failure mode: if the parser produced error-recovery nodes, those are flagged as “parse-error” findings before rule analysis begins.
- Walk the AST depth-first. At each node, invoke all rules registered for that node type. Invariant: rules are pure functions of the AST node and its children; they do not modify the AST. Failure mode: a rule throws an exception; the engine catches it, logs the rule ID and node location, and continues with remaining rules.
- Collect all findings into a list sorted by file path, then line number. Invariant: findings are deterministic; same AST always produces same findings in same order. Failure mode: non-deterministic rule (e.g., one that uses random sampling) is rejected at registration time.
- Format findings into the requested output format (console, SARIF JSON, or both). Invariant: SARIF output validates against the SARIF JSON schema. Failure mode: a finding with missing source location produces a SARIF entry with a “region unknown” marker instead of crashing the reporter.
Minimal concrete example
Lint rule: P008_REQUIRE_OUTPUT_SCHEMA
node_type: FrontmatterBlock
severity: error
category: maintainability
visitor pseudocode:
FUNCTION visit(frontmatter_node):
keys = [entry.key FOR entry IN frontmatter_node.children
WHERE entry.type == MetadataEntry]
IF "output_schema" NOT IN keys:
EMIT Finding(
rule_id = "P008_REQUIRE_OUTPUT_SCHEMA",
severity = "error",
message = "Prompt file missing required 'output_schema' metadata.",
location = frontmatter_node.source_location
)
SARIF output fragment for this finding:
{
"ruleId": "P008_REQUIRE_OUTPUT_SCHEMA",
"level": "error",
"message": { "text": "Prompt file missing required 'output_schema' metadata." },
"locations": [{
"physicalLocation": {
"artifactLocation": { "uri": "prompts/order_handler.prompt" },
"region": { "startLine": 1, "startColumn": 1, "endLine": 4, "endColumn": 4 }
}
}]
}
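For comparison, the visitor pseudocode above could be written in TypeScript roughly as follows (the node shapes are the illustrative ones from Concept 1, not a fixed API):

```typescript
interface Loc { file: string; line: number; column: number; }
interface Entry { type: "MetadataEntry"; key: string; value: string; }
interface Frontmatter { type: "FrontmatterBlock"; children: Entry[]; sourceLocation: Loc; }
interface Finding { ruleId: string; severity: string; message: string; location: Loc; }

// Direct translation of the visitor pseudocode: collect metadata keys,
// emit one error-severity finding if output_schema is absent.
function visitFrontmatter(node: Frontmatter): Finding[] {
  const keys = node.children
    .filter((c) => c.type === "MetadataEntry")
    .map((c) => c.key);
  if (keys.includes("output_schema")) return [];
  return [{
    ruleId: "P008_REQUIRE_OUTPUT_SCHEMA",
    severity: "error",
    message: "Prompt file missing required 'output_schema' metadata.",
    location: node.sourceLocation,
  }];
}
```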
Common misconceptions
- “Lint rules should check prompt quality (whether the prompt produces good outputs).” Static analysis checks structure and policy at authoring time. Checking output quality requires running the prompt against an LLM, which is evaluation, not linting. The two are complementary but distinct.
- “A few rules are enough.” Production prompt libraries accumulate dozens of rules as teams discover new failure patterns. The rule engine must be designed for extensibility from day one.
- “SARIF is unnecessarily complex; plain text output is fine.” Plain text works for local development, but SARIF is what enables in-IDE annotations, GitHub PR annotations, and dashboarding. Supporting SARIF is what makes the linter a real tool rather than a script.
Check-your-understanding questions
- Why should security lint rules be non-suppressible in CI mode?
- What is the advantage of the visitor pattern over writing each rule as a standalone AST-walking function?
- How does producing SARIF output change the developer experience compared to plain console output?
Check-your-understanding answers
- If security rules can be suppressed, developers under deadline pressure will silence them rather than fix the underlying issue. Non-suppressible rules in CI ensure that security findings always block the merge, even if local runs allow overrides.
- The visitor pattern separates traversal from inspection. All rules share a single tree walk, which is efficient. Each rule only needs to declare which node types it cares about and what to check. Adding a new rule requires no changes to the traversal engine.
- SARIF output enables GitHub to show lint findings as inline PR annotations, VS Code to show them as squiggly underlines in the editor, and dashboards to track finding trends over time. This shifts linting from a CI log you scroll through to an integrated part of the development workflow.
Real-world applications
- ESLint (JavaScript) uses exactly this architecture: parser -> AST -> rule visitors -> findings -> formatters (including SARIF).
- Semgrep runs pattern-based static analysis over multiple languages using AST matching and outputs SARIF for CI integration.
- Terraform's `validate` command and `tflint` check infrastructure-as-code files for policy compliance before deployment.
Where you’ll apply it
- Phase 2 of this project: implement the rule engine, write initial rules for each category (security, style, maintainability), and produce SARIF output.
- The findings produced here feed into the CI gate (Concept 3).
References
- SARIF specification: OASIS Static Analysis Results Interchange Format (sarif-standard)
- “Language Implementation Patterns” by Terence Parr - Chapter on tree visitors and pattern matchers
- ESLint architecture documentation: rule authoring guide
- Semgrep documentation: custom rule authoring
Key insights The linter’s value is proportional to the specificity of its rules and the actionability of its findings; a vague warning that developers ignore is worse than no warning at all.
Summary Static policy analysis for prompts means building a rule engine that walks the prompt AST, checks each node against a registry of categorized rules, and produces structured findings (ideally in SARIF format) that integrate with CI pipelines and IDEs. The visitor pattern decouples rule logic from traversal, making the system extensible. Security rules must be non-suppressible; style rules should be configurable. The quality of error messages and source location precision determines whether developers adopt or ignore the tool.
Homework/Exercises to practice the concept
- Design 5 lint rules for a prompt DSL: 2 security rules, 2 style rules, and 1 maintainability rule. For each, specify: rule ID, severity, node type it inspects, the condition it checks, and the error message it produces.
- Write pseudocode for an AST visitor that walks a PromptFile tree and invokes registered rules. Show how a rule registers interest in a node type and how the visitor dispatches.
- Produce a sample SARIF JSON document containing 3 findings from different rules, showing proper use of ruleId, level, message, and location fields.
Solutions to the homework/exercises
- Security rules might include: `P_SEC_001: system section must not interpolate raw user input` (checks TextBlock children of system Section for template variable patterns without sanitization markers) and `P_SEC_002: tool-call declaration requires policy annotation` (checks Section nodes containing tool references for adjacent PolicyAnnotation siblings). Style rules: `P_STY_001: sections must appear in order system, user, assistant` (checks Section ordering in PromptFile children) and `P_STY_002: metadata keys must be snake_case` (checks MetadataEntry key fields against a regex). Maintainability: `P_MNT_001: prompt file must have version metadata` (checks FrontmatterBlock for a MetadataEntry with key "version").
- The visitor pseudocode should show a `registry` map from node type to list of rule functions, a `walk(node)` function that invokes matching rules then recurses into children, and rule functions that accept a node and return a (possibly empty) list of Finding objects.
- The SARIF document should have a top-level `runs` array with one run containing a `tool` object (linter name and version), a `results` array with 3 entries, each having `ruleId`, `level`, `message.text`, and `locations[0].physicalLocation` with `artifactLocation.uri` and `region` with line/column numbers.
Release Governance for Prompt Artifacts
Fundamentals Release governance for prompt artifacts means treating prompt files and their lint rules as versioned, auditable artifacts that flow through a CI/CD pipeline with explicit gate policies. In traditional software, you would never deploy a code change without running tests and getting a PR review. Prompt files deserve the same discipline. Without CI gates, prompt changes slip into production unchecked, lint rules silently diverge between teams, and nobody knows which version of a prompt is running where. This concept closes the loop: every prompt change is parsed, linted, gated, and recorded before it can merge. Every lint rule change is itself versioned and tested against a corpus of known-good and known-bad fixtures.
Deep Dive into the concept
The CI integration pipeline for prompt linting has a specific structure. When a developer opens a pull request that modifies .prompt files, the CI pipeline triggers the linter. The pipeline must: (1) parse all changed prompt files, (2) run the full rule set against the parsed ASTs, (3) classify findings by severity, (4) apply the gate policy (block on any error-severity finding, annotate warnings), and (5) publish the SARIF report as a PR check artifact.
Gate policies define what blocks a merge versus what merely warns. A strict policy blocks on any error-severity finding and requires zero security-category findings. A lenient policy might allow warnings to pass through but still block on security errors. The gate policy should be declared in a config file (e.g., .prompt-lint.yaml) that lives in the repository root, version-controlled alongside the prompt files. This means the policy itself is reviewable and auditable.
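A hypothetical `.prompt-lint.yaml` illustrating such a policy is sketched below; every key name here is invented for illustration, since the actual schema is whatever you design:

```yaml
# Hypothetical lint config -- key names are illustrative, not a standard.
ruleset: 1.7.0
gate:
  block_on: [error]            # any error-severity finding blocks the merge
  security_suppressible: false # security findings cannot be silenced
rules:
  P_STY_002:
    severity: warning          # per-rule severity override
  P_MNT_001:
    severity: warning          # grace period before promotion to error
    promote_to: error
    promote_on: "2025-09-01"   # override expires on this date
```

Because this file lives in the repository root, changes to the gate policy go through the same PR review as the prompts themselves.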
Version tracking has two dimensions: prompt file versions and lint rule set versions. Prompt files should carry a version field in their frontmatter (e.g., version: 2.3). Lint rule sets should be versioned as a whole (e.g., ruleset: 1.7.0) so that when a finding appears, you can determine which version of the rules produced it. This is critical for trend analysis: if finding counts spike, was it because new rules were added, or because prompt quality actually degraded?
Audit trail requirements connect findings to specific pipeline runs. Every lint execution should produce a report that records: the pipeline run ID, the git commit SHA, the ruleset version, the list of files checked, the list of findings, and the gate decision (pass/block). These reports should be stored as immutable artifacts (e.g., in CI artifact storage or an object store). This enables retroactive analysis: “when did we start seeing P008_MISSING_OUTPUT_SCHEMA findings in the payments team’s prompts?”
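One possible TypeScript shape for such an audit record is sketched below; the field names are assumptions that simply mirror the list above:

```typescript
// Illustrative shape for the immutable audit record of one lint execution.
interface LintAuditArtifact {
  runId: string;          // CI pipeline run ID
  commitSha: string;      // git commit the lint ran against
  rulesetVersion: string; // e.g. "1.7.0"
  filesChecked: string[];
  findings: { ruleId: string; severity: string; file: string; line: number }[];
  gateDecision: "pass" | "block";
  createdAt: string;      // ISO-8601 timestamp
}

// Serialize once and upload to artifact storage; the record is never
// mutated afterwards, which is what makes retroactive analysis trustworthy.
function serializeAudit(a: LintAuditArtifact): string {
  return JSON.stringify(a, null, 2);
}
```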
Rule set evolution is a governance challenge. When you add a new lint rule, it may immediately produce findings across the entire prompt corpus. If the rule is severity “error,” it blocks every PR until all existing prompts are fixed. A phased rollout strategy avoids this: introduce new rules at severity “warning” for a grace period, give teams time to remediate, then promote to “error.” The lint config file should support per-rule severity overrides with expiration dates to manage this transition.
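The severity-override-with-expiration mechanic can be sketched as a single resolution function (the shapes are assumptions; the point is that promotion to "error" happens automatically when the grace period ends):

```typescript
type Severity = "error" | "warning" | "info";

interface SeverityOverride {
  severity: Severity; // severity applied during the grace period
  expires?: string;   // ISO date; after this, the rule's base severity applies
}

// Resolve the effective severity of a rule: an unexpired override wins,
// otherwise the base severity applies. A new rule can therefore run as
// "warning" during rollout and auto-promote to "error" on the expiry date.
function effectiveSeverity(
  base: Severity,
  override: SeverityOverride | undefined,
  now: Date,
): Severity {
  if (!override) return base;
  if (override.expires && now >= new Date(override.expires)) return base;
  return override.severity;
}
```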
Rollback of lint rules is also a real scenario. If a new rule produces false positives in production-critical PRs, you need a way to quickly revert the ruleset version without reverting unrelated changes. Treating the ruleset as a versioned artifact with a clear release process (not just code in main) enables this.
How this fits into the project This concept drives Phase 3 of Project 8 and determines whether the linter is actually adopted by teams. A linter that is not integrated into CI is a linter that nobody runs.
Definitions & key terms
- Gate policy: A rule that maps lint finding severities to CI actions (block merge, annotate PR, allow through).
- Ruleset version: A version identifier for the collection of lint rules applied during a pipeline run.
- Lint config file: A repository-level configuration (e.g., `.prompt-lint.yaml`) that specifies which rules are enabled, their severity overrides, and the gate policy.
- Audit artifact: An immutable record of a lint execution including run ID, commit SHA, ruleset version, findings, and gate decision.
- Grace period: A time window during which a new rule is enforced at a lower severity to allow teams to remediate before it blocks merges.
- Phased rollout: The practice of introducing new lint rules at warning severity before promoting them to error severity.
Mental model diagram (ASCII)
Developer opens PR with .prompt file changes
|
v
+-------------------+
| CI Pipeline |
| Triggered |
+-------------------+
|
v
+-------------------+ +-------------------+
| Parse .prompt |---->| Lint with |
| files (Concept 1)| | Rule Engine |
+-------------------+ | (Concept 2) |
+-------------------+
|
v
+-------------------+
| Gate Policy |
| Evaluation |
| |
| errors > 0? |
| security > 0? |
+-------------------+
/ \
v v
+----------+ +-----------+
| BLOCK | | PASS |
| merge | | merge |
| (red | | (green |
| check) | | check) |
+----------+ +-----------+
\ /
v v
+-------------------+
| Publish SARIF |
| + Audit Artifact |
| (run_id, sha, |
| ruleset_ver, |
| findings, gate |
| decision) |
+-------------------+
How it works (step-by-step, with invariants and failure modes)
- A PR is opened or updated with changes to `.prompt` files or lint rule definitions. Invariant: the CI pipeline is triggered on every PR that touches these paths. Failure mode: misconfigured CI path filters cause the pipeline to skip prompt file changes.
- The pipeline checks out the code, installs the linter, and loads the lint config. Invariant: the lint config file is present and valid. Failure mode: missing or malformed config causes the pipeline to fail with a clear error (not a silent pass).
- The linter parses all changed `.prompt` files and runs the rule engine. Invariant: parse and lint are deterministic; the same input always produces the same findings. Failure mode: a non-deterministic rule causes inconsistent gate decisions across re-runs.
- The gate policy evaluates findings. Invariant: error-severity security findings always block; this is non-negotiable. Failure mode: a misconfigured gate policy allows security errors through; defense: the linter hardcodes that security-error findings always block regardless of config.
- The pipeline publishes the SARIF report and the audit artifact. Invariant: the audit artifact is immutable after creation. Failure mode: artifact storage failure; defense: the pipeline retries artifact upload and fails the build if storage is unreachable.
- The PR displays inline annotations from the SARIF report. Invariant: annotations point to correct file:line:column positions. Failure mode: source locations are off due to git diff line number shifting; defense: the linter runs against the PR head commit, not the diff.
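The gate-evaluation step above, including the hardcoded defense for security errors, can be sketched in a few lines of TypeScript. The shapes are illustrative; the one non-negotiable behavior is that a security error blocks even when the configured policy says otherwise.

```typescript
// Gate evaluation: security errors always block, regardless of the
// configured policy; other errors block only if the policy says so.
type Severity = "error" | "warning" | "info";

interface Finding {
  ruleId: string;
  severity: Severity;
  category: "security" | "style" | "maintainability";
}

interface GatePolicy {
  onError: "block" | "annotate" | "silent";
}

function evaluateGate(findings: Finding[], policy: GatePolicy): "pass" | "block" {
  for (const f of findings) {
    // Misconfiguration defense: this branch ignores the policy on purpose.
    if (f.category === "security" && f.severity === "error") return "block";
    if (f.severity === "error" && policy.onError === "block") return "block";
  }
  return "pass";
}
```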
Minimal concrete example
`.prompt-lint.yaml`:

ruleset_version: "1.7.0"
gate_policy:
  on_error: block
  on_warning: annotate
  on_info: silent
rules:
  P008_REQUIRE_OUTPUT_SCHEMA:
    severity: error
  P_STY_001_SECTION_ORDER:
    severity: warning
  P_SEC_002_TOOL_POLICY_REQUIRED:
    severity: error
    suppressible: false
  P_MNT_003_MAX_SECTION_TOKENS:
    severity: warning
    threshold: 2000
    grace_period_until: "2025-09-01"
CI config snippet (GitHub Actions):
- name: Lint prompt files
  run: npx prompt-lint --config .prompt-lint.yaml --sarif out/lint.sarif
- name: Upload SARIF
  uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: out/lint.sarif
- name: Upload audit artifact
  uses: actions/upload-artifact@v4
  with:
    name: prompt-lint-audit-${{ github.run_id }}
    path: out/lint-audit.json
Common misconceptions
- “Running the linter locally is enough; CI integration is optional.” Local runs are skipped under deadline pressure. CI gates are the only reliable enforcement point. If the linter is not in CI, it might as well not exist.
- “Adding a new lint rule is a small change.” A new error-severity rule can block every open PR in the repository. New rules must be rolled out with a grace period at warning severity first.
- “Audit artifacts are only useful for compliance.” Audit artifacts are the primary tool for debugging lint regressions (“when did this rule start firing?”), tuning gate policies, and tracking team-level prompt quality trends over time.
Check-your-understanding questions
- Why should the gate policy be stored in a version-controlled config file rather than hardcoded in the CI pipeline?
- What is the risk of introducing a new error-severity lint rule without a grace period?
- What information must the audit artifact contain to enable retroactive analysis of lint trends?
Check-your-understanding answers
- A version-controlled config file makes the gate policy reviewable, auditable, and consistent across branches. Hardcoding in CI means changes to the policy are hidden in pipeline config that most developers never review.
- Every open PR in the repository that touches prompt files will be blocked until the existing prompts are remediated. This creates a brownout where no prompt changes can merge, frustrating teams and potentially leading to pressure to disable the rule entirely.
- At minimum: pipeline run ID, git commit SHA, ruleset version, list of files checked, list of findings (with rule IDs, severities, and locations), and the gate decision (pass or block with reason).
Real-world applications
- GitHub uses CodeQL in CI pipelines with SARIF upload and PR annotations for security analysis of code changes.
- Google’s large-scale code review system enforces style and correctness rules at submission time with automated checks.
- Terraform Cloud runs `terraform validate` and policy checks (Sentinel/OPA) as gate conditions before infrastructure changes can be applied.
Where you’ll apply it
- Phase 3 of this project: configure the CI pipeline, define the gate policy, implement audit artifact generation, and verify that the end-to-end flow (PR -> lint -> gate -> annotate -> artifact) works correctly.
- This concept also feeds into Project 15 (Prompt Registry) where prompt versions and lint results become part of a centralized catalog.
References
- GitHub Code Scanning and SARIF upload documentation
- “Accelerate” by Forsgren, Humble, Kim - Chapters on CI/CD practices and change failure rates
- Open Policy Agent (OPA) documentation for policy-as-code patterns
- Google Engineering Practices: code review developer guide
Key insights A linter without CI integration is a suggestion; a linter with CI gates, SARIF output, and audit artifacts is a governance system.
Summary Release governance for prompt artifacts means embedding the prompt linter into CI/CD pipelines with explicit gate policies (block on error, annotate on warning), version-tracked rule sets with phased rollout strategies, and immutable audit artifacts that enable trend analysis and retroactive debugging. The gate policy lives in a version-controlled config file. Security-error findings always block, no exceptions. New rules are introduced at warning severity with a grace period before promotion to error. Every pipeline run produces an audit artifact linking the commit SHA, ruleset version, findings, and gate decision.
Homework/Exercises to practice the concept
- Write a `.prompt-lint.yaml` config file with at least 6 rules across security, style, and maintainability categories. Include one rule with a grace period and one non-suppressible security rule. Define the gate policy.
- Design a CI pipeline (in pseudocode or GitHub Actions YAML) that runs the linter, uploads SARIF, and produces an audit artifact. Include a step that fails the build if the gate policy blocks.
- Describe a scenario where a new lint rule is rolled out poorly (causes widespread PR blockage) and write the rollback procedure.
Solutions to the homework/exercises
- The config should show `gate_policy` with `on_error: block` and `on_warning: annotate`. Rules should include at least one `suppressible: false` security rule and one with `grace_period_until` set to a future date. The rule at grace period should have `severity: warning` with a note that it will be promoted to `severity: error` after the date passes.
- The CI pipeline should have steps: checkout, install linter, run linter with `--config` and `--sarif` flags, upload SARIF using `github/codeql-action/upload-sarif`, upload audit artifact using `actions/upload-artifact`, and a final step that checks the linter exit code (0 = pass, non-zero = block). The audit artifact step should run even if the lint step fails (using `if: always()`).
- The rollback scenario: a new rule `P_MNT_005_NO_DEEP_NESTING` is added at severity “error” without a grace period. 40 open PRs across 8 teams are immediately blocked. The rollback procedure: (1) open a PR that sets the rule’s severity to “warning” in `.prompt-lint.yaml`, (2) fast-track merge that config change, (3) re-run blocked PRs, (4) schedule a team-by-team remediation plan, (5) set a `grace_period_until` date, (6) promote to “error” after remediation is complete.
3. Project Specification
3.1 What You Will Build
A domain-specific prompt language and linter that enforces style, safety, and maintainability rules in CI.
3.2 Functional Requirements
- Define DSL syntax for prompt metadata, sections, and output schema refs.
- Parse prompt files into AST and run lint rules.
- Classify lint findings by severity (error/warn/info).
- Fail CI on security-critical rules.
3.3 Non-Functional Requirements
- Performance: Lint 100 prompt files under 5 seconds.
- Reliability: Lint output ordering and codes are deterministic.
- Security/Policy: Security lint rules are non-bypassable in CI mode.
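The deterministic-ordering requirement above is usually satisfied by sorting findings on a stable composite key before emitting any report, so re-runs and parallel lint workers always produce byte-identical output. A minimal sketch:

```typescript
// Deterministic finding order: sort by file path, then line, then rule ID.
// With this in place, the SARIF report and console output are stable
// across re-runs regardless of traversal or worker scheduling order.
interface Finding {
  file: string;
  line: number;
  ruleId: string;
}

function sortFindings(findings: Finding[]): Finding[] {
  return [...findings].sort(
    (a, b) =>
      a.file.localeCompare(b.file) ||
      a.line - b.line ||
      a.ruleId.localeCompare(b.ruleId),
  );
}
```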
3.4 Example Usage / Output
$ npm run lint:prompts --workspace p08-prompt-dsl
> p08-prompt-dsl@1.0.0 lint:prompts
[INFO] Parsed 37 prompt files
[PASS] style rules: 37/37
[PASS] safety rules: 37/37
[PASS] required metadata blocks present
[INFO] SARIF report: out/p08/lint.sarif
3.5 Data Formats / Schemas / Protocols
- Prompt DSL files (`.prompt`) with frontmatter + sections.
- AST JSON dump for debugging parser behavior.
- SARIF output for GitHub code scanning ingestion.
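A minimal SARIF 2.1.0 skeleton covering the handful of fields GitHub code scanning needs for inline PR annotations might look like the following. This is a sketch of the report shape, not an exhaustive SARIF implementation; consult the SARIF spec for optional fields like rule metadata and fingerprints.

```typescript
// Build a minimal SARIF 2.1.0 document from lint findings.
// SARIF uses "note" where this linter says "info".
interface Finding {
  ruleId: string;
  level: "error" | "warning" | "note";
  message: string;
  file: string;
  line: number;
}

function toSarif(findings: Finding[]) {
  return {
    version: "2.1.0",
    runs: [
      {
        tool: { driver: { name: "prompt-lint" } },
        results: findings.map((f) => ({
          ruleId: f.ruleId,
          level: f.level,
          message: { text: f.message },
          locations: [
            {
              physicalLocation: {
                artifactLocation: { uri: f.file }, // repo-relative path
                region: { startLine: f.line },
              },
            },
          ],
        })),
      },
    ],
  };
}
```

Stable `ruleId` values matter here: code scanning groups and tracks alerts by rule ID across runs, so renaming a rule resets its trend history.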
3.6 Edge Cases
- Prompt file includes unknown metadata key.
- Nested include chain introduces circular reference.
- Rule depends on schema file that moved path.
- Large monolithic prompt exceeds complexity threshold.
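The circular-include edge case above is classic cycle detection: a depth-first walk over include edges with a path stack, where revisiting a file already on the stack means a cycle. A sketch, assuming the include graph has already been extracted into a map:

```typescript
// Detect a circular include chain. `includes` maps each file to the
// files it includes; returns the cycle path if one is reachable from
// `entry`, otherwise null.
function findIncludeCycle(
  includes: Map<string, string[]>,
  entry: string,
): string[] | null {
  const stack: string[] = [];
  const onStack = new Set<string>();

  function visit(file: string): string[] | null {
    if (onStack.has(file)) {
      // Cycle found: report the path from the first occurrence back to here,
      // which makes a precise, actionable lint message possible.
      return [...stack.slice(stack.indexOf(file)), file];
    }
    stack.push(file);
    onStack.add(file);
    for (const next of includes.get(file) ?? []) {
      const cycle = visit(next);
      if (cycle) return cycle;
    }
    stack.pop();
    onStack.delete(file);
    return null;
  }
  return visit(entry);
}
```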
3.7 Real World Outcome
This section is your golden reference. Your implementation is considered correct when your run looks materially like this and produces the same artifact types.
3.7.1 How to Run (Copy/Paste)
$ npm run lint:prompts --workspace p08-prompt-dsl
- Working directory: `project_based_ideas/AI_AGENTS_LLM_RAG/PROMPT_ENGINEERING_PROJECTS`
- Required inputs: project fixtures under `fixtures/`
- Output directory: `out/p08`
3.7.2 Golden Path Demo (Deterministic)
Use the fixed seed already embedded in the command or config profile. You should see stable pass/fail totals between runs.
3.7.3 If CLI: exact terminal transcript
$ npm run lint:prompts --workspace p08-prompt-dsl
> p08-prompt-dsl@1.0.0 lint:prompts
[INFO] Parsed 37 prompt files
[PASS] style rules: 37/37
[PASS] safety rules: 37/37
[PASS] required metadata blocks present
[INFO] SARIF report: out/p08/lint.sarif
$ echo $?
0
Failure demo:
$ npm run lint:prompts --workspace p08-prompt-dsl -- --file prompts/bad/policy_violation.prompt
[ERROR] prompts/bad/policy_violation.prompt:12 rule P008_NO_TOOL_CALL_WITHOUT_POLICY failed
[ERROR] prompts/bad/policy_violation.prompt:19 rule P008_MISSING_OUTPUT_SCHEMA failed
$ echo $?
2
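The transcripts above imply an exit-code convention: 0 for a passing gate, 2 for a policy block. The specific codes (and reserving 1 for linter crashes) are assumptions drawn from the demo output, not a documented contract, but making the mapping explicit in one function keeps CI behavior predictable:

```typescript
// Exit-code convention inferred from the demo transcripts (assumption):
// 0 = gate passed, 2 = gate blocked, 1 = linter crashed.
// Crashes must exit non-zero so a broken linter never silently passes.
function exitCodeFor(decision: "pass" | "block", crashed: boolean): number {
  if (crashed) return 1;
  return decision === "pass" ? 0 : 2;
}
```

Distinguishing "blocked" (2) from "crashed" (1) lets the pipeline surface a policy failure differently from an infrastructure failure.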
4. Solution Architecture
4.1 High-Level Design
User Input / Trigger
|
v
+-------------------------+
| Parser |
+-------------------------+
|
v
+-------------------------+
| Rule Engine |
+-------------------------+
|
v
+-------------------------+
| CI Reporter |
+-------------------------+
|
v
Artifacts / API / UI / Logs
4.2 Key Components
| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Parser | Converts DSL text into AST. | Fail with precise line/column diagnostics. |
| Rule Engine | Runs style/safety/maintainability checks. | Security rules have highest severity and block merges. |
| CI Reporter | Exports findings to SARIF/JSON. | Provide stable rule IDs for trend tracking. |
4.3 Data Structures (No Full Code)
P08_Request:
- trace_id
- input payload/context
- policy profile
P08_Decision:
- status (ALLOW | DENY | RETRY | ESCALATE | PROMOTE | ROLLBACK)
- reason_code
- artifact pointers
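The outlined data structures can be sketched as TypeScript types. Field names follow the outline above and are a starting point, not a fixed schema:

```typescript
// Sketch of the request/decision shapes from the outline above.
type P08Status = "ALLOW" | "DENY" | "RETRY" | "ESCALATE" | "PROMOTE" | "ROLLBACK";

interface P08Request {
  traceId: string;        // deterministic trace metadata
  payload: unknown;       // input payload/context
  policyProfile: string;  // which policy profile to apply
}

interface P08Decision {
  status: P08Status;
  reasonCode: string;     // unified, machine-readable failure/success reason
  artifacts: string[];    // pointers (paths/URIs) to produced artifacts
}
```

Keeping `reasonCode` machine-readable (an enum-like string, not free prose) is what makes automated gating and trend reporting possible downstream.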
4.4 Algorithm Overview
Key algorithm: Policy-aware decision pipeline
- Normalize input and attach deterministic trace metadata.
- Run contract/schema validation and project-specific core checks.
- Apply policy gates and decide: success, retry, deny, escalate, or rollback.
- Persist artifacts and publish operational metrics.
Complexity Analysis (conceptual):
- Time: O(n) over fixture/request items in a batch run.
- Space: O(n) for traces and report artifacts.
5. Implementation Guide
5.1 Development Environment Setup
# 1) Install dependencies
# 2) Prepare fixtures under fixtures/
# 3) Run the project command(s) listed in section 3.7
5.2 Project Structure
p08/
├── src/
├── fixtures/
├── policies/
├── out/
└── README.md
5.3 The Core Question You’re Answering
“How do I make prompts reviewable, lintable, and maintainable like normal code?”
This question matters because it forces the project to produce objective evidence instead of relying on subjective prompt impressions.
5.4 Concepts You Must Understand First
- DSL grammar design
- Why does this concept matter for P08?
- Book Reference: “Language Implementation Patterns” by Terence Parr
- Static analysis rules
- Why does this concept matter for P08?
- Book Reference: Compiler linting and AST rule engines
- Policy-as-lint checks
- Why does this concept matter for P08?
- Book Reference: Secure coding standards adapted for prompts
5.5 Questions to Guide Your Design
- Boundary and contracts
- What is the smallest safe contract surface for prompt dsl + linter?
- Which failure reasons must be explicit and machine-readable?
- Runtime policy
- What is allowed automatically, what needs retry, and what must escalate?
- Which policy checks must happen before any side effect?
- Evidence and observability
- What traces/metrics are required for fast incident triage?
- What specific thresholds trigger rollback or human review?
5.6 Thinking Exercise
Pre-Mortem for Prompt DSL + Linter
Before implementing, write down 10 ways this project can fail in production. Classify each failure into: contract, policy, security, or operations.
Questions to answer:
- Which failures can be prevented before runtime?
- Which failures require runtime detection and escalation?
5.7 The Interview Questions They’ll Ask
- “Why introduce a DSL instead of plain markdown prompts?”
- “How do you design lint rules that stay useful over time?”
- “What belongs in a prompt AST?”
- “How would you make lint findings developer-friendly?”
- “Which prompt issues are best caught statically?”
5.8 Hints in Layers
Hint 1: Start with smallest grammar Support only mandatory constructs first.
Hint 2: Write fixture files per rule Every lint rule needs pass/fail fixtures.
Hint 3: Separate parser from rule engine Keep syntax errors distinct from policy errors.
Hint 4: Integrate with CI early Prompt linting only works when merged into developer workflow.
5.9 Books That Will Help
| Topic | Book | Chapter |
|-------|------|---------|
| DSL construction | “Language Implementation Patterns” by Terence Parr | Core pattern chapters |
| Parser techniques | “Compilers: Principles, Techniques, and Tools” | Parsing + semantic analysis |
| DevEx at scale | “Accelerate” by Forsgren et al. | Change quality chapters |
5.10 Implementation Phases
Phase 1: Foundation
- Define contracts, policy profiles, and deterministic fixtures.
- Build the core execution path and baseline artifact output.
- Checkpoint: One golden-path scenario runs end-to-end with trace id and artifact.
Phase 2: Core Functionality
- Add project-specific evaluation/routing/verification logic.
- Add error paths with unified reason codes.
- Checkpoint: Golden-path and one failure-path both behave deterministically.
Phase 3: Operational Hardening
- Add metrics, trend reporting, and release/rollback or escalation gates.
- Document runbook and incident/debug flow.
- Checkpoint: Team member can reproduce output from clean checkout.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Validation order | Late checks vs early checks | Early checks | Fail-fast saves cost and reduces unsafe execution |
| Failure handling | Silent retries vs explicit reason codes | Explicit reason codes | Enables automation and faster debugging |
| Rollout/escalation | Manual-only vs policy-driven | Policy-driven with manual override | Balances speed and safety |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|----------|---------|----------|
| Unit Tests | Validate deterministic building blocks | schema checks, policy gates, parser behaviors |
| Integration Tests | Verify end-to-end project path | golden-path command/API flow |
| Edge Case Tests | Ensure robust failure handling | malformed fixture, blocked policy action |
6.2 Critical Test Cases
- Golden path succeeds and emits expected artifact shape.
- High-risk/invalid path returns deterministic error with reason code.
- Replay with same seed/config yields same decision summary.
6.3 Test Data
fixtures/golden_case.*
fixtures/failure_case.*
fixtures/edge_cases/*
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---------|---------|----------|
| “Lint rules are noisy” | Rules are too broad or not context-aware. | Add rule-scoping and suppressions with justification. |
| “Parser errors are hard to fix” | Diagnostics lack position and expected token. | Return line/column + nearest valid grammar hint. |
| “Teams bypass lint locally” | Rules only run manually. | Enforce lint in CI and pre-commit hooks. |
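The "parser errors are hard to fix" pitfall is addressed by a diagnostic type that always carries position plus the expected-token hint. A minimal sketch (names illustrative):

```typescript
// A parser diagnostic with position and an expected-token hint, so the
// message tells the author both where parsing failed and what would
// have been valid there.
interface ParseDiagnostic {
  file: string;
  line: number;
  column: number;
  found: string;      // the token actually seen
  expected: string[]; // tokens that the grammar would have accepted here
}

function formatDiagnostic(d: ParseDiagnostic): string {
  return (
    `${d.file}:${d.line}:${d.column} ` +
    `unexpected '${d.found}'; expected ${d.expected.join(" or ")}`
  );
}
```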
7.2 Debugging Strategies
- Re-run deterministic fixtures with fixed seed and compare trace ids.
- Diff latest artifacts against last known-good baseline.
- Isolate whether failure is contract, policy, or runtime dependency related.
7.3 Performance Traps
- Unbounded retries inflate latency and cost.
- Overly broad logging can slow hot paths.
- Missing cache/canonicalization can create avoidable compute churn.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add one new fixture category and expected outcome labels.
- Add one new reason code with deterministic validation.
8.2 Intermediate Extensions
- Add dashboard-ready trend exports.
- Add automated regression diff against previous run artifacts.
8.3 Advanced Extensions
- Integrate with rollout gates or human approval workflows.
- Add chaos-style fault injection and recovery assertions.
9. Real-World Connections
9.1 Industry Applications
- PromptOps platform teams operating AI features under compliance constraints.
- Internal AI governance tooling for release safety and incident response.
9.2 Related Open Source Projects
- LangChain/LangSmith style eval and tracing workflows.
- OpenTelemetry-based observability stacks for decision traces.
9.3 Interview Relevance
- Demonstrates ability to convert probabilistic model behavior into deterministic software guarantees.
- Shows practical production-thinking: contracts, policies, monitoring, and operational controls.
10. Resources
10.1 Essential Reading
- OpenAI/Anthropic/Google provider docs for structured outputs, tool calling, and prompt controls.
- OWASP LLM Top 10 and NIST AI RMF guidance for safety and governance.
10.2 Video Resources
- Talks on LLM eval systems, PromptOps, and AI safety operations.
10.3 Tools & Documentation
- JSON schema validators, policy engines, and tracing infrastructure docs.
10.4 Related Projects in This Series
- Previous projects: build specialized primitives.
- Next projects: integrate these primitives into broader operational systems.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain the core risk boundaries and policy gates for this project.
- I can explain the artifact format and why each field exists.
- I can justify the release/escalation criteria.
11.2 Implementation
- Golden-path and failure-path flows both work.
- Deterministic artifacts are produced and reproducible.
- Observability fields are present for debugging and audits.
11.3 Growth
- I can describe one tradeoff I made and why.
- I can explain this project design in an interview setting.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Golden path works with deterministic output artifact.
- At least one failure-path scenario returns unified error shape/reason code.
- Core metrics are emitted and documented.
Full Completion:
- Includes automated tests, trend reporting, and reproducible runbook.
- Includes operational thresholds for promote/rollback or escalate/approve.
Excellence (Above & Beyond):
- Integrates with adjacent projects (registry, rollout, firewall, HITL) cleanly.
- Demonstrates incident drill replay and fast root-cause workflow.