Project 8: Prompt DSL + Linter
Lint reports, style gates, and policy-rule enforcement for prompt files in CI.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3: Advanced |
| Time Estimate | 5-10 days (capstone: 3-5 weeks) |
| Main Programming Language | TypeScript |
| Alternative Programming Languages | Python, Rust |
| Coolness Level | Level 4: Platform Builder |
| Business Potential | 4. Developer Tooling |
| Knowledge Area | Prompt Tooling |
| Software or Tool | DSL parser + static checks |
| Main Book | Language Implementation Patterns (Parr) |
| Concept Clusters | Tool Calling and MCP Interoperability; Prompt Contracts and Output Typing |
1. Learning Objectives
By completing this project, you will:
- Design a formal grammar (BNF/EBNF) for a prompt DSL that captures metadata, sections, output schema references, and policy annotations.
- Implement a lexer and recursive-descent parser that converts `.prompt` files into a typed abstract syntax tree (AST).
- Build a rule engine using the visitor pattern over ASTs that classifies findings by severity (error, warning, info) and category (style, safety, maintainability).
- Produce SARIF-formatted lint reports consumable by GitHub Code Scanning, VS Code, and CI pipelines.
- Integrate prompt linting as a merge-blocking gate in a CI pipeline with version-tracked rule sets.
- Design audit trails that link every lint finding to a rule version, prompt file version, and pipeline run ID.
2. All Theory Needed (Per-Concept Breakdown)
Grammar Design for Prompt DSLs
Fundamentals A grammar is a set of formal rules that defines which strings belong to a language. For a prompt DSL, the grammar specifies the legal structure of prompt files: what sections exist, how metadata is declared, where output schema references appear, and which tokens are meaningful versus decorative. Without a grammar, prompt files are just free-form text that no tool can reliably parse, validate, or transform. Grammar design is the first act of this project because every downstream capability (linting, refactoring, diff-analysis) depends on having a deterministic parse tree. You are not building a general-purpose programming language; you are building a structured document format with enough formality that machines can reason about prompt content before it ever reaches an LLM.
Deep Dive into the concept Designing a grammar for prompt files requires deciding on three layers: lexical structure (tokens), syntactic structure (parse rules), and semantic constraints (what combinations are meaningful).
At the lexical level, you define token types. A prompt DSL typically needs: FRONTMATTER_DELIMITER (the --- markers), KEY, COLON, VALUE, SECTION_HEADER (like ## system or ## user), OUTPUT_SCHEMA_REF (a reference like @schema:OrderResponse), POLICY_ANNOTATION (like @policy:no-pii), TEXT_BLOCK (free-form prompt content), and COMMENT. The lexer scans the source file character by character and emits a stream of these tokens. A critical design choice is whether your lexer is context-free or context-sensitive. Frontmatter parsing often requires a modal lexer: between the first and second --- delimiters, the lexer emits KEY/COLON/VALUE tokens; outside that region, it emits section headers and text blocks.
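To make the modal distinction concrete, here is a minimal TypeScript sketch of a line-oriented lexer. The token names and the `Token` shape follow the list above, but the exact API is an illustrative assumption, not a fixed design:

```typescript
type TokenType =
  | "FRONT_DELIM" | "KEY" | "COLON" | "VALUE"
  | "SECTION_HEADER" | "TEXT_LINE" | "EOF";

interface Token { type: TokenType; value: string; line: number; }

// Modal lexer sketch: between the first and second `---` the lexer is in
// frontmatter mode and splits lines into KEY/COLON/VALUE tokens; outside
// that region it emits SECTION_HEADER or TEXT_LINE tokens.
function lex(source: string): Token[] {
  const tokens: Token[] = [];
  let mode: "IN_FRONTMATTER" | "IN_BODY" = "IN_BODY";
  const lines = source.split("\n");
  lines.forEach((text, i) => {
    const line = i + 1;
    if (text.trim() === "---") {
      tokens.push({ type: "FRONT_DELIM", value: "---", line });
      mode = mode === "IN_FRONTMATTER" ? "IN_BODY" : "IN_FRONTMATTER";
    } else if (text.trim() === "") {
      // blank lines are not significant in this simplified sketch
    } else if (mode === "IN_FRONTMATTER") {
      const m = text.match(/^(\w+):\s*(.*)$/);
      if (!m) throw new Error(`lex error at line ${line}: expected key: value`);
      tokens.push({ type: "KEY", value: m[1], line });
      tokens.push({ type: "COLON", value: ":", line });
      tokens.push({ type: "VALUE", value: m[2], line });
    } else if (text.startsWith("## ")) {
      tokens.push({ type: "SECTION_HEADER", value: text.slice(3).trim(), line });
    } else {
      tokens.push({ type: "TEXT_LINE", value: text, line });
    }
  });
  tokens.push({ type: "EOF", value: "", line: lines.length });
  return tokens;
}
```

Note how the same characters produce different tokens depending on the mode: `model: gpt-4` is KEY/COLON/VALUE inside frontmatter but would be a plain TEXT_LINE in a section body.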
At the syntactic level, you write production rules. These are typically expressed in BNF or EBNF notation. Each rule describes how tokens combine into larger structures. For example, a prompt file is a sequence: optional frontmatter block, followed by one or more sections, where each section has a header and a body. The parser reads the token stream and builds an Abstract Syntax Tree (AST) where each node represents a structural element. The AST is the central data structure for all downstream analysis.
Two main parser implementation strategies apply here. Recursive descent is the simplest: you write one function per grammar rule, and each function consumes tokens and returns an AST node. This approach is easy to debug and produces excellent error messages because you always know which rule you are in when a parse error occurs. PEG (Parsing Expression Grammar) parsers are an alternative that handles ambiguity through ordered choice: the first matching alternative wins. PEG parsers are often generated from a grammar specification file, which means less hand-written code but sometimes harder-to-control error messages.
Semantic constraints go beyond syntax. A file might parse correctly but violate rules like “every prompt file must have an output_schema key in frontmatter” or “the system section must appear before the user section.” These constraints are not part of the grammar itself; they are checked by walking the AST after parsing. This separation matters: syntax errors are reported by the parser, semantic errors are reported by analysis passes.
AST node types for a prompt DSL typically include: PromptFile (root), FrontmatterBlock, MetadataEntry (key-value pair), Section (header + body), TextBlock, SchemaReference, PolicyAnnotation, IncludeDirective (if the DSL supports composition), and Comment. Each node carries source location metadata (file path, line number, column number) so that lint findings can point to exact positions.
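One possible TypeScript shape for these nodes is sketched below. The node names follow the list above; the exact fields are a design choice, not a mandated API:

```typescript
// Every node carries a source location so lint findings can point at
// exact positions (file:line:column).
interface SourceLocation { file: string; line: number; column: number; }

interface MetadataEntry { kind: "MetadataEntry"; key: string; value: string; loc: SourceLocation; }
interface FrontmatterBlock { kind: "FrontmatterBlock"; entries: MetadataEntry[]; loc: SourceLocation; }
interface TextBlock { kind: "TextBlock"; text: string; loc: SourceLocation; }
interface SchemaReference { kind: "SchemaReference"; name: string; loc: SourceLocation; }
interface PolicyAnnotation { kind: "PolicyAnnotation"; policy: string; loc: SourceLocation; }

type SectionChild = TextBlock | SchemaReference | PolicyAnnotation;
interface Section { kind: "Section"; name: string; body: SectionChild[]; loc: SourceLocation; }

// Root node: optional frontmatter followed by one or more sections.
interface PromptFile {
  kind: "PromptFile";
  frontmatter: FrontmatterBlock | null;
  sections: Section[];
  loc: SourceLocation;
}
```

The `kind` discriminant lets rule code narrow node types with a plain `switch`, which is idiomatic TypeScript for visitor-style traversal.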
Error recovery is a design challenge. If the parser encounters an unexpected token, should it abort or try to recover and continue parsing? For a linter, recovery is important because you want to report as many issues as possible in a single run, not just the first syntax error. A common recovery strategy is “panic mode”: skip tokens until you find a synchronization point (like the next section header), then resume parsing from there.
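A hedged sketch of both ideas together, recursive descent plus panic-mode recovery, might look like this in TypeScript (the token and node shapes are simplified assumptions; the stream is assumed to end with an EOF token):

```typescript
interface Tok { type: string; value: string; line: number; }
interface SectionNode { name: string; body: string[]; line: number; }
interface ParseError { message: string; line: number; }

// Recursive-descent parser sketch: one function per grammar rule. On an
// unexpected token it records an error, then skips forward to the next
// SECTION_HEADER (the synchronization point) and resumes parsing.
function parseSections(tokens: Tok[]): { sections: SectionNode[]; errors: ParseError[] } {
  const sections: SectionNode[] = [];
  const errors: ParseError[] = [];
  let pos = 0;
  const peek = () => tokens[pos];

  const parseSection = (): SectionNode => {
    const header = tokens[pos++]; // caller guarantees this is SECTION_HEADER
    const body: string[] = [];
    while (peek().type === "TEXT_LINE") body.push(tokens[pos++].value);
    return { name: header.value, body, line: header.line };
  };

  while (peek().type !== "EOF") {
    if (peek().type === "SECTION_HEADER") {
      sections.push(parseSection());
    } else {
      // Panic mode: report the problem, then skip to the next section
      // boundary so later sections are still parsed and linted.
      errors.push({ message: `unexpected ${peek().type}`, line: peek().line });
      while (peek().type !== "SECTION_HEADER" && peek().type !== "EOF") pos++;
    }
  }
  return { sections, errors };
}
```

The key property for a linter is visible in the return type: parsing yields both a (partial) tree and a list of errors, rather than aborting at the first problem.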
How this fits into the project Grammar design is the foundation of Project 8. The grammar specification defines what the parser accepts, which determines what the AST looks like, which in turn determines what lint rules can inspect. Every other component in the project depends on the quality and completeness of the grammar.
Definitions & key terms
- BNF/EBNF: Backus-Naur Form / Extended BNF. Notation systems for writing context-free grammars. EBNF adds repetition (`*`, `+`) and optional (`?`) operators.
- Token: The smallest meaningful unit produced by the lexer (e.g., a keyword, a delimiter, a text block).
- AST (Abstract Syntax Tree): A tree representation of parsed source where each node corresponds to a grammar construct. Unlike a parse tree, the AST omits syntactically redundant tokens.
- Recursive descent parser: A top-down parser built from mutually recursive functions, one per grammar rule.
- PEG (Parsing Expression Grammar): A parsing formalism where alternatives are tried in order; the first match wins.
- Source location: File path, line, and column attached to every AST node for diagnostics.
Mental model diagram (ASCII)
Source .prompt File
|
v
+------------------+
| Lexer | character stream -> token stream
+------------------+
|
v
[FRONT_DELIM] [KEY:model] [COLON] [VALUE:gpt-4]
[FRONT_DELIM] [SECTION:system] [TEXT_BLOCK:...]
[SCHEMA_REF:@schema:Order] [SECTION:user] ...
|
v
+------------------+
| Parser | token stream -> AST
| (recursive |
| descent / PEG) |
+------------------+
|
v
PromptFile (root)
├── FrontmatterBlock
│ ├── MetadataEntry(model, "gpt-4")
│ ├── MetadataEntry(version, "2.1")
│ └── MetadataEntry(output_schema, "Order")
├── Section(system)
│ └── TextBlock("You are a helpful...")
├── Section(user)
│ └── TextBlock("{{user_query}}")
└── PolicyAnnotation(@policy:no-pii)
|
v
+------------------+
| Analysis Passes | AST -> Findings
+------------------+
How it works (step-by-step, with invariants and failure modes)
- The lexer reads the `.prompt` file and emits a token stream. Invariant: every byte in the source is accounted for by exactly one token. Failure mode: an unrecognized character (e.g., a stray control character) causes a lexer error with position.
- The parser consumes the token stream using grammar rules and builds the AST. Invariant: the AST structure matches the grammar production rules. Failure mode: an unexpected token triggers an error listing the expected alternatives and the actual token.
- If the parser encounters an error, it enters panic-mode recovery: skip forward to the next section delimiter and resume parsing. Invariant: recovery never silently drops a section boundary.
- After parsing, the AST is returned with all source location metadata attached. Invariant: every AST node has a valid file:line:column reference.
- Downstream analysis passes receive the AST and report their own errors separately from parse errors.
Minimal concrete example
EBNF grammar sketch for .prompt files:
prompt_file = [ frontmatter ] , section+ ;
frontmatter = "---" , metadata_line+ , "---" ;
metadata_line = KEY , ":" , VALUE , NEWLINE ;
section = section_header , body ;
section_header = "##" , SECTION_NAME , NEWLINE ;
body = ( text_line | schema_ref | policy_ann | include )* ;
schema_ref = "@schema:" , IDENTIFIER ;
policy_ann = "@policy:" , IDENTIFIER ;
include = "@include:" , FILE_PATH ;
Example .prompt file:
---
model: gpt-4
version: 2.1
output_schema: OrderResponse
---
## system
You are an order processing assistant.
@policy:no-pii
@schema:OrderResponse
## user
{{user_query}}
Common misconceptions
- “A grammar is overkill for prompt files; regex is enough.” Regex cannot handle nested structures, error recovery, or produce an AST. Even simple prompt files have enough structure (frontmatter, multiple sections, annotations) to justify a proper grammar.
- “The grammar must handle every possible prompt format.” Start minimal. Support the constructs your lint rules need. Extend the grammar when new rules require new AST node types.
- “Parse errors and lint errors are the same thing.” Parse errors mean the file does not conform to the grammar. Lint errors mean the file parses correctly but violates a policy or style rule. Conflating them confuses users and tools.
Check-your-understanding questions
- Why does the lexer need a modal state when parsing frontmatter versus section bodies?
- What information must every AST node carry for lint rules to produce useful diagnostics?
- When would you choose recursive descent over a PEG parser generator for this project?
Check-your-understanding answers
- Inside frontmatter (between `---` delimiters), the lexer must recognize KEY/COLON/VALUE patterns. Outside frontmatter, those same characters are just text content. Without modal state, the lexer would misclassify section body text as metadata.
- Every AST node must carry the source file path, start line number, start column, and end position. Without this, lint findings cannot point users to the exact location of the violation.
- Choose recursive descent when you need fine-grained control over error messages and recovery behavior, which is critical for a developer-facing linter. Choose a PEG generator when the grammar is complex and you want to iterate on rules quickly from a specification file.
Real-world applications
- Terraform, ESLint, and Prettier all start with a grammar and parser that produces an AST for analysis and transformation.
- GitHub Actions workflow files use a YAML grammar with additional semantic constraints, analogous to a prompt DSL with frontmatter.
- Shopify’s Liquid template language uses a grammar to separate template directives from content blocks.
Where you’ll apply it
- Phase 1 of this project: define the grammar, build the lexer and parser, and verify that fixture files parse into the expected AST shape.
- The AST produced here is consumed by the rule engine (Concept 2) and the CI reporter (Concept 3).
References
- “Language Implementation Patterns” by Terence Parr - Chapters on LL(k) parsing and tree construction
- “Compilers: Principles, Techniques, and Tools” by Aho, Lam, Sethi, Ullman - Chapter 4 on syntax analysis
- “Crafting Interpreters” by Robert Nystrom - Chapters on scanning and parsing (freely available online)
- PEG.js / Peggy documentation for JavaScript PEG parser generators
Key insights The grammar is the constitution of your DSL: every lint rule, every refactoring tool, and every CI check derives its authority from what the grammar declares legal.
Summary Grammar design for prompt DSLs means defining formal lexical and syntactic rules that transform free-form prompt text into a structured AST. The AST is the foundation for all analysis. Choosing the right parser strategy (recursive descent for control, PEG for specification speed) and designing robust error recovery determines whether the linter is useful in practice or abandoned after the first frustrating error message.
Homework/Exercises to practice the concept
- Write an EBNF grammar for a prompt file format that supports: frontmatter with at least 5 metadata keys, system/user/assistant sections, `@schema` references, and `@policy` annotations.
- Implement a hand-written lexer (in pseudocode) that handles the modal frontmatter vs body distinction and emits at least 8 distinct token types.
- Draw the expected AST (as a tree diagram) for a prompt file with 2 metadata entries, a system section containing a policy annotation, and a user section containing a schema reference.
Solutions to the homework/exercises
- The EBNF should have a clear `prompt_file` root production, a `frontmatter` production gated by `---` delimiters, and `section` productions that allow interleaved text, schema refs, and policy annotations. The grammar should explicitly handle newlines as significant whitespace (section boundaries) rather than ignoring them.
- The lexer pseudocode should track a `state` variable (IN_FRONTMATTER vs IN_BODY) and switch when encountering the second `---`. Token types should include at minimum: FRONT_DELIM, KEY, COLON, VALUE, SECTION_HEADER, TEXT_LINE, SCHEMA_REF, POLICY_ANN, and EOF.
- The AST tree should show PromptFile as root, with FrontmatterBlock and Section children. Each MetadataEntry node under FrontmatterBlock should have key/value fields. Each Section node should have a name field and children for TextBlock, SchemaReference, or PolicyAnnotation nodes. Every node should note file:line:col.
Static Policy Analysis for Prompts
Fundamentals Static analysis means examining source artifacts without executing them to find bugs, policy violations, and style issues. In traditional software engineering, linters like ESLint, Pylint, and Clippy work this way: they parse source code into an AST, walk the tree, and flag patterns that match known problems. Static policy analysis for prompts applies the same principle to prompt files. Instead of checking for unused variables or null pointer risks, you check for missing output schemas, unsafe tool-call patterns, prompt injection vulnerabilities, and style inconsistencies. The key advantage is that these checks run at authoring time or in CI, before the prompt ever reaches an LLM, which makes them fast, deterministic, and cheap.
Deep Dive into the concept The architecture of a prompt lint engine has three parts: a rule registry, a tree visitor, and a finding reporter.
The rule registry is a catalog of all lint rules. Each rule has an ID (like P008_NO_TOOL_CALL_WITHOUT_POLICY), a severity level (error, warning, info), a category (security, style, maintainability), and a visitor function that inspects AST nodes. Rules are loaded at startup and can be configured per-project through a lint config file that enables, disables, or adjusts severity of individual rules.
The tree visitor is the execution mechanism. The engine walks the AST produced by the parser (from Concept 1) in depth-first order. At each node, it invokes all registered rule visitors that care about that node type. For example, a rule checking “every prompt file must declare an output_schema in frontmatter” registers interest in the FrontmatterBlock node type. When the visitor reaches a FrontmatterBlock, it checks whether any MetadataEntry child has key output_schema. If not, it emits a finding. The visitor pattern decouples rule logic from traversal logic, making it easy to add new rules without modifying the engine.
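A minimal TypeScript sketch of this decoupling follows; the node and rule shapes are illustrative assumptions, and the example rule mirrors the `output_schema` check described above:

```typescript
interface AstNode { kind: string; children?: AstNode[]; [k: string]: unknown; }
interface Finding { ruleId: string; severity: "error" | "warning" | "info"; message: string; }

interface Rule {
  id: string;
  severity: "error" | "warning" | "info";
  nodeKind: string;                      // which AST node type this rule inspects
  check: (node: AstNode) => Finding[];   // pure function: never mutates the tree
}

// Single depth-first walk; at each node, dispatch to every rule registered
// for that node kind. Adding a rule requires no change to the traversal.
function runRules(root: AstNode, rules: Rule[]): Finding[] {
  const byKind = new Map<string, Rule[]>();
  for (const r of rules) byKind.set(r.nodeKind, [...(byKind.get(r.nodeKind) ?? []), r]);

  const findings: Finding[] = [];
  const walk = (node: AstNode) => {
    for (const rule of byKind.get(node.kind) ?? []) findings.push(...rule.check(node));
    for (const child of node.children ?? []) walk(child);
  };
  walk(root);
  return findings;
}

// Example rule: frontmatter must declare an output_schema key.
const requireSchema: Rule = {
  id: "P008_REQUIRE_OUTPUT_SCHEMA",
  severity: "error",
  nodeKind: "FrontmatterBlock",
  check: (node) => {
    const keys = (node.children ?? []).map((c) => c.key as string);
    return keys.includes("output_schema")
      ? []
      : [{ ruleId: "P008_REQUIRE_OUTPUT_SCHEMA", severity: "error",
           message: "Prompt file missing required 'output_schema' metadata." }];
  },
};
```

Registering a second rule is just another `Rule` object in the array passed to `runRules`; the walk itself never changes.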
Security lint rules deserve special treatment. These rules detect patterns that could lead to prompt injection, data leakage, or unauthorized tool execution. Examples: a prompt section that interpolates user input without sanitization markers, a tool-call declaration without an associated policy annotation, or a system section that lacks a safety preamble. Security rules should always be severity “error” and should be non-suppressible in CI mode. This prevents teams from silencing critical findings with inline disable comments.
Style rules enforce consistency across a prompt library. Examples: section ordering (system before user before assistant), metadata key naming conventions (snake_case), maximum prompt length per section, and required comment blocks explaining the prompt’s purpose. Style rules are typically severity “warning” and can be configured per team.
Maintainability rules catch structural issues that make prompts hard to evolve. Examples: prompts with no version metadata, prompts referencing schemas by path instead of by registered name (making refactoring fragile), sections exceeding a complexity threshold (measured by token count or nesting depth), and circular include chains.
The finding reporter takes the list of findings and formats them. The most important output format for CI integration is SARIF (Static Analysis Results Interchange Format), a JSON standard that GitHub Code Scanning, VS Code, and other tools can ingest. A SARIF finding includes: rule ID, severity, message, file path, start line, start column, end line, end column, and optional help URI. Producing SARIF output means your prompt linter integrates with the same tooling that handles ESLint, CodeQL, and Semgrep findings.
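As a sketch of the reporter side, the function below wraps findings in a minimal SARIF 2.1.0 envelope: one run, one tool, one result per finding. The `LintFinding` input shape is an assumption; a production report would also carry rule metadata and end positions:

```typescript
interface LintFinding {
  ruleId: string;
  level: "error" | "warning" | "note";  // SARIF levels; an "info" finding maps to "note"
  message: string;
  file: string;
  startLine: number;
  startColumn: number;
}

// Minimal SARIF 2.1.0 envelope: version, one run with a tool driver, and
// one result per finding with a physicalLocation region.
function toSarif(findings: LintFinding[], toolName: string, toolVersion: string) {
  return {
    version: "2.1.0",
    runs: [{
      tool: { driver: { name: toolName, version: toolVersion } },
      results: findings.map((f) => ({
        ruleId: f.ruleId,
        level: f.level,
        message: { text: f.message },
        locations: [{
          physicalLocation: {
            artifactLocation: { uri: f.file },
            region: { startLine: f.startLine, startColumn: f.startColumn },
          },
        }],
      })),
    }],
  };
}
```

Serializing the return value with `JSON.stringify` yields a document that CI systems can ingest as a SARIF artifact.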
Rule severity classification is a design decision with real consequences. If too many rules are “error,” developers will resist adoption. If too few are “error,” security issues will slip through. A good starting taxonomy: security rules are always error; maintainability rules are error if they cause CI/CD problems (like circular includes) and warning otherwise; style rules are warning or info.
How this fits into the project This concept drives Phase 2 of Project 8. Once the parser produces an AST (Concept 1), the rule engine walks that AST and produces findings. The quality and specificity of your rules determine whether the linter is actually useful to prompt authors.
Definitions & key terms
- Lint rule: A named, versioned check that inspects AST nodes and emits findings when a pattern matches.
- Visitor pattern: A design pattern where an object (the visitor) traverses a data structure (the AST) and performs operations at each node without modifying the node classes.
- Finding / Diagnostic: A single lint result containing rule ID, severity, message, and source location.
- SARIF (Static Analysis Results Interchange Format): A JSON-based standard for expressing static analysis results, supported by GitHub, VS Code, and major CI platforms.
- Rule severity: Classification of a finding’s importance: error (blocks merge), warning (should fix), info (advisory).
- Suppression: An inline annotation that silences a specific rule at a specific location. Security rules should be non-suppressible.
Mental model diagram (ASCII)
Parsed AST
|
v
+------------------+
| Rule Registry |
| ┌─────────────┐ |
| │ P001: style │ |
| │ P002: style │ |
| │ P008: secur │ |
| │ P012: maint │ |
| │ ... │ |
| └─────────────┘ |
+------------------+
|
v
+------------------+
| AST Visitor |
| (depth-first |
| walk, invoke |
| matching rules |
| at each node) |
+------------------+
|
v
+------------------+
| Findings List |
| ┌──────────────┐ |
| │ rule: P008 │ |
| │ sev: error │ |
| │ file:ln:col │ |
| │ message: ... │ |
| └──────────────┘ |
+------------------+
|
v
+-----------+-----------+
| |
v v
+----------------+ +-------------------+
| Console Report | | SARIF JSON Report |
| (human-readable| | (CI/IDE ingestion)|
| with colors) | | |
+----------------+ +-------------------+
How it works (step-by-step, with invariants and failure modes)
- Load the rule registry from the lint config file. Invariant: every rule has a unique ID, a severity, and a visitor function. Failure mode: duplicate rule ID causes a startup error.
- Receive the AST from the parser. Invariant: the AST has source locations on every node. Failure mode: if the parser produced error-recovery nodes, those are flagged as “parse-error” findings before rule analysis begins.
- Walk the AST depth-first. At each node, invoke all rules registered for that node type. Invariant: rules are pure functions of the AST node and its children; they do not modify the AST. Failure mode: a rule throws an exception; the engine catches it, logs the rule ID and node location, and continues with remaining rules.
- Collect all findings into a list sorted by file path, then line number. Invariant: findings are deterministic; same AST always produces same findings in same order. Failure mode: non-deterministic rule (e.g., one that uses random sampling) is rejected at registration time.
- Format findings into the requested output format (console, SARIF JSON, or both). Invariant: SARIF output validates against the SARIF JSON schema. Failure mode: a finding with missing source location produces a SARIF entry with a “region unknown” marker instead of crashing the reporter.
Minimal concrete example
Lint rule: P008_REQUIRE_OUTPUT_SCHEMA
node_type: FrontmatterBlock
severity: error
category: maintainability
visitor pseudocode:
FUNCTION visit(frontmatter_node):
keys = [entry.key FOR entry IN frontmatter_node.children
WHERE entry.type == MetadataEntry]
IF "output_schema" NOT IN keys:
EMIT Finding(
rule_id = "P008_REQUIRE_OUTPUT_SCHEMA",
severity = "error",
message = "Prompt file missing required 'output_schema' metadata.",
location = frontmatter_node.source_location
)
SARIF output fragment for this finding:
{
"ruleId": "P008_REQUIRE_OUTPUT_SCHEMA",
"level": "error",
"message": { "text": "Prompt file missing required 'output_schema' metadata." },
"locations": [{
"physicalLocation": {
"artifactLocation": { "uri": "prompts/order_handler.prompt" },
"region": { "startLine": 1, "startColumn": 1, "endLine": 4, "endColumn": 4 }
}
}]
}
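For comparison, the visitor pseudocode above could be written in TypeScript roughly as follows (the node shapes are the illustrative ones from Concept 1, not a fixed API):

```typescript
interface Loc { file: string; line: number; column: number; }
interface Entry { type: "MetadataEntry"; key: string; value: string; }
interface Frontmatter { type: "FrontmatterBlock"; children: Entry[]; sourceLocation: Loc; }
interface Finding { ruleId: string; severity: string; message: string; location: Loc; }

// Direct translation of the visitor pseudocode: collect metadata keys,
// emit one error-severity finding if output_schema is absent.
function visitFrontmatter(node: Frontmatter): Finding[] {
  const keys = node.children
    .filter((c) => c.type === "MetadataEntry")
    .map((c) => c.key);
  if (keys.includes("output_schema")) return [];
  return [{
    ruleId: "P008_REQUIRE_OUTPUT_SCHEMA",
    severity: "error",
    message: "Prompt file missing required 'output_schema' metadata.",
    location: node.sourceLocation,
  }];
}
```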
Common misconceptions
- “Lint rules should check prompt quality (whether the prompt produces good outputs).” Static analysis checks structure and policy at authoring time. Checking output quality requires running the prompt against an LLM, which is evaluation, not linting. The two are complementary but distinct.
- “A few rules are enough.” Production prompt libraries accumulate dozens of rules as teams discover new failure patterns. The rule engine must be designed for extensibility from day one.
- “SARIF is unnecessarily complex; plain text output is fine.” Plain text works for local development, but SARIF is what enables in-IDE annotations, GitHub PR annotations, and dashboarding. Supporting SARIF is what makes the linter a real tool rather than a script.
Check-your-understanding questions
- Why should security lint rules be non-suppressible in CI mode?
- What is the advantage of the visitor pattern over writing each rule as a standalone AST-walking function?
- How does producing SARIF output change the developer experience compared to plain console output?
Check-your-understanding answers
- If security rules can be suppressed, developers under deadline pressure will silence them rather than fix the underlying issue. Non-suppressible rules in CI ensure that security findings always block the merge, even if local runs allow overrides.
- The visitor pattern separates traversal from inspection. All rules share a single tree walk, which is efficient. Each rule only needs to declare which node types it cares about and what to check. Adding a new rule requires no changes to the traversal engine.
- SARIF output enables GitHub to show lint findings as inline PR annotations, VS Code to show them as squiggly underlines in the editor, and dashboards to track finding trends over time. This shifts linting from a CI log you scroll through to an integrated part of the development workflow.
Real-world applications
- ESLint (JavaScript) uses exactly this architecture: parser -> AST -> rule visitors -> findings -> formatters (including SARIF).
- Semgrep runs pattern-based static analysis over multiple languages using AST matching and outputs SARIF for CI integration.
- Terraform's `validate` command and `tflint` check infrastructure-as-code files for policy compliance before deployment.
Where you’ll apply it
- Phase 2 of this project: implement the rule engine, write initial rules for each category (security, style, maintainability), and produce SARIF output.
- The findings produced here feed into the CI gate (Concept 3).
References
- SARIF specification: OASIS Static Analysis Results Interchange Format (sarif-standard)
- “Language Implementation Patterns” by Terence Parr - Chapter on tree visitors and pattern matchers
- ESLint architecture documentation: rule authoring guide
- Semgrep documentation: custom rule authoring
Key insights The linter’s value is proportional to the specificity of its rules and the actionability of its findings; a vague warning that developers ignore is worse than no warning at all.
Summary Static policy analysis for prompts means building a rule engine that walks the prompt AST, checks each node against a registry of categorized rules, and produces structured findings (ideally in SARIF format) that integrate with CI pipelines and IDEs. The visitor pattern decouples rule logic from traversal, making the system extensible. Security rules must be non-suppressible; style rules should be configurable. The quality of error messages and source location precision determines whether developers adopt or ignore the tool.
Homework/Exercises to practice the concept
- Design 5 lint rules for a prompt DSL: 2 security rules, 2 style rules, and 1 maintainability rule. For each, specify: rule ID, severity, node type it inspects, the condition it checks, and the error message it produces.
- Write pseudocode for an AST visitor that walks a PromptFile tree and invokes registered rules. Show how a rule registers interest in a node type and how the visitor dispatches.
- Produce a sample SARIF JSON document containing 3 findings from different rules, showing proper use of ruleId, level, message, and location fields.
Solutions to the homework/exercises
- Security rules might include: `P_SEC_001: system section must not interpolate raw user input` (checks TextBlock children of system Section for template variable patterns without sanitization markers) and `P_SEC_002: tool-call declaration requires policy annotation` (checks Section nodes containing tool references for adjacent PolicyAnnotation siblings). Style rules: `P_STY_001: sections must appear in order system, user, assistant` (checks Section ordering in PromptFile children) and `P_STY_002: metadata keys must be snake_case` (checks MetadataEntry key fields against a regex). Maintainability: `P_MNT_001: prompt file must have version metadata` (checks FrontmatterBlock for a MetadataEntry with key "version").
- The visitor pseudocode should show a `registry` map from node type to list of rule functions, a `walk(node)` function that invokes matching rules then recurses into children, and rule functions that accept a node and return a (possibly empty) list of Finding objects.
- The SARIF document should have a top-level `runs` array with one run containing a `tool` object (linter name and version), a `results` array with 3 entries, each having `ruleId`, `level`, `message.text`, and `locations[0].physicalLocation` with `artifactLocation.uri` and `region` with line/column numbers.
Release Governance for Prompt Artifacts
Fundamentals Release governance for prompt artifacts means treating prompt files and their lint rules as versioned, auditable artifacts that flow through a CI/CD pipeline with explicit gate policies. In traditional software, you would never deploy a code change without running tests and getting a PR review. Prompt files deserve the same discipline. Without CI gates, prompt changes slip into production unchecked, lint rules silently diverge between teams, and nobody knows which version of a prompt is running where. This concept closes the loop: every prompt change is parsed, linted, gated, and recorded before it can merge. Every lint rule change is itself versioned and tested against a corpus of known-good and known-bad fixtures.
Deep Dive into the concept
The CI integration pipeline for prompt linting has a specific structure. When a developer opens a pull request that modifies .prompt files, the CI pipeline triggers the linter. The pipeline must: (1) parse all changed prompt files, (2) run the full rule set against the parsed ASTs, (3) classify findings by severity, (4) apply the gate policy (block on any error-severity finding, annotate warnings), and (5) publish the SARIF report as a PR check artifact.
Gate policies define what blocks a merge versus what merely warns. A strict policy blocks on any error-severity finding and requires zero security-category findings. A lenient policy might allow warnings to pass through but still block on security errors. The gate policy should be declared in a config file (e.g., .prompt-lint.yaml) that lives in the repository root, version-controlled alongside the prompt files. This means the policy itself is reviewable and auditable.
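A hypothetical `.prompt-lint.yaml` illustrating such a policy is sketched below; every key name here is invented for illustration, since the actual schema is whatever you design:

```yaml
# Hypothetical lint config -- key names are illustrative, not a standard.
ruleset: 1.7.0
gate:
  block_on: [error]            # any error-severity finding blocks the merge
  security_suppressible: false # security findings cannot be silenced
rules:
  P_STY_002:
    severity: warning          # per-rule severity override
  P_MNT_001:
    severity: warning          # grace period before promotion to error
    promote_to: error
    promote_on: "2025-09-01"   # override expires on this date
```

Because this file lives in the repository root, changes to the gate policy go through the same PR review as the prompts themselves.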
Version tracking has two dimensions: prompt file versions and lint rule set versions. Prompt files should carry a version field in their frontmatter (e.g., version: 2.3). Lint rule sets should be versioned as a whole (e.g., ruleset: 1.7.0) so that when a finding appears, you can determine which version of the rules produced it. This is critical for trend analysis: if finding counts spike, was it because new rules were added, or because prompt quality actually degraded?
Audit trail requirements connect findings to specific pipeline runs. Every lint execution should produce a report that records: the pipeline run ID, the git commit SHA, the ruleset version, the list of files checked, the list of findings, and the gate decision (pass/block). These reports should be stored as immutable artifacts (e.g., in CI artifact storage or an object store). This enables retroactive analysis: “when did we start seeing P008_MISSING_OUTPUT_SCHEMA findings in the payments team’s prompts?”
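One possible TypeScript shape for such an audit record is sketched below; the field names are assumptions that simply mirror the list above:

```typescript
// Illustrative shape for the immutable audit record of one lint execution.
interface LintAuditArtifact {
  runId: string;          // CI pipeline run ID
  commitSha: string;      // git commit the lint ran against
  rulesetVersion: string; // e.g. "1.7.0"
  filesChecked: string[];
  findings: { ruleId: string; severity: string; file: string; line: number }[];
  gateDecision: "pass" | "block";
  createdAt: string;      // ISO-8601 timestamp
}

// Serialize once and upload to artifact storage; the record is never
// mutated afterwards, which is what makes retroactive analysis trustworthy.
function serializeAudit(a: LintAuditArtifact): string {
  return JSON.stringify(a, null, 2);
}
```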
Rule set evolution is a governance challenge. When you add a new lint rule, it may immediately produce findings across the entire prompt corpus. If the rule is severity “error,” it blocks every PR until all existing prompts are fixed. A phased rollout strategy avoids this: introduce new rules at severity “warning” for a grace period, give teams time to remediate, then promote to “error.” The lint config file should support per-rule severity overrides with expiration dates to manage this transition.
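The severity-override-with-expiration mechanic can be sketched as a single resolution function (the shapes are assumptions; the point is that promotion to "error" happens automatically when the grace period ends):

```typescript
type Severity = "error" | "warning" | "info";

interface SeverityOverride {
  severity: Severity; // severity applied during the grace period
  expires?: string;   // ISO date; after this, the rule's base severity applies
}

// Resolve the effective severity of a rule: an unexpired override wins,
// otherwise the base severity applies. A new rule can therefore run as
// "warning" during rollout and auto-promote to "error" on the expiry date.
function effectiveSeverity(
  base: Severity,
  override: SeverityOverride | undefined,
  now: Date,
): Severity {
  if (!override) return base;
  if (override.expires && now >= new Date(override.expires)) return base;
  return override.severity;
}
```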
Rollback of lint rules is also a real scenario. If a new rule produces false positives in production-critical PRs, you need a way to quickly revert the ruleset version without reverting unrelated changes. Treating the ruleset as a versioned artifact with a clear release process (not just code in main) enables this.
How this fits into the project This concept drives Phase 3 of Project 8 and determines whether the linter is actually adopted by teams. A linter that is not integrated into CI is a linter that nobody runs.
Definitions & key terms
- Gate policy: A rule that maps lint finding severities to CI actions (block merge, annotate PR, allow through).
- Ruleset version: A version identifier for the collection of lint rules applied during a pipeline run.
- Lint config file: A repository-level configuration (e.g., `.prompt-lint.yaml`) that specifies which rules are enabled, their severity overrides, and the gate policy.
- Audit artifact: An immutable record of a lint execution including run ID, commit SHA, ruleset version, findings, and gate decision.
- Grace period: A time window during which a new rule is enforced at a lower severity to allow teams to remediate before it blocks merges.
- Phased rollout: The practice of introducing new lint rules at warning severity before promoting them to error severity.
Mental model diagram (ASCII)
Developer opens PR with .prompt file changes
|
v
+-------------------+
| CI Pipeline |
| Triggered |
+-------------------+
|
v
+-------------------+ +-------------------+
| Parse .prompt |---->| Lint with |
| files (Concept 1)| | Rule Engine |
+-------------------+ | (Concept 2) |
+-------------------+
|
v
+-------------------+
| Gate Policy |
| Evaluation |
| |
| errors > 0? |
| security > 0? |
+-------------------+
/ \
v v
+----------+ +-----------+
| BLOCK | | PASS |
| merge | | merge |
| (red | | (green |
| check) | | check) |
+----------+ +-----------+
\ /
v v
+-------------------+
| Publish SARIF |
| + Audit Artifact |
| (run_id, sha, |
| ruleset_ver, |
| findings, gate |
| decision) |
+-------------------+
How it works (step-by-step, with invariants and failure modes)
- A PR is opened or updated with changes to `.prompt` files or lint rule definitions. Invariant: the CI pipeline is triggered on every PR that touches these paths. Failure mode: misconfigured CI path filters cause the pipeline to skip prompt file changes.
- The pipeline checks out the code, installs the linter, and loads the lint config. Invariant: the lint config file is present and valid. Failure mode: missing or malformed config causes the pipeline to fail with a clear error (not a silent pass).
- The linter parses all changed `.prompt` files and runs the rule engine. Invariant: parse and lint are deterministic; the same input always produces the same findings. Failure mode: a non-deterministic rule causes inconsistent gate decisions across re-runs.
- The gate policy evaluates findings. Invariant: error-severity security findings always block; this is non-negotiable. Failure mode: a misconfigured gate policy allows security errors through; defense: the linter hardcodes that security-error findings always block regardless of config.
- The pipeline publishes the SARIF report and the audit artifact. Invariant: the audit artifact is immutable after creation. Failure mode: artifact storage failure; defense: the pipeline retries artifact upload and fails the build if storage is unreachable.
- The PR displays inline annotations from the SARIF report. Invariant: annotations point to correct file:line:column positions. Failure mode: source locations are off due to git diff line number shifting; defense: the linter runs against the PR head commit, not the diff.
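The gate-evaluation step above, including the hardcoded defense for security errors, can be sketched in a few lines of TypeScript. The shapes are illustrative; the one non-negotiable behavior is that a security error blocks even when the configured policy says otherwise.

```typescript
// Gate evaluation: security errors always block, regardless of the
// configured policy; other errors block only if the policy says so.
type Severity = "error" | "warning" | "info";

interface Finding {
  ruleId: string;
  severity: Severity;
  category: "security" | "style" | "maintainability";
}

interface GatePolicy {
  onError: "block" | "annotate" | "silent";
}

function evaluateGate(findings: Finding[], policy: GatePolicy): "pass" | "block" {
  for (const f of findings) {
    // Misconfiguration defense: this branch ignores the policy on purpose.
    if (f.category === "security" && f.severity === "error") return "block";
    if (f.severity === "error" && policy.onError === "block") return "block";
  }
  return "pass";
}
```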
Minimal concrete example
`.prompt-lint.yaml`:

ruleset_version: "1.7.0"
gate_policy:
  on_error: block
  on_warning: annotate
  on_info: silent
rules:
  P008_REQUIRE_OUTPUT_SCHEMA:
    severity: error
  P_STY_001_SECTION_ORDER:
    severity: warning
  P_SEC_002_TOOL_POLICY_REQUIRED:
    severity: error
    suppressible: false
  P_MNT_003_MAX_SECTION_TOKENS:
    severity: warning
    threshold: 2000
    grace_period_until: "2025-09-01"
CI config snippet (GitHub Actions):
- name: Lint prompt files
  run: npx prompt-lint --config .prompt-lint.yaml --sarif out/lint.sarif
- name: Upload SARIF
  uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: out/lint.sarif
- name: Upload audit artifact
  uses: actions/upload-artifact@v4
  with:
    name: prompt-lint-audit-${{ github.run_id }}
    path: out/lint-audit.json
Common misconceptions
- “Running the linter locally is enough; CI integration is optional.” Local runs are skipped under deadline pressure. CI gates are the only reliable enforcement point. If the linter is not in CI, it might as well not exist.
- “Adding a new lint rule is a small change.” A new error-severity rule can block every open PR in the repository. New rules must be rolled out with a grace period at warning severity first.
- “Audit artifacts are only useful for compliance.” Audit artifacts are the primary tool for debugging lint regressions (“when did this rule start firing?”), tuning gate policies, and tracking team-level prompt quality trends over time.
Check-your-understanding questions
- Why should the gate policy be stored in a version-controlled config file rather than hardcoded in the CI pipeline?
- What is the risk of introducing a new error-severity lint rule without a grace period?
- What information must the audit artifact contain to enable retroactive analysis of lint trends?
Check-your-understanding answers
- A version-controlled config file makes the gate policy reviewable, auditable, and consistent across branches. Hardcoding in CI means changes to the policy are hidden in pipeline config that most developers never review.
- Every open PR in the repository that touches prompt files will be blocked until the existing prompts are remediated. This creates a brownout where no prompt changes can merge, frustrating teams and potentially leading to pressure to disable the rule entirely.
- At minimum: pipeline run ID, git commit SHA, ruleset version, list of files checked, list of findings (with rule IDs, severities, and locations), and the gate decision (pass or block with reason).
Real-world applications
- GitHub uses CodeQL in CI pipelines with SARIF upload and PR annotations for security analysis of code changes.
- Google’s large-scale code review system enforces style and correctness rules at submission time with automated checks.
- Terraform Cloud runs `terraform validate` and policy checks (Sentinel/OPA) as gate conditions before infrastructure changes can be applied.
Where you’ll apply it
- Phase 3 of this project: configure the CI pipeline, define the gate policy, implement audit artifact generation, and verify that the end-to-end flow (PR -> lint -> gate -> annotate -> artifact) works correctly.
- This concept also feeds into Project 15 (Prompt Registry) where prompt versions and lint results become part of a centralized catalog.
References
- GitHub Code Scanning and SARIF upload documentation
- “Accelerate” by Forsgren, Humble, Kim - Chapters on CI/CD practices and change failure rates
- Open Policy Agent (OPA) documentation for policy-as-code patterns
- Google Engineering Practices: code review developer guide
Key insights A linter without CI integration is a suggestion; a linter with CI gates, SARIF output, and audit artifacts is a governance system.
Summary Release governance for prompt artifacts means embedding the prompt linter into CI/CD pipelines with explicit gate policies (block on error, annotate on warning), version-tracked rule sets with phased rollout strategies, and immutable audit artifacts that enable trend analysis and retroactive debugging. The gate policy lives in a version-controlled config file. Security-error findings always block, no exceptions. New rules are introduced at warning severity with a grace period before promotion to error. Every pipeline run produces an audit artifact linking the commit SHA, ruleset version, findings, and gate decision.
Homework/Exercises to practice the concept
- Write a `.prompt-lint.yaml` config file with at least 6 rules across security, style, and maintainability categories. Include one rule with a grace period and one non-suppressible security rule. Define the gate policy.
- Design a CI pipeline (in pseudocode or GitHub Actions YAML) that runs the linter, uploads SARIF, and produces an audit artifact. Include a step that fails the build if the gate policy blocks.
- Describe a scenario where a new lint rule is rolled out poorly (causes widespread PR blockage) and write the rollback procedure.
Solutions to the homework/exercises
- The config should show `gate_policy` with `on_error: block` and `on_warning: annotate`. Rules should include at least one `suppressible: false` security rule and one with `grace_period_until` set to a future date. The rule at grace period should have `severity: warning` with a note that it will be promoted to `severity: error` after the date passes.
- The CI pipeline should have steps: checkout, install linter, run linter with `--config` and `--sarif` flags, upload SARIF using `github/codeql-action/upload-sarif`, upload audit artifact using `actions/upload-artifact`, and a final step that checks the linter exit code (0 = pass, non-zero = block). The audit artifact step should run even if the lint step fails (using `if: always()`).
- The rollback scenario: a new rule `P_MNT_005_NO_DEEP_NESTING` is added at severity “error” without a grace period. 40 open PRs across 8 teams are immediately blocked. The rollback procedure: (1) open a PR that sets the rule’s severity to “warning” in `.prompt-lint.yaml`, (2) fast-track merge that config change, (3) re-run blocked PRs, (4) schedule a team-by-team remediation plan, (5) set a `grace_period_until` date, (6) promote to “error” after remediation is complete.
3. Project Specification
3.1 What You Will Build
A domain-specific prompt language and linter that enforces style, safety, and maintainability rules in CI.
3.2 Functional Requirements
- Define DSL syntax for prompt metadata, sections, and output schema refs.
- Parse prompt files into AST and run lint rules.
- Classify lint findings by severity (error/warn/info).
- Fail CI on security-critical rules.
3.3 Non-Functional Requirements
- Performance: Lint 100 prompt files under 5 seconds.
- Reliability: Lint output ordering and codes are deterministic.
- Security/Policy: Security lint rules are non-bypassable in CI mode.
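The deterministic-ordering requirement above is usually satisfied by sorting findings on a stable composite key before emitting any report, so re-runs and parallel lint workers always produce byte-identical output. A minimal sketch:

```typescript
// Deterministic finding order: sort by file path, then line, then rule ID.
// With this in place, the SARIF report and console output are stable
// across re-runs regardless of traversal or worker scheduling order.
interface Finding {
  file: string;
  line: number;
  ruleId: string;
}

function sortFindings(findings: Finding[]): Finding[] {
  return [...findings].sort(
    (a, b) =>
      a.file.localeCompare(b.file) ||
      a.line - b.line ||
      a.ruleId.localeCompare(b.ruleId),
  );
}
```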
3.4 Example Usage / Output
$ npm run lint:prompts --workspace p08-prompt-dsl
> p08-prompt-dsl@1.0.0 lint:prompts
[INFO] Parsed 37 prompt files
[PASS] style rules: 37/37
[PASS] safety rules: 37/37
[PASS] required metadata blocks present
[INFO] SARIF report: out/p08/lint.sarif
3.5 Data Formats / Schemas / Protocols
- Prompt DSL files (`.prompt`) with frontmatter + sections.
- AST JSON dump for debugging parser behavior.
- SARIF output for GitHub code scanning ingestion.
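A minimal SARIF 2.1.0 skeleton covering the handful of fields GitHub code scanning needs for inline PR annotations might look like the following. This is a sketch of the report shape, not an exhaustive SARIF implementation; consult the SARIF spec for optional fields like rule metadata and fingerprints.

```typescript
// Build a minimal SARIF 2.1.0 document from lint findings.
// SARIF uses "note" where this linter says "info".
interface Finding {
  ruleId: string;
  level: "error" | "warning" | "note";
  message: string;
  file: string;
  line: number;
}

function toSarif(findings: Finding[]) {
  return {
    version: "2.1.0",
    runs: [
      {
        tool: { driver: { name: "prompt-lint" } },
        results: findings.map((f) => ({
          ruleId: f.ruleId,
          level: f.level,
          message: { text: f.message },
          locations: [
            {
              physicalLocation: {
                artifactLocation: { uri: f.file }, // repo-relative path
                region: { startLine: f.line },
              },
            },
          ],
        })),
      },
    ],
  };
}
```

Stable `ruleId` values matter here: code scanning groups and tracks alerts by rule ID across runs, so renaming a rule resets its trend history.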
3.6 Edge Cases
- Prompt file includes unknown metadata key.
- Nested include chain introduces circular reference.
- Rule depends on schema file that moved path.
- Large monolithic prompt exceeds complexity threshold.
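The circular-include edge case above is classic cycle detection: a depth-first walk over include edges with a path stack, where revisiting a file already on the stack means a cycle. A sketch, assuming the include graph has already been extracted into a map:

```typescript
// Detect a circular include chain. `includes` maps each file to the
// files it includes; returns the cycle path if one is reachable from
// `entry`, otherwise null.
function findIncludeCycle(
  includes: Map<string, string[]>,
  entry: string,
): string[] | null {
  const stack: string[] = [];
  const onStack = new Set<string>();

  function visit(file: string): string[] | null {
    if (onStack.has(file)) {
      // Cycle found: report the path from the first occurrence back to here,
      // which makes a precise, actionable lint message possible.
      return [...stack.slice(stack.indexOf(file)), file];
    }
    stack.push(file);
    onStack.add(file);
    for (const next of includes.get(file) ?? []) {
      const cycle = visit(next);
      if (cycle) return cycle;
    }
    stack.pop();
    onStack.delete(file);
    return null;
  }
  return visit(entry);
}
```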
3.7 Real World Outcome
This section is your golden reference. Your implementation is considered correct when your run looks materially like this and produces the same artifact types.
3.7.1 How to Run (Copy/Paste)
$ npm run lint:prompts --workspace p08-prompt-dsl
- Working directory: `project_based_ideas/AI_AGENTS_LLM_RAG/PROMPT_ENGINEERING_PROJECTS`
- Required inputs: project fixtures under `fixtures/`
- Output directory: `out/p08`
3.7.2 Golden Path Demo (Deterministic)
Use the fixed seed already embedded in the command or config profile. You should see stable pass/fail totals between runs.
3.7.3 If CLI: exact terminal transcript
$ npm run lint:prompts --workspace p08-prompt-dsl
> p08-prompt-dsl@1.0.0 lint:prompts
[INFO] Parsed 37 prompt files
[PASS] style rules: 37/37
[PASS] safety rules: 37/37
[PASS] required metadata blocks present
[INFO] SARIF report: out/p08/lint.sarif
$ echo $?
0
Failure demo:
$ npm run lint:prompts --workspace p08-prompt-dsl -- --file prompts/bad/policy_violation.prompt
[ERROR] prompts/bad/policy_violation.prompt:12 rule P008_NO_TOOL_CALL_WITHOUT_POLICY failed
[ERROR] prompts/bad/policy_violation.prompt:19 rule P008_MISSING_OUTPUT_SCHEMA failed
$ echo $?
2
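The transcripts above imply an exit-code convention: 0 for a passing gate, 2 for a policy block. The specific codes (and reserving 1 for linter crashes) are assumptions drawn from the demo output, not a documented contract, but making the mapping explicit in one function keeps CI behavior predictable:

```typescript
// Exit-code convention inferred from the demo transcripts (assumption):
// 0 = gate passed, 2 = gate blocked, 1 = linter crashed.
// Crashes must exit non-zero so a broken linter never silently passes.
function exitCodeFor(decision: "pass" | "block", crashed: boolean): number {
  if (crashed) return 1;
  return decision === "pass" ? 0 : 2;
}
```

Distinguishing "blocked" (2) from "crashed" (1) lets the pipeline surface a policy failure differently from an infrastructure failure.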
4. Solution Architecture
4.1 High-Level Design
User Input / Trigger
|
v
+-------------------------+
| Parser |
+-------------------------+
|
v
+-------------------------+
| Rule Engine |
+-------------------------+
|
v
+-------------------------+
| CI Reporter |
+-------------------------+
|
v
Artifacts / API / UI / Logs
4.2 Key Components
| Component | Responsibility | Key Decisions |
|-----------|----------------|---------------|
| Parser | Converts DSL text into AST. | Fail with precise line/column diagnostics. |
| Rule Engine | Runs style/safety/maintainability checks. | Security rules have highest severity and block merges. |
| CI Reporter | Exports findings to SARIF/JSON. | Provide stable rule IDs for trend tracking. |
4.3 Data Structures (No Full Code)
P08_Request:
- trace_id
- input payload/context
- policy profile
P08_Decision:
- status (ALLOW | DENY | RETRY | ESCALATE | PROMOTE | ROLLBACK)
- reason_code
- artifact pointers
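The outlined data structures can be sketched as TypeScript types. Field names follow the outline above and are a starting point, not a fixed schema:

```typescript
// Sketch of the request/decision shapes from the outline above.
type P08Status = "ALLOW" | "DENY" | "RETRY" | "ESCALATE" | "PROMOTE" | "ROLLBACK";

interface P08Request {
  traceId: string;        // deterministic trace metadata
  payload: unknown;       // input payload/context
  policyProfile: string;  // which policy profile to apply
}

interface P08Decision {
  status: P08Status;
  reasonCode: string;     // unified, machine-readable failure/success reason
  artifacts: string[];    // pointers (paths/URIs) to produced artifacts
}
```

Keeping `reasonCode` machine-readable (an enum-like string, not free prose) is what makes automated gating and trend reporting possible downstream.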
4.4 Algorithm Overview
Key algorithm: Policy-aware decision pipeline
- Normalize input and attach deterministic trace metadata.
- Run contract/schema validation and project-specific core checks.
- Apply policy gates and decide: success, retry, deny, escalate, or rollback.
- Persist artifacts and publish operational metrics.
Complexity Analysis (conceptual):
- Time: O(n) over fixture/request items in a batch run.
- Space: O(n) for traces and report artifacts.
5. Implementation Guide
5.1 Development Environment Setup
# 1) Install dependencies
# 2) Prepare fixtures under fixtures/
# 3) Run the project command(s) listed in section 3.7
5.2 Project Structure
p08/
├── src/
├── fixtures/
├── policies/
├── out/
└── README.md
5.3 The Core Question You’re Answering
“How do I make prompts reviewable, lintable, and maintainable like normal code?”
This question matters because it forces the project to produce objective evidence instead of relying on subjective prompt impressions.
5.4 Concepts You Must Understand First
- DSL grammar design
- Why does this concept matter for P08?
- Book Reference: “Language Implementation Patterns” by Terence Parr
- Static analysis rules
- Why does this concept matter for P08?
- Book Reference: Compiler linting and AST rule engines
- Policy-as-lint checks
- Why does this concept matter for P08?
- Book Reference: Secure coding standards adapted for prompts
5.5 Questions to Guide Your Design
- Boundary and contracts
- What is the smallest safe contract surface for prompt dsl + linter?
- Which failure reasons must be explicit and machine-readable?
- Runtime policy
- What is allowed automatically, what needs retry, and what must escalate?
- Which policy checks must happen before any side effect?
- Evidence and observability
- What traces/metrics are required for fast incident triage?
- What specific thresholds trigger rollback or human review?
5.6 Thinking Exercise
Pre-Mortem for Prompt DSL + Linter
Before implementing, write down 10 ways this project can fail in production. Classify each failure into: contract, policy, security, or operations.
Questions to answer:
- Which failures can be prevented before runtime?
- Which failures require runtime detection and escalation?
5.7 The Interview Questions They’ll Ask
- “Why introduce a DSL instead of plain markdown prompts?”
- “How do you design lint rules that stay useful over time?”
- “What belongs in a prompt AST?”
- “How would you make lint findings developer-friendly?”
- “Which prompt issues are best caught statically?”
5.8 Hints in Layers
Hint 1: Start with smallest grammar Support only mandatory constructs first.
Hint 2: Write fixture files per rule Every lint rule needs pass/fail fixtures.
Hint 3: Separate parser from rule engine Keep syntax errors distinct from policy errors.
Hint 4: Integrate with CI early Prompt linting only works when merged into developer workflow.
5.9 Books That Will Help
| Topic | Book | Chapter |
|-------|------|---------|
| DSL construction | “Language Implementation Patterns” by Terence Parr | Core pattern chapters |
| Parser techniques | “Compilers: Principles, Techniques, and Tools” | Parsing + semantic analysis |
| DevEx at scale | “Accelerate” by Forsgren et al. | Change quality chapters |
5.10 Implementation Phases
Phase 1: Foundation
- Define contracts, policy profiles, and deterministic fixtures.
- Build the core execution path and baseline artifact output.
- Checkpoint: One golden-path scenario runs end-to-end with trace id and artifact.
Phase 2: Core Functionality
- Add project-specific evaluation/routing/verification logic.
- Add error paths with unified reason codes.
- Checkpoint: Golden-path and one failure-path both behave deterministically.
Phase 3: Operational Hardening
- Add metrics, trend reporting, and release/rollback or escalation gates.
- Document runbook and incident/debug flow.
- Checkpoint: Team member can reproduce output from clean checkout.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|----------|---------|----------------|-----------|
| Validation order | Late checks vs early checks | Early checks | Fail-fast saves cost and reduces unsafe execution |
| Failure handling | Silent retries vs explicit reason codes | Explicit reason codes | Enables automation and faster debugging |
| Rollout/escalation | Manual-only vs policy-driven | Policy-driven with manual override | Balances speed and safety |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|----------|---------|----------|
| Unit Tests | Validate deterministic building blocks | schema checks, policy gates, parser behaviors |
| Integration Tests | Verify end-to-end project path | golden-path command/API flow |
| Edge Case Tests | Ensure robust failure handling | malformed fixture, blocked policy action |
6.2 Critical Test Cases
- Golden path succeeds and emits expected artifact shape.
- High-risk/invalid path returns deterministic error with reason code.
- Replay with same seed/config yields same decision summary.
6.3 Test Data
fixtures/golden_case.*
fixtures/failure_case.*
fixtures/edge_cases/*
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---------|---------|----------|
| “Lint rules are noisy” | Rules are too broad or not context-aware. | Add rule-scoping and suppressions with justification. |
| “Parser errors are hard to fix” | Diagnostics lack position and expected token. | Return line/column + nearest valid grammar hint. |
| “Teams bypass lint locally” | Rules only run manually. | Enforce lint in CI and pre-commit hooks. |
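The "parser errors are hard to fix" pitfall is addressed by a diagnostic type that always carries position plus the expected-token hint. A minimal sketch (names illustrative):

```typescript
// A parser diagnostic with position and an expected-token hint, so the
// message tells the author both where parsing failed and what would
// have been valid there.
interface ParseDiagnostic {
  file: string;
  line: number;
  column: number;
  found: string;      // the token actually seen
  expected: string[]; // tokens that the grammar would have accepted here
}

function formatDiagnostic(d: ParseDiagnostic): string {
  return (
    `${d.file}:${d.line}:${d.column} ` +
    `unexpected '${d.found}'; expected ${d.expected.join(" or ")}`
  );
}
```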
7.2 Debugging Strategies
- Re-run deterministic fixtures with fixed seed and compare trace ids.
- Diff latest artifacts against last known-good baseline.
- Isolate whether failure is contract, policy, or runtime dependency related.
7.3 Performance Traps
- Unbounded retries inflate latency and cost.
- Overly broad logging can slow hot paths.
- Missing cache/canonicalization can create avoidable compute churn.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add one new fixture category and expected outcome labels.
- Add one new reason code with deterministic validation.
8.2 Intermediate Extensions
- Add dashboard-ready trend exports.
- Add automated regression diff against previous run artifacts.
8.3 Advanced Extensions
- Integrate with rollout gates or human approval workflows.
- Add chaos-style fault injection and recovery assertions.
9. Real-World Connections
9.1 Industry Applications
- PromptOps platform teams operating AI features under compliance constraints.
- Internal AI governance tooling for release safety and incident response.
9.2 Related Open Source Projects
- LangChain/LangSmith style eval and tracing workflows.
- OpenTelemetry-based observability stacks for decision traces.
9.3 Interview Relevance
- Demonstrates ability to convert probabilistic model behavior into deterministic software guarantees.
- Shows practical production-thinking: contracts, policies, monitoring, and operational controls.
10. Resources
10.1 Essential Reading
- OpenAI/Anthropic/Google provider docs for structured outputs, tool calling, and prompt controls.
- OWASP LLM Top 10 and NIST AI RMF guidance for safety and governance.
10.2 Video Resources
- Talks on LLM eval systems, PromptOps, and AI safety operations.
10.3 Tools & Documentation
- JSON schema validators, policy engines, and tracing infrastructure docs.
10.4 Related Projects in This Series
- Previous projects: build specialized primitives.
- Next projects: integrate these primitives into broader operational systems.
11. Self-Assessment Checklist
11.1 Understanding
- I can explain the core risk boundaries and policy gates for this project.
- I can explain the artifact format and why each field exists.
- I can justify the release/escalation criteria.
11.2 Implementation
- Golden-path and failure-path flows both work.
- Deterministic artifacts are produced and reproducible.
- Observability fields are present for debugging and audits.
11.3 Growth
- I can describe one tradeoff I made and why.
- I can explain this project design in an interview setting.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Golden path works with deterministic output artifact.
- At least one failure-path scenario returns unified error shape/reason code.
- Core metrics are emitted and documented.
Full Completion:
- Includes automated tests, trend reporting, and reproducible runbook.
- Includes operational thresholds for promote/rollback or escalate/approve.
Excellence (Above & Beyond):
- Integrates with adjacent projects (registry, rollout, firewall, HITL) cleanly.
- Demonstrates incident drill replay and fast root-cause workflow.