Project 02: Prompt Injection Firewall

Build an input firewall that detects prompt injection and routes or blocks risky inputs with explicit policy decisions.

Quick Reference

Attribute	Value
Difficulty	Level 3
Time Estimate	1-2 weeks
Main Programming Language	Python
Alternative Programming Languages	JavaScript, Go
Coolness Level	4
Business Potential	4
Prerequisites	Basic APIs, JSON handling, threat modeling
Key Topics	prompt injection, detectors, policy thresholds

1. Learning Objectives

By completing this project, you will:

Integrate at least one injection detector
Define a policy-driven block/allow workflow
Measure false positives on a benign set
Produce deterministic test transcripts

2. All Theory Needed (Per-Concept Breakdown)

Guardrails Control Plane Fundamentals

Fundamentals

A guardrails control plane is the set of policies, detectors, validators, and decision rules that sit around an LLM agent to ensure it behaves safely and predictably. Unlike traditional input validation, guardrails must handle probabilistic outputs, ambiguous intent, and adversarial prompts. The control plane therefore spans the entire lifecycle of an interaction: input filtering, context validation (including RAG sources), output moderation, and tool-use permissioning. Frameworks such as Guardrails AI and NeMo Guardrails provide structured validation and dialogue control, while models like Prompt Guard or Llama Guard provide detection signals that must be interpreted by policy. The core insight is that no single framework enforces safety end-to-end; you must compose multiple controls and define how they interact.

Deep Dive into the concept

A control plane begins with policy: a formal definition of what is allowed and why. Governance frameworks like the NIST AI RMF and ISO/IEC 42001 provide the organizational structure for this policy layer, while OWASP LLM Top 10 provides a security taxonomy for risks. Policies must be translated into guardrail rules that are actionable: detect prompt injection in untrusted data, validate output schemas before tools execute, and block unsafe categories. This translation is non-trivial because LLM outputs are probabilistic and context-sensitive. A policy such as “never leak secrets” must be expressed as a chain of checks: input scanning for malicious prompts, context segmentation for untrusted data, output moderation to catch leakage, and tool gating to prevent exfiltration. Each check introduces uncertainty and cost, which means policy must include thresholds, confidence levels, and escalation paths.

Detectors such as Prompt Guard, Lakera Guard, or Rebuff provide risk scores and categories for injection attempts.  These detectors are probabilistic and therefore require calibration. The control plane must normalize detector outputs into a shared risk scale, define what “block” vs “review” means, and log the decision context for later auditing. Output guardrails such as Llama Guard or OpenAI moderation detect unsafe content in generated responses.  These checks must be aligned to your own taxonomy; the model’s categories may not map exactly to your policy. This is why evaluation and red-teaming are crucial: without testing, you do not know if your thresholds or taxonomy mappings are effective.
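The normalization step described above can be sketched as follows. This is a minimal illustration, assuming two hypothetical detectors that report scores on different scales; `RiskSignal`, `normalize`, `detector_a`, and `detector_b` are invented names and do not reflect the actual response formats of Prompt Guard, Lakera Guard, or Rebuff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RiskSignal:
    """Detector output mapped onto a shared 0.0-1.0 risk scale."""
    detector: str
    category: str
    score: float


def normalize(detector: str, category: str, raw: float, raw_max: float = 1.0) -> RiskSignal:
    """Rescale a raw detector score to [0, 1] and clamp out-of-range values."""
    if raw_max <= 0:
        raise ValueError("raw_max must be positive")
    score = min(max(raw / raw_max, 0.0), 1.0)
    return RiskSignal(detector=detector, category=category, score=round(score, 4))


# Hypothetical raw outputs: one detector reports a probability, the other a 0-100 score.
signals = [
    normalize("detector_a", "prompt_injection", 0.84),
    normalize("detector_b", "prompt_injection", 91, raw_max=100),
]
worst = max(signals, key=lambda s: s.score)
print(worst.detector, worst.score)
```

Taking the maximum across detectors is only one possible aggregation rule; a linear rescale also does not make scores from different detectors statistically comparable, which is why calibration against your own data is still needed.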

Structured output validation adds determinism to a probabilistic system. Guardrails AI uses validators and schema checks to ensure outputs conform to an expected structure, enabling safer tool calls and data extraction. NeMo Guardrails extends this by introducing Colang, a flow language that constrains dialogue paths and allows explicit safety steps, such as mandatory disclaimers or confirmation prompts. These frameworks provide building blocks, but they do not decide how to integrate them into a business context. For example, a schema validator can ensure a tool call is syntactically correct, but only policy can decide if that tool call should be allowed at all. This is why tool permissioning and sandboxing are critical complementary pieces that most guardrails frameworks do not provide natively.
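To make the distinction between "well-formed" and "permitted" concrete, here is a small sketch. The tool names, the schema table, and the per-source allowlist are all hypothetical; it uses plain Python rather than the Guardrails AI or NeMo Guardrails APIs.

```python
import json

# Hypothetical tool schemas and allowlist; names are illustrative only.
TOOL_SCHEMAS = {"send_email": {"to": str, "body": str}, "read_file": {"path": str}}
ALLOWED_TOOLS_BY_SOURCE = {"user": {"send_email", "read_file"}, "retrieved": {"read_file"}}


def is_well_formed(call: dict) -> bool:
    """Validator: the call names a known tool and its arguments match the schema."""
    schema = TOOL_SCHEMAS.get(call.get("tool"))
    args = call.get("args")
    if schema is None or not isinstance(args, dict) or set(args) != set(schema):
        return False
    return all(isinstance(args[name], kind) for name, kind in schema.items())


def is_permitted(call: dict, source: str) -> bool:
    """Policy: the tool is allowed for content that originated from this source."""
    return call.get("tool") in ALLOWED_TOOLS_BY_SOURCE.get(source, set())


call = json.loads('{"tool": "send_email", "args": {"to": "a@example.com", "body": "hi"}}')
print(is_well_formed(call), is_permitted(call, "retrieved"))
```

The same call passes validation but is denied by policy when it was triggered by retrieved content, which is the gap a schema check alone cannot close.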

Evaluation is the evidence layer. Tools like garak and OpenAI Evals allow you to run red-team tests and custom evaluation suites to measure whether guardrails are actually working.  Without these tests, guardrails may create a false sense of security. Monitoring and telemetry are the final layer: you must log guardrail decisions, measure false positives and negatives, and track drift over time. Guardrails AI supports observability integration via OpenTelemetry, which can feed monitoring dashboards for guardrail KPIs.  The control plane is therefore a loop: policies drive controls, controls generate evidence, and evidence updates policies. This loop is the only sustainable way to manage guardrails in production.

How this fits into the project

You will apply this control-plane model directly in §5.4 and §5.11 and validate it in §6.

Definitions & key terms

Control plane: The policy-driven layer that decides what an agent may do.
Detector: A model or rule that assigns risk categories or scores. 
Validator: A structured check that enforces schema or constraints. 
Tool gating: Permissions and constraints for tool execution.
Evaluation suite: A set of tests that measure guardrails effectiveness. 

Mental model diagram

Policy   ->  Detectors  ->  Validators  ->  Tool Gate  ->  Output
  ^              |              |              |             |
  |              v              v              v             v
Evidence <-    Logs     <-  Thresholds  <-  Decisions  <-  Monitoring

How it works (step-by-step)

Define policy risks and acceptable thresholds.
Select detectors and validators aligned to those risks.
Normalize detector outputs and enforce schema rules.
Apply tool permissions based on risk and context.
Log decisions and run evaluation suites continuously.

Minimal concrete example

Guardrail Decision Record
- input_source: retrieved_doc
- detector: prompt_injection
- score: 0.84
- policy_action: block
- tool_gate: deny
- audit_id: 2026-01-03-0001

Common misconceptions

“A single framework solves guardrails end-to-end.”
“Moderation is enough to prevent prompt injection.”
“Validation guarantees correctness without policy.”

Check-your-understanding questions

Why is a policy layer required in addition to detectors?
How do validators reduce tool misuse risk?
Why is evaluation necessary even if detectors exist?

Check-your-understanding answers

Detectors provide signals, but policy decides actions and thresholds.
Validators ensure structured, safe tool inputs before execution.
Detectors can fail or drift; evaluation reveals blind spots.

Real-world applications

Enterprise assistants with access to sensitive data
RAG systems ingesting third-party documents
Autonomous workflows with high-impact tools

Where you’ll apply it

See §5.4 and §6 in this file.
Also used in: P03-content-safety-gate.md, P08-policy-router-orchestrator.md.

References

NIST AI RMF 1.0. 
ISO/IEC 42001:2023 AI Management Systems. 
OWASP LLM Top 10 v1.1. 
Guardrails AI framework. 
NeMo Guardrails and Colang. 
Prompt Guard model card. 
Llama Guard documentation. 
garak LLM scanner. 
OpenAI Evals. 

Key insights

Guardrails are a control plane, not a single model or API.

Summary

A layered control plane combines policy, detection, validation, and evaluation into a continuous safety loop.

Homework/Exercises to practice the concept

Draft a policy map with three risks and a detector for each.
Define a monitoring dashboard with three guardrail KPIs.

Solutions to the homework/exercises

Example risks: injection, data leakage, tool misuse; KPIs: block rate, false positives, tool denial rate.

3. Project Specification

3.1 What You Will Build

A CLI firewall that classifies inputs, assigns risk scores, and outputs allow/block/review decisions.

3.2 Functional Requirements

Accept input with source labels (user/retrieved)
Run at least one detector (Prompt Guard or Lakera) 
Apply thresholds to decide allow/block/review
Log decisions with timestamps

3.3 Non-Functional Requirements

Performance: Single input decision under 1 second
Reliability: Consistent decisions for identical inputs
Usability: Readable risk report for operators

3.4 Example Usage / Output

$ guardrail-firewall scan --source user --policy strict
Decision: BLOCK
Reason: jailbreak_score=0.91

3.5 Data Formats / Schemas / Protocols

Decision record JSON: {source, detector, score, action, policy_id}
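One way to represent this record is a frozen dataclass that validates its fields and serializes with sorted keys. This is a sketch only: the field names come from the schema above, while the class name, the allowed action strings, and the example `policy_id` value are assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecisionRecord:
    source: str      # "user" or "retrieved"
    detector: str
    score: float     # normalized risk in [0, 1]
    action: str      # "allow", "review", or "block"
    policy_id: str

    def __post_init__(self):
        if self.action not in {"allow", "review", "block"}:
            raise ValueError(f"unknown action: {self.action}")
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must be in [0, 1]")

    def to_json(self) -> str:
        # sort_keys keeps the serialized form stable for replay comparisons.
        return json.dumps(asdict(self), sort_keys=True)


record = DecisionRecord("retrieved", "prompt_injection", 0.84, "block", "strict")
print(record.to_json())
```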

3.6 Edge Cases

Long prompts with nested instructions
Mixed-language inputs
Empty or null input

3.7 Real World Outcome

This section is the golden reference. You will compare your output against it.

3.7.1 How to Run (Copy/Paste)

Export detector API key if needed
Run firewall CLI with --source and --policy
Review decision logs

3.7.2 Golden Path Demo (Deterministic)

Run a fixed input set with a fixed policy and compare against a stored expected output file.
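A replay check of this kind can be as simple as diffing the run's output against the stored file. The sketch below assumes the expected output is a UTF-8 text file with one line per output line; `compare_to_golden` and the file name are hypothetical, and the temporary directory is only there to keep the example self-contained.

```python
import difflib
import tempfile
from pathlib import Path


def compare_to_golden(actual_lines: list[str], golden_path: Path) -> list[str]:
    """Return a unified diff; an empty list means the run matches the golden file."""
    expected = golden_path.read_text(encoding="utf-8").splitlines()
    return list(difflib.unified_diff(expected, actual_lines, "expected", "actual", lineterm=""))


with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "expected.txt"
    golden.write_text("Action: BLOCK\nRisk: injection (0.84)\n", encoding="utf-8")
    ok = compare_to_golden(["Action: BLOCK", "Risk: injection (0.84)"], golden)
    drift = compare_to_golden(["Action: ALLOW", "Risk: injection (0.84)"], golden)

print(len(ok), bool(drift))
```

For the comparison to be meaningful, anything that varies between runs (timestamps, audit IDs) has to be fixed or excluded from the compared output.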

3.7.3 If CLI: Exact Terminal Transcript (Success)

$ guardrail-firewall scan --source retrieved --policy strict
Input: "Ignore previous instructions"
Action: BLOCK
Risk: injection (0.84)

3.7.4 Failure Demo (Deterministic)

$ guardrail-firewall scan --source retrieved --policy strict
ERROR: detector unavailable
Action: FAIL

Exit codes: 0 on allow/block, 3 on detector failure
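The exit-code contract can be sketched as a thin wrapper around the scan. The codes 0 and 3 are taken from the line above; `DetectorUnavailable`, `run_scan`, and `broken_scan` are hypothetical names used only for illustration.

```python
EXIT_OK = 0                # documented: allow or block
EXIT_DETECTOR_FAILURE = 3  # documented: detector failure


class DetectorUnavailable(Exception):
    """Hypothetical error raised by a detector adapter when its backend cannot be reached."""


def run_scan(scan) -> int:
    """Run a scan callable and translate its outcome into a process exit code."""
    try:
        action = scan()
    except DetectorUnavailable:
        # Fail closed: report the failure instead of silently allowing the input.
        print("ERROR: detector unavailable")
        print("Action: FAIL")
        return EXIT_DETECTOR_FAILURE
    print(f"Action: {action}")
    return EXIT_OK


def broken_scan():
    raise DetectorUnavailable()


code = run_scan(broken_scan)
print("exit code:", code)  # a real CLI would pass this value to sys.exit()
```

Returning the code from a function, rather than calling `sys.exit()` deep inside the scan logic, keeps the behavior easy to assert on in unit tests.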

4. Solution Architecture

The architecture centers on a detector adapter, a policy engine, and a logging sink.

4.1 High-Level Design

┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Input │────▶│ Policy │────▶│ Output │
│ Handler │ │ Engine │ │ Reporter │
└─────────────┘ └─────────────┘ └─────────────┘

4.2 Key Components

Component	Responsibility	Key Decisions
Detector Adapter	Calls Prompt Guard/Lakera APIs	Normalize scores
Policy Engine	Applies thresholds	Deterministic rules
Audit Logger	Stores decisions	Immutable logs

4.3 Data Structures (No Full Code)

Decision record fields: input_hash, source_label, detector_scores, policy_action, timestamp

4.4 Algorithm Overview

Normalize input
Run detector
Apply thresholds
Emit decision record
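The four steps above can be sketched end to end as follows. This is not a reference implementation: the NFKC-plus-whitespace normalization, the threshold values 0.8 and 0.5, and the keyword-based stub detector are all assumptions chosen to keep the example runnable without an API key.

```python
import hashlib
import unicodedata
from datetime import datetime, timezone
from typing import Callable


def normalize_input(text: str) -> str:
    """Apply Unicode NFKC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", text).split())


def scan(text: str, source: str, detector: Callable[[str], float],
         block_at: float = 0.8, review_at: float = 0.5) -> dict:
    """Normalize, score, apply thresholds, and return a decision record."""
    clean = normalize_input(text)
    score = detector(clean)
    if score >= block_at:
        action = "block"
    elif score >= review_at:
        action = "review"
    else:
        action = "allow"
    return {
        "input_hash": hashlib.sha256(clean.encode("utf-8")).hexdigest(),
        "source_label": source,
        "detector_scores": {"prompt_injection": score},
        "policy_action": action,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }


def keyword_stub(text: str) -> float:
    """Stand-in detector for local testing; not a real injection classifier."""
    return 0.9 if "ignore previous instructions" in text.lower() else 0.1


record = scan("Ignore   previous\ninstructions", "retrieved", keyword_stub)
print(record["policy_action"])
```

Passing the detector in as a callable lets tests swap the real adapter for a stub. The `timestamp` field changes on every run, so replay comparisons need to exclude it or inject a fixed clock.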

5. Implementation Guide

5.1 Development Environment Setup

export DETECTOR_KEY=...

5.2 Project Structure

project/
├── cli/
├── policies/
├── logs/
└── tests/

5.3 The Core Question You’re Answering

“How do I detect and safely handle prompt injection before it reaches the model?”

Answering it forces you to make explicit policy decisions for high-risk input.

5.4 Concepts You Must Understand First

Prompt injection vs jailbreak 
Threshold calibration (false positives vs negatives)

5.5 Questions to Guide Your Design

Source labeling
- How will you label input origin?
- How will source influence thresholds?
Decisioning
- What triggers block vs review?

5.6 Thinking Exercise

Three Inputs, Three Decisions

Design decisions for benign, ambiguous, and malicious inputs.

Questions to answer:

Which should be blocked?
Which should be routed?

5.7 The Interview Questions They’ll Ask

“How do you distinguish injection from jailbreak?”
“How do you set detection thresholds?”
“Why is source labeling important?”
“How do you handle detector failure?”
“What is an acceptable false positive rate?”

5.8 Hints in Layers

Hint 1: Start with strict policy. Block on high risk.

Hint 2: Add a review tier. Medium risk goes to review.

Hint 3: Log decisions. Store all decisions for calibration.

Hint 4: Add replay tests. Use deterministic input sets.

5.9 Books That Will Help

Topic	Book	Chapter
Injection detection	Prompt Guard model card	Overview 
Injection API	Lakera Guard docs	Integration 

5.10 Implementation Phases

Phase 1: Detector Integration (2-3 days)

Goals: connect to detection API. Tasks: build adapter, normalize scores. Checkpoint: detector returns results.

Phase 2: Policy Engine (2-3 days)

Goals: define thresholds and actions. Tasks: implement policy table, decision logs. Checkpoint: deterministic decisions.

Phase 3: Calibration (2-4 days)

Goals: measure false positives. Tasks: run benign dataset tests. Checkpoint: thresholds tuned.
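The calibration measurement in Phase 3 reduces to counting how many benign inputs would be blocked at a given threshold. In the sketch below the benign scores are made-up numbers for illustration, and `false_positive_rate` is a hypothetical helper name.

```python
def false_positive_rate(benign_scores: list[float], block_at: float) -> float:
    """Fraction of benign inputs that would be blocked at this threshold."""
    if not benign_scores:
        raise ValueError("benign set is empty")
    blocked = sum(1 for s in benign_scores if s >= block_at)
    return blocked / len(benign_scores)


# Made-up scores for ten benign prompts, for illustration only.
benign = [0.02, 0.05, 0.11, 0.08, 0.62, 0.03, 0.85, 0.07, 0.40, 0.01]
for threshold in (0.5, 0.8, 0.9):
    print(threshold, false_positive_rate(benign, threshold))
```

With only ten benign prompts, the measured rate can move only in steps of 10%, so a larger benign set gives a more useful picture when tuning thresholds.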

5.11 Key Implementation Decisions

Decision	Options	Recommendation	Rationale
Thresholds	strict vs lenient	strict for retrieved data	higher risk
Action	block vs review	block for high scores	reduce leakage
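The recommendation to treat retrieved data more strictly can be expressed as a per-source threshold table. The numeric values below are placeholders, not recommended settings, and `POLICY` and `decide` are hypothetical names.

```python
# Hypothetical threshold values; calibrate them against your own benign and attack sets.
POLICY = {
    "user":      {"review_at": 0.60, "block_at": 0.85},
    "retrieved": {"review_at": 0.40, "block_at": 0.70},  # stricter: untrusted third-party text
}


def decide(source: str, score: float) -> str:
    """Map a normalized risk score to an action using source-specific thresholds."""
    try:
        limits = POLICY[source]
    except KeyError:
        # Unknown origins get the strictest treatment rather than a default allow.
        limits = POLICY["retrieved"]
    if score >= limits["block_at"]:
        return "block"
    if score >= limits["review_at"]:
        return "review"
    return "allow"


for source in ("user", "retrieved"):
    print(source, decide(source, 0.75))
```

The same score of 0.75 is routed to review for direct user input but blocked for retrieved content, which is the behavior the table above recommends.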

6. Testing Strategy

6.1 Test Categories

Category	Purpose	Examples
Unit Tests	Validate core logic	Input classification, schema checks
Integration Tests	Validate end-to-end flow	Full guardrail pipeline
Edge Case Tests	Validate unusual inputs	Long prompts, empty inputs

6.2 Critical Test Cases

Malicious input: must block
Benign input: must allow
Detector failure: must return error

6.3 Test Data

Test set: 10 benign prompts, 10 injection prompts

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall	Symptom	Solution
Over-blocking	Many safe inputs blocked	Recalibrate thresholds
Under-blocking	Attacks slip through	Add second detector

7.2 Debugging Strategies

Inspect decision logs to see which rule triggered a block.
Replay deterministic test cases to reproduce failures.

7.3 Performance Traps

Avoid running multiple detectors serially when parallelization is possible.
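A minimal sketch of running detectors concurrently with the standard library follows. The two detectors are stand-ins that sleep to simulate network latency; their names and scores are invented, and the one-second `timeout` is an assumption that echoes the latency budget in §3.3.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_detector_a(text: str) -> float:
    time.sleep(0.2)  # simulates network latency of a remote detector
    return 0.30


def slow_detector_b(text: str) -> float:
    time.sleep(0.2)
    return 0.84


def run_parallel(text: str, detectors: dict) -> dict:
    """Call every detector concurrently and collect scores by detector name."""
    with ThreadPoolExecutor(max_workers=len(detectors)) as pool:
        futures = {name: pool.submit(fn, text) for name, fn in detectors.items()}
        return {name: future.result(timeout=1.0) for name, future in futures.items()}


start = time.perf_counter()
scores = run_parallel("some input", {"a": slow_detector_a, "b": slow_detector_b})
elapsed = time.perf_counter() - start
print(scores, f"{elapsed:.2f}s")
```

Because the calls overlap, total latency should be close to the slowest detector rather than the sum of both. Threads suit this case because the work is I/O-bound. Leaving the `with` block still waits for running threads to finish, so a hung detector also needs a timeout on the underlying network call.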

8. Extensions & Challenges

8.1 Beginner Extensions

Add a warning-only policy mode
Support multilingual detection

8.2 Intermediate Extensions

Add Rebuff vector memory 
Add confidence-based escalation

8.3 Advanced Extensions

Integrate with policy router
Build a real-time dashboard

9. Real-World Connections

9.1 Industry Applications

RAG assistants: Blocks injected instructions in retrieved data
Customer support bots: Stops jailbreak attempts

9.2 Related Tools

Prompt Guard: Injection detection model
Lakera Guard: Injection detection API 

9.3 Interview Relevance

Input validation: Designing safe input pipelines
Risk thresholds: Balancing false positives

10. Resources

10.1 Essential Reading

Prompt Guard model card. 
Lakera Guard docs. 

10.2 Video Resources

Prompt injection attack demos
Security talks on LLM injection

10.3 Tools & Documentation

Prompt Guard documentation. 
Lakera Guard API docs. 

10.4 Related Projects

Project 5: RAG Sanitization & Provenance
Project 8: Policy Router Orchestrator

11. Self-Assessment Checklist

11.1 Understanding

I can explain the control plane layers without notes
I can justify every policy threshold used
I understand the main failure modes of this guardrail

11.2 Implementation

All functional requirements are met
All critical test cases pass
Edge cases are handled

11.3 Growth

I documented lessons learned
I can explain this project in an interview

12. Submission / Completion Criteria

Minimum Viable Completion:

Detector integrated and functioning
Policy thresholds defined
Decision logs generated

Full Completion:

All minimum criteria plus:
False positive rate measured
Replay tests pass

Excellence (Going Above & Beyond):

Multi-detector ensemble used
Dashboard for risk metrics