Project 01: Threat Model Your Agent

Build a complete threat model and guardrails map for an AI agent system, including assets, trust boundaries, abuse cases, and control placement.

Quick Reference

Attribute Value
Difficulty Level 2
Time Estimate 6-10 hours
Main Programming Language Markdown
Alternative Programming Languages N/A
Coolness Level 3
Business Potential 4
Prerequisites Basic security concepts; ability to diagram systems
Key Topics threat modeling, OWASP LLM Top 10, policy mapping

1. Learning Objectives

By completing this project, you will:

  1. Identify assets, boundaries, and abuse cases
  2. Map OWASP risk categories to concrete controls
  3. Define acceptance criteria and evidence for each control
  4. Produce a reviewable threat model document

2. All Theory Needed (Per-Concept Breakdown)

Guardrails Control Plane Fundamentals

Fundamentals A guardrails control plane is the set of policies, detectors, validators, and decision rules that sit around an LLM agent to ensure it behaves safely and predictably. Unlike traditional input validation, guardrails must handle probabilistic outputs, ambiguous intent, and adversarial prompts. The control plane therefore spans the entire lifecycle of an interaction: input filtering, context validation (including RAG sources), output moderation, and tool-use permissioning. Frameworks such as Guardrails AI and NeMo Guardrails provide structured validation and dialogue control, while models like Prompt Guard or Llama Guard provide detection signals that must be interpreted by policy.  The core insight is that no single framework enforces safety end-to-end; you must compose multiple controls and define how they interact.

Deep Dive into the concept A control plane begins with policy: a formal definition of what is allowed and why. Governance frameworks like the NIST AI RMF and ISO/IEC 42001 provide the organizational structure for this policy layer, while OWASP LLM Top 10 provides a security taxonomy for risks.  Policies must be translated into guardrail rules that are actionable: detect prompt injection in untrusted data, validate output schemas before tools execute, and block unsafe categories. This translation is non-trivial because LLM outputs are probabilistic and context-sensitive. A policy such as “never leak secrets” must be expressed as a chain of checks: input scanning for malicious prompts, context segmentation for untrusted data, output moderation to catch leakage, and tool gating to prevent exfiltration. Each check introduces uncertainty and cost, which means policy must include thresholds, confidence levels, and escalation paths.

Detectors such as Prompt Guard, Lakera Guard, or Rebuff provide risk scores and categories for injection attempts.  These detectors are probabilistic and therefore require calibration. The control plane must normalize detector outputs into a shared risk scale, define what “block” vs “review” means, and log the decision context for later auditing. Output guardrails such as Llama Guard or OpenAI moderation detect unsafe content in generated responses.  These checks must be aligned to your own taxonomy; the model’s categories may not map exactly to your policy. This is why evaluation and red-teaming are crucial: without testing, you do not know if your thresholds or taxonomy mappings are effective.

Structured output validation adds determinism to a probabilistic system. Guardrails AI uses validators and schema checks to ensure outputs conform to an expected structure, enabling safer tool calls and data extraction.  NeMo Guardrails extends this by introducing Colang, a flow language that constrains dialogue paths and allows explicit safety steps, such as mandatory disclaimers or confirmation prompts.  These frameworks provide building blocks, but they do not decide how to integrate them into a business context. For example, a schema validator can ensure a tool call is syntactically correct, but only policy can decide if that tool call should be allowed at all. This is why tool permissioning and sandboxing are critical complement pieces that most guardrails frameworks do not provide natively.

Evaluation is the evidence layer. Tools like garak and OpenAI Evals allow you to run red-team tests and custom evaluation suites to measure whether guardrails are actually working.  Without these tests, guardrails may create a false sense of security. Monitoring and telemetry are the final layer: you must log guardrail decisions, measure false positives and negatives, and track drift over time. Guardrails AI supports observability integration via OpenTelemetry, which can feed monitoring dashboards for guardrail KPIs.  The control plane is therefore a loop: policies drive controls, controls generate evidence, and evidence updates policies. This loop is the only sustainable way to manage guardrails in production.

How this fit on projects

  • You will apply this control-plane model directly in §5.4 and §5.11 and validate it in §6.

Definitions & key terms

  • Control plane: The policy-driven layer that decides what an agent may do.
  • Detector: A model or rule that assigns risk categories or scores. 
  • Validator: A structured check that enforces schema or constraints. 
  • Tool gating: Permissions and constraints for tool execution.
  • Evaluation suite: A set of tests that measure guardrails effectiveness. 

Mental model diagram

Policy -> Detectors -> Validators -> Tool Gate -> Output
 ^ | | | |
 | v v v v
Evidence <- Logs <- Thresholds <- Decisions <- Monitoring

How it works (step-by-step)

  1. Define policy risks and acceptable thresholds.
  2. Select detectors and validators aligned to those risks.
  3. Normalize detector outputs and enforce schema rules.
  4. Apply tool permissions based on risk and context.
  5. Log decisions and run evaluation suites continuously.

Minimal concrete example

Guardrail Decision Record
- input_source: retrieved_doc
- detector: prompt_injection
- score: 0.84
- policy_action: block
- tool_gate: deny
- audit_id: 2026-01-03-0001

Common misconceptions

  • “A single framework solves guardrails end-to-end.”
  • “Moderation is enough to prevent prompt injection.”
  • “Validation guarantees correctness without policy.”

Check-your-understanding questions

  1. Why is a policy layer required in addition to detectors?
  2. How do validators reduce tool misuse risk?
  3. Why is evaluation necessary even if detectors exist?

Check-your-understanding answers

  1. Detectors provide signals, but policy decides actions and thresholds.
  2. Validators ensure structured, safe tool inputs before execution.
  3. Detectors can fail or drift; evaluation reveals blind spots.

Real-world applications

  • Enterprise assistants with access to sensitive data
  • RAG systems ingesting third-party documents
  • Autonomous workflows with high-impact tools

Where you’ll apply it

  • See §5.4 and §6 in this file.
  • Also used in: P02-prompt-injection-firewall.md, P03-content-safety-gate.md, P08-policy-router-orchestrator.md.

References

  • NIST AI RMF 1.0. 
  • ISO/IEC 42001:2023 AI Management Systems. 
  • OWASP LLM Top 10 v1.1. 
  • Guardrails AI framework. 
  • NeMo Guardrails and Colang. 
  • Prompt Guard model card. 
  • Llama Guard documentation. 
  • garak LLM scanner. 
  • OpenAI Evals. 

Key insights Guardrails are a control plane, not a single model or API.

Summary A layered control plane combines policy, detection, validation, and evaluation into a continuous safety loop.

Homework/Exercises to practice the concept

  • Draft a policy map with three risks and a detector for each.
  • Define a monitoring dashboard with three guardrail KPIs.

Solutions to the homework/exercises

  • Example risks: injection, data leakage, tool misuse; KPIs: block rate, false positives, tool denial rate.

3. Project Specification

3.1 What You Will Build

A threat model dossier with diagrams, a risk table, and a guardrail placement plan for a chosen agent workflow.

3.2 Functional Requirements

  1. Document assets, trust boundaries, and data flows
  2. Enumerate at least 10 abuse cases
  3. Map each abuse case to a guardrail control
  4. Define measurable tests for each control

3.3 Non-Functional Requirements

  • Performance: Not latency-sensitive; clarity and completeness matter
  • Reliability: Every risk has an owner and a mitigation
  • Usability: Readable by security and product stakeholders

3.4 Example Usage / Output

$ cat threat-model.md
[sections: assets, boundaries, abuse cases, controls, tests]

3.5 Data Formats / Schemas / Protocols

Document structure with headings: Assets, Boundaries, Abuse Cases, Controls, Tests, Owners

3.6 Edge Cases

  • Multiple tools with different privilege levels
  • Multi-tenant data access
  • Third-party data ingestion

3.7 Real World Outcome

This section is the golden reference. You will compare your output against it.

3.7.1 How to Run (Copy/Paste)

  • Open the project folder
  • Review the threat model template
  • Fill each section with system-specific details

3.7.2 Golden Path Demo (Deterministic)

Deterministic checklist review with all required sections completed and signed off by a peer reviewer.

3.7.3 If CLI: Exact Terminal Transcript (Success)

$ ls
architecture.md
threat-model.md
controls-map.md

$ grep -n "Abuse Case" threat-model.md
[10+ entries]

3.7.4 Failure Demo (Deterministic)

$ grep -n "Controls" threat-model.md
[0 entries]
ERROR: missing controls section

Exit codes: 0 on success, 2 if required sections are missing


4. Solution Architecture

The architecture is a documentation workflow: inputs are system diagrams and policies, outputs are a structured threat model and control map.

4.1 High-Level Design

┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Input │────▶│ Policy │────▶│ Output │
│ Handler │ │ Engine │ │ Reporter │
└─────────────┘ └─────────────┘ └─────────────┘

4.2 Key Components

Component Responsibility Key Decisions
Diagrammer Produces data flow and trust boundary diagrams Use consistent notation
Risk Table Lists threats and controls Must map to OWASP categories
Review Log Tracks sign-off and revisions Requires owner attribution

4.4 Data Structures (No Full Code)

Threat record fields: asset, boundary, abuse_case, control, test, owner, residual_risk

4.4 Algorithm Overview

  1. Collect system data flows
  2. Identify boundaries and assets
  3. Enumerate abuses
  4. Map controls
  5. Define tests

5. Implementation Guide

5.1 Development Environment Setup

mkdir threat-model && cd threat-model

5.2 Project Structure

project/
├── threat-model.md
├── controls-map.md
├── architecture.md
└── review-log.md

5.3 The Core Question You’re Answering

“Where can my agent be manipulated or misused, and what guardrails must exist at each boundary?”

The threat model defines guardrail placement before implementation begins.

5.4 Concepts You Must Understand First

  1. OWASP LLM Top 10 - identify risk categories. 
  2. NIST AI RMF - map risks to controls. 

5.5 Questions to Guide Your Design

  1. Boundaries
    • Where does untrusted data enter?
    • Where do tools execute?
  2. Controls
    • Which risks must be blocked vs monitored?

5.6 Thinking Exercise

Draw the Boundary Map

Sketch the system and mark every trust boundary.

Questions to answer:

  • Which boundary is most likely to be abused?
  • Which guardrail catches it earliest?

5.7 The Interview Questions They’ll Ask

  1. “How do you map OWASP risks to guardrails?”
  2. “What is a trust boundary in an agent system?”
  3. “How do you decide what to block vs monitor?”
  4. “Why is governance needed in guardrails?”
  5. “How do you define residual risk?”

5.8 Hints in Layers

Hint 1: Start with assets List what cannot be lost.

Hint 2: Map boundaries Identify where untrusted data enters.

Hint 3: Build abuse stories Write attacker narratives.

Hint 4: Assign controls Attach one guardrail per abuse case.


5.9 Books That Will Help

Topic Book Chapter
Risk taxonomy OWASP LLM Top 10 v1.1 
Governance NIST AI RMF 1.0 Govern/Map 

5.10 Implementation Phases

Phase 1: Asset & Boundary Mapping (2-3 hours)

Goals: identify assets and boundaries. Tasks: draw system diagram, list inputs. Checkpoint: diagram reviewed.

Phase 2: Abuse Case Enumeration (2-3 hours)

Goals: list threats. Tasks: write 10 abuse stories. Checkpoint: risk table complete.

Phase 3: Control Mapping (2-4 hours)

Goals: assign guardrails. Tasks: map controls to each risk. Checkpoint: controls table validated.

5.11 Key Implementation Decisions

Decision Options Recommendation Rationale
Taxonomy OWASP vs custom OWASP baseline Industry standard
Review cadence Monthly vs quarterly Monthly High change rate

6. Testing Strategy

6.1 Test Categories

Category Purpose Examples
Unit Tests Validate core logic Input classification, schema checks
Integration Tests Validate end-to-end flow Full guardrail pipeline
Edge Case Tests Validate unusual inputs Long prompts, empty outputs

6.2 Critical Test Cases

  1. Section completeness: All required headings exist
  2. Risk coverage: Each OWASP category has at least one control
  3. Review sign-off: Reviewer names are present

6.3 Test Data

Sample threat model template with placeholders filled

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall Symptom Solution
Missing boundaries Controls unclear Add boundary diagram
No tests Risk unmeasured Add test for each control

7.2 Debugging Strategies

  • Inspect decision logs to see which rule triggered a block.
  • Replay deterministic test cases to reproduce failures.

7.3 Performance Traps

Not performance-sensitive, but avoid overly verbose documents that hide key risks.


8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a second reviewer and compare risk assessment
  • Add a residual risk rating per threat

8.2 Intermediate Extensions

  • Map controls to compliance requirements
  • Create a threat model for a multi-agent system

8.3 Advanced Extensions

  • Create a risk heatmap with severity and likelihood
  • Link threat model to incident response playbooks

9. Real-World Connections

9.1 Industry Applications

  • Enterprise assistants: Used to scope guardrails before deployment
  • RAG knowledge systems: Defines boundaries for untrusted data ingestion
  • OWASP LLM Top 10: Risk taxonomy reference 
  • NIST AI RMF: Governance framework 

9.3 Interview Relevance

  • Threat modeling: Explaining boundary risks
  • Policy mapping: Translating risk to controls

10. Resources

10.1 Essential Reading

  • NIST AI RMF 1.0. 
  • OWASP LLM Top 10 v1.1. 

10.2 Video Resources

  • OWASP LLM Top 10 project talks
  • NIST AI RMF overview webinars

10.3 Tools & Documentation

  • OWASP LLM Top 10 documentation. 
  • NIST AI RMF documentation. 
  • Project 2: Prompt Injection Firewall
  • Project 10: Production Guardrails Blueprint

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain the control plane layers without notes
  • I can justify every policy threshold used
  • I understand the main failure modes of this guardrail

11.2 Implementation

  • All functional requirements are met
  • All critical test cases pass
  • Edge cases are handled

11.3 Growth

  • I documented lessons learned
  • I can explain this project in an interview

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Threat model document complete
  • At least 10 abuse cases listed
  • Controls mapped to each abuse case

Full Completion:

  • All minimum criteria plus:
  • Peer review completed
  • Acceptance criteria defined

Excellence (Going Above & Beyond):

  • Risk heatmap included
  • Incident response linkage added