Project 03: Content Safety Gate
Build a post-generation content safety gate that moderates outputs and enforces policy actions.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 3 |
| Time Estimate | 1-2 weeks |
| Main Programming Language | Python |
| Alternative Programming Languages | JavaScript, Rust |
| Coolness Level | 4 |
| Business Potential | 5 |
| Prerequisites | Moderation concepts, basic API usage |
| Key Topics | content moderation, taxonomy mapping, thresholds |
1. Learning Objectives
By completing this project, you will:
- Integrate an output moderation model
- Define category-specific thresholds
- Implement safe fallback responses
- Log moderation decisions
2. All Theory Needed (Per-Concept Breakdown)
Guardrails Control Plane Fundamentals
Fundamentals
A guardrails control plane is the set of policies, detectors, validators, and decision rules that sit around an LLM agent to ensure it behaves safely and predictably. Unlike traditional input validation, guardrails must handle probabilistic outputs, ambiguous intent, and adversarial prompts. The control plane therefore spans the entire lifecycle of an interaction: input filtering, context validation (including RAG sources), output moderation, and tool-use permissioning. Frameworks such as Guardrails AI and NeMo Guardrails provide structured validation and dialogue control, while models like Prompt Guard or Llama Guard provide detection signals that must be interpreted by policy. The core insight is that no single framework enforces safety end-to-end; you must compose multiple controls and define how they interact.
Deep Dive into the concept
A control plane begins with policy: a formal definition of what is allowed and why. Governance frameworks like the NIST AI RMF and ISO/IEC 42001 provide the organizational structure for this policy layer, while OWASP LLM Top 10 provides a security taxonomy for risks. Policies must be translated into guardrail rules that are actionable: detect prompt injection in untrusted data, validate output schemas before tools execute, and block unsafe categories. This translation is non-trivial because LLM outputs are probabilistic and context-sensitive. A policy such as “never leak secrets” must be expressed as a chain of checks: input scanning for malicious prompts, context segmentation for untrusted data, output moderation to catch leakage, and tool gating to prevent exfiltration. Each check introduces uncertainty and cost, which means policy must include thresholds, confidence levels, and escalation paths.
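To make this translation concrete, here is a minimal sketch of a policy map expressed as data. The risk names, check names, thresholds, and escalation targets are illustrative assumptions, not a prescribed format.

```python
# Hypothetical policy map: each risk is tied to a check, thresholds, and an
# escalation path. Names and values are illustrative only.
POLICY = {
    "prompt_injection": {
        "check": "input_scan",           # which control enforces this risk
        "block_threshold": 0.80,         # score at or above this -> block
        "review_threshold": 0.50,        # between review and block -> human review
        "escalation": "security_oncall",
    },
    "data_leakage": {
        "check": "output_moderation",
        "block_threshold": 0.70,
        "review_threshold": 0.40,
        "escalation": "privacy_team",
    },
    "tool_misuse": {
        "check": "tool_gate",
        "block_threshold": 0.60,
        "review_threshold": 0.30,
        "escalation": "platform_owner",
    },
}
```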
Detectors such as Prompt Guard, Lakera Guard, or Rebuff provide risk scores and categories for injection attempts. These detectors are probabilistic and therefore require calibration. The control plane must normalize detector outputs into a shared risk scale, define what “block” vs “review” means, and log the decision context for later auditing. Output guardrails such as Llama Guard or OpenAI moderation detect unsafe content in generated responses. These checks must be aligned to your own taxonomy; the model’s categories may not map exactly to your policy. This is why evaluation and red-teaming are crucial: without testing, you do not know if your thresholds or taxonomy mappings are effective.
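As a rough illustration of normalization and thresholding, the sketch below maps heterogeneous detector outputs onto a shared 0-1 risk scale and turns scores into actions. The detector names, field names (such as jailbreak_score), and cut-off values are assumptions made for the example, not documented APIs.

```python
# Minimal sketch of detector-score normalization. Output shapes and thresholds
# are assumptions; real detectors each return their own formats.

def normalize(detector_name: str, raw: dict) -> float:
    """Map heterogeneous detector outputs onto a shared 0..1 risk scale."""
    if detector_name == "prompt_guard":
        return float(raw.get("jailbreak_score", 0.0))   # assumed field name
    if detector_name == "keyword_rules":
        return 1.0 if raw.get("matched") else 0.0        # boolean rule -> 0 or 1
    return float(raw.get("score", 0.0))                  # generic fallback

def decide(risk: float, block_at: float = 0.8, review_at: float = 0.5) -> str:
    """Translate a normalized risk score into a policy action."""
    if risk >= block_at:
        return "block"
    if risk >= review_at:
        return "review"
    return "allow"

# Example: a rule detector flagged a match -> normalized risk 1.0 -> "block"
action = decide(normalize("keyword_rules", {"matched": True}))
```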
Structured output validation adds determinism to a probabilistic system. Guardrails AI uses validators and schema checks to ensure outputs conform to an expected structure, enabling safer tool calls and data extraction. NeMo Guardrails extends this by introducing Colang, a flow language that constrains dialogue paths and allows explicit safety steps, such as mandatory disclaimers or confirmation prompts. These frameworks provide building blocks, but they do not decide how to integrate them into a business context. For example, a schema validator can ensure a tool call is syntactically correct, but only policy can decide if that tool call should be allowed at all. This is why tool permissioning and sandboxing are critical complementary pieces that most guardrails frameworks do not provide natively.
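The distinction between validation and policy can be sketched in a few lines of plain Python; no framework API is assumed here, and the tool names, required fields, and allow-list are hypothetical.

```python
# Sketch of schema validation plus policy gating before a tool call.
REQUIRED_FIELDS = {"tool": str, "arguments": dict}
ALLOWED_TOOLS = {"search_docs", "summarize"}   # a policy decision, not a validator concern

def validate_tool_call(candidate: dict) -> tuple[bool, str]:
    # Validator: is the structure correct?
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(candidate.get(field), expected_type):
            return False, f"schema violation: missing or invalid '{field}'"
    # Policy: is this tool permitted at all?
    if candidate["tool"] not in ALLOWED_TOOLS:
        return False, f"policy violation: tool '{candidate['tool']}' not allowed"
    return True, "ok"

ok, reason = validate_tool_call({"tool": "delete_records", "arguments": {}})
# ok == False: the call is syntactically valid, but policy denies the tool
```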
Evaluation is the evidence layer. Tools like garak and OpenAI Evals allow you to run red-team tests and custom evaluation suites to measure whether guardrails are actually working. Without these tests, guardrails may create a false sense of security. Monitoring and telemetry are the final layer: you must log guardrail decisions, measure false positives and negatives, and track drift over time. Guardrails AI supports observability integration via OpenTelemetry, which can feed monitoring dashboards for guardrail KPIs. The control plane is therefore a loop: policies drive controls, controls generate evidence, and evidence updates policies. This loop is the only sustainable way to manage guardrails in production.
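A minimal sketch of the logging side of this loop follows, assuming an in-memory log sink and two illustrative KPIs; a production setup would more likely emit these records through OpenTelemetry or a logging pipeline rather than print them.

```python
# Sketch of decision logging and simple guardrail KPIs. Field names are illustrative.
import json
import time
from collections import Counter

DECISIONS = []          # in-memory stand-in for a real log sink

def log_decision(detector: str, score: float, action: str, audit_id: str) -> None:
    record = {
        "ts": time.time(),
        "detector": detector,
        "score": score,
        "action": action,
        "audit_id": audit_id,
    }
    DECISIONS.append(record)
    print(json.dumps(record))          # structured log line for later auditing

def kpis() -> dict:
    """Block rate and action mix; false-positive rate requires labeled data."""
    actions = Counter(d["action"] for d in DECISIONS)
    total = sum(actions.values()) or 1
    return {"block_rate": actions["block"] / total, "action_counts": dict(actions)}
```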
How this fits into the project
- You will apply this control-plane model directly in §5.4 and §5.11 and validate it in §6.
Definitions & key terms
- Control plane: The policy-driven layer that decides what an agent may do.
- Detector: A model or rule that assigns risk categories or scores.
- Validator: A structured check that enforces schema or constraints.
- Tool gating: Permissions and constraints for tool execution.
- Evaluation suite: A set of tests that measure guardrails effectiveness.
Mental model diagram
Policy -> Detectors -> Validators -> Tool Gate -> Output
   ^          |            |             |            |
   |          v            v             v            v
Evidence <- Logs  <-  Thresholds <-  Decisions <- Monitoring
How it works (step-by-step)
- Define policy risks and acceptable thresholds.
- Select detectors and validators aligned to those risks.
- Normalize detector outputs and enforce schema rules.
- Apply tool permissions based on risk and context.
- Log decisions and run evaluation suites continuously.
Minimal concrete example
Guardrail Decision Record
- input_source: retrieved_doc
- detector: prompt_injection
- score: 0.84
- policy_action: block
- tool_gate: deny
- audit_id: 2026-01-03-0001
Common misconceptions
- “A single framework solves guardrails end-to-end.”
- “Moderation is enough to prevent prompt injection.”
- “Validation guarantees correctness without policy.”
Check-your-understanding questions
- Why is a policy layer required in addition to detectors?
- How do validators reduce tool misuse risk?
- Why is evaluation necessary even if detectors exist?
Check-your-understanding answers
- Detectors provide signals, but policy decides actions and thresholds.
- Validators ensure structured, safe tool inputs before execution.
- Detectors can fail or drift; evaluation reveals blind spots.
Real-world applications
- Enterprise assistants with access to sensitive data
- RAG systems ingesting third-party documents
- Autonomous workflows with high-impact tools
Where you’ll apply it
- See §5.4 and §6 in this file.
- Also used in: P02-prompt-injection-firewall.md, P03-content-safety-gate.md, P08-policy-router-orchestrator.md.
References
- NIST AI RMF 1.0.
- ISO/IEC 42001:2023 AI Management Systems.
- OWASP LLM Top 10 v1.1.
- Guardrails AI framework.
- NeMo Guardrails and Colang.
- Prompt Guard model card.
- Llama Guard documentation.
- garak LLM scanner.
- OpenAI Evals.
Key insights
Guardrails are a control plane, not a single model or API.
Summary
A layered control plane combines policy, detection, validation, and evaluation into a continuous safety loop.
Homework/Exercises to practice the concept
- Draft a policy map with three risks and a detector for each.
- Define a monitoring dashboard with three guardrail KPIs.
Solutions to the homework/exercises
- Example risks: injection, data leakage, tool misuse; KPIs: block rate, false positives, tool denial rate.
3. Project Specification
3.1 What You Will Build
A content moderation gate that checks model outputs and blocks or rewrites unsafe content.
3.2 Functional Requirements
- Accept model outputs and run moderation (Llama Guard or OpenAI moderation)
- Apply policy thresholds per category
- Provide fallback responses
- Log category scores
3.3 Non-Functional Requirements
- Performance: Moderation check under 1 second
- Reliability: Consistent categorization across runs
- Usability: Clear user messaging when blocked
3.4 Example Usage / Output
$ safety-gate check --policy public-chat
Decision: BLOCK
Category: illegal_activity
3.5 Data Formats / Schemas / Protocols
Moderation record: {category, score, action, policy_id}
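For illustration, a record following this schema might look like the example below; the values are made up.

```python
# Example moderation record matching the schema above. Values are illustrative.
record = {
    "category": "violence",      # moderation category after taxonomy mapping
    "score": 0.91,               # detector confidence for that category
    "action": "block",           # policy decision: allow | warn | block
    "policy_id": "public-chat",  # which policy profile produced the decision
}
```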
3.6 Edge Cases
- Long multi-topic responses
- Mixed languages
- Responses with quoted unsafe content
3.7 Real World Outcome
This section is the golden reference. You will compare your output against it.
3.7.1 How to Run (Copy/Paste)
- Provide output text via CLI or file
- Apply policy profile
- Review decision logs
3.7.2 Golden Path Demo (Deterministic)
Run a fixed output set and compare block/allow outcomes to a stored expected file.
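One possible shape for this comparison is sketched below, assuming both the run output and the expected file are JSON lists of {id, action} objects; the file names are placeholders.

```python
# Sketch of a deterministic golden-path check against a stored expected file.
import json

def compare(actual_path: str = "runs/latest.json",
            expected_path: str = "tests/expected.json") -> list[str]:
    """Return a list of mismatches between actual and expected decisions."""
    with open(actual_path) as f:
        actual = {r["id"]: r["action"] for r in json.load(f)}
    with open(expected_path) as f:
        expected = {r["id"]: r["action"] for r in json.load(f)}
    return [
        f"{case_id}: expected {exp}, got {actual.get(case_id, 'MISSING')}"
        for case_id, exp in expected.items()
        if actual.get(case_id) != exp
    ]

if __name__ == "__main__":
    mismatches = compare()
    print("PASS" if not mismatches else "\n".join(mismatches))
```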
3.7.3 If CLI: Exact Terminal Transcript (Success)
$ safety-gate check --policy public-chat
Output: "How to make a harmful device"
Action: BLOCK
Category: violence
3.7.4 Failure Demo (Deterministic)
$ safety-gate check --policy public-chat
ERROR: moderation service unavailable
Action: FAIL
Exit codes: 0 on allow/block, 4 on moderation service failure
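A minimal sketch of how a CLI entry point could honor these exit codes; ModerationServiceError and run_gate are hypothetical names standing in for your own implementation.

```python
# Sketch of exit-code handling for the failure demo.
import sys

class ModerationServiceError(RuntimeError):
    """Raised when the moderation backend cannot be reached (hypothetical)."""

def run_gate() -> str:
    """Stub for the real gate; replace with the moderation + policy pipeline."""
    raise ModerationServiceError("moderation service unavailable")

def main() -> int:
    try:
        decision = run_gate()                     # "allow" or "block" when healthy
        print(f"Action: {decision.upper()}")
        return 0                                  # 0 on allow/block per the spec
    except ModerationServiceError as exc:
        print(f"ERROR: {exc}", file=sys.stderr)
        print("Action: FAIL", file=sys.stderr)
        return 4                                  # 4 on moderation service failure

if __name__ == "__main__":
    sys.exit(main())
```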
4. Solution Architecture
The architecture combines a moderation adapter, a policy decision layer, and a fallback response builder.
4.1 High-Level Design
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Input │────▶│ Policy │────▶│ Output │
│ Handler │ │ Engine │ │ Reporter │
└─────────────┘ └─────────────┘ └─────────────┘
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Moderation Adapter | Calls Llama Guard/OpenAI moderation | Category mapping |
| Policy Layer | Applies thresholds | Category-specific actions |
| Fallback Builder | Generates safe alternatives | UX consistency |
4.3 Data Structures (No Full Code)
Moderation decision fields: category, score, threshold, action, timestamp
4.4 Algorithm Overview
- Run moderation
- Map categories
- Apply thresholds
- Return action
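The four steps above might look roughly like the sketch below; call_moderation is a stub, and the taxonomy mapping and thresholds are illustrative rather than recommended values.

```python
# Minimal sketch of the gate algorithm: moderation -> mapping -> thresholds -> action.
TAXONOMY_MAP = {"S1": "violence", "S2": "illegal_activity"}   # model labels -> policy categories
THRESHOLDS = {"violence": 0.5, "illegal_activity": 0.5}        # block at or above

def call_moderation(text: str) -> dict:
    """Stub: replace with Llama Guard or the OpenAI moderation endpoint."""
    return {"S1": 0.92} if "harmful device" in text else {}

def gate(text: str) -> dict:
    scores = call_moderation(text)                       # 1. run moderation
    decision = {"action": "allow", "category": None}
    for raw_label, score in scores.items():
        category = TAXONOMY_MAP.get(raw_label, "other")  # 2. map categories
        if score >= THRESHOLDS.get(category, 1.0):       # 3. apply thresholds
            decision = {"action": "block", "category": category, "score": score}
    return decision                                      # 4. return action

print(gate("How to make a harmful device"))   # {'action': 'block', 'category': 'violence', ...}
```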
5. Implementation Guide
5.1 Development Environment Setup
export MODERATION_KEY=...
5.2 Project Structure
project/
├── moderation/
├── policies/
├── logs/
└── tests/
5.3 The Core Question You’re Answering
“How do I ensure outputs comply with policy even if inputs are benign?”
Moderation protects against unsafe generation and policy drift.
5.4 Concepts You Must Understand First
- Content moderation models
- Category thresholding
5.5 Questions to Guide Your Design
- Category actions
  - Which categories are block-only?
  - Which allow with warning?
- Fallbacks
  - Do you rewrite or refuse?
5.6 Thinking Exercise
Threshold Design
Pick thresholds for a kids' app versus an internal tool.
Questions to answer:
- Which category is strictest?
- What is acceptable false positive rate?
5.7 The Interview Questions They’ll Ask
- “Why is output moderation necessary?”
- “How do you map taxonomy to policy?”
- “How do you handle borderline cases?”
- “What is the cost of over-blocking?”
- “How do you test moderation?”
5.8 Hints in Layers
Hint 1: Start with a policy matrix. Define category actions (a minimal sketch follows these hints).
Hint 2: Log raw scores. Keep evidence for calibration.
Hint 3: Add safe fallbacks. Provide alternative responses.
Hint 4: Re-test after changes. Run deterministic suites.
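A hypothetical policy matrix for Hint 1 could be as simple as a nested dictionary keyed by policy profile; the categories, actions, and thresholds shown are placeholders to be replaced during calibration.

```python
# Hypothetical policy matrix: moderation categories mapped to actions per profile.
POLICY_MATRIX = {
    "public-chat": {
        "violence":         {"action": "block",   "threshold": 0.40},
        "illegal_activity": {"action": "block",   "threshold": 0.40},
        "profanity":        {"action": "rewrite", "threshold": 0.60},
    },
    "internal-tool": {
        "violence":         {"action": "block", "threshold": 0.70},
        "illegal_activity": {"action": "block", "threshold": 0.60},
        "profanity":        {"action": "warn",  "threshold": 0.90},
    },
}
```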
5.9 Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Output moderation | Llama Guard documentation | Overview |
| Moderation APIs | OpenAI moderation docs | Overview |
5.10 Implementation Phases
Phase 1: Moderation Integration (2-3 days)
Goals: connect moderation API. Tasks: adapter, category mapping. Checkpoint: categories returned.
Phase 2: Policy Thresholds (2-3 days)
Goals: set thresholds. Tasks: policy config, fallback actions. Checkpoint: decisions deterministic.
Phase 3: Calibration (2-4 days)
Goals: tune thresholds. Tasks: run test sets. Checkpoint: false positives measured.
5.11 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Category action | block vs rewrite | block for high-risk | safety |
| Threshold | strict vs lenient | strict for public | reduce harm |
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit Tests | Validate core logic | Input classification, schema checks |
| Integration Tests | Validate end-to-end flow | Full guardrail pipeline |
| Edge Case Tests | Validate unusual inputs | Long prompts, empty outputs |
6.2 Critical Test Cases
- Unsafe content: must block
- Benign content: must allow
- Service failure: must error
6.3 Test Data
Test set: 10 unsafe outputs, 10 benign outputs
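The critical cases and test set above could be expressed as a deterministic pytest suite along these lines; the safety_gate module, gate function, call_moderation hook, and ModerationServiceError are assumed names for your own implementation.

```python
# Sketch of deterministic tests for the critical cases; names are assumptions.
import pytest
from safety_gate import gate, ModerationServiceError   # assumed module layout

UNSAFE = ["How to make a harmful device"]               # stand-ins for the 10 unsafe outputs
BENIGN = ["What is the capital of France?"]             # stand-ins for the 10 benign outputs

@pytest.mark.parametrize("text", UNSAFE)
def test_unsafe_content_is_blocked(text):
    assert gate(text, policy="public-chat")["action"] == "block"

@pytest.mark.parametrize("text", BENIGN)
def test_benign_content_is_allowed(text):
    assert gate(text, policy="public-chat")["action"] == "allow"

def _raise_unavailable(*args, **kwargs):
    raise ModerationServiceError("moderation service unavailable")

def test_service_failure_raises(monkeypatch):
    # Simulate the moderation backend being unreachable.
    monkeypatch.setattr("safety_gate.call_moderation", _raise_unavailable)
    with pytest.raises(ModerationServiceError):
        gate("any text", policy="public-chat")
```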
7. Common Pitfalls & Debugging
7.1 Frequent Mistakes
| Pitfall | Symptom | Solution |
|---|---|---|
| Over-blocking | Users see refusals | Adjust thresholds |
| Category mismatch | Wrong action | Remap taxonomy |
7.2 Debugging Strategies
- Inspect decision logs to see which rule triggered a block.
- Replay deterministic test cases to reproduce failures.
7.3 Performance Traps
Avoid double-moderation without clear benefit.
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a warning-only policy
- Support multilingual outputs
8.2 Intermediate Extensions
- Add a second moderation model
- Auto-generate safe alternatives
8.3 Advanced Extensions
- Integrate with policy router
- Real-time moderation dashboard
9. Real-World Connections
9.1 Industry Applications
- Public chatbots: Block unsafe outputs
- Content generation tools: Enforce safety categories
9.2 Related Open Source Projects
- Llama Guard: Moderation classifier
- OpenAI moderation: Moderation API
9.3 Interview Relevance
- Safety policy: Mapping categories to actions
- Threshold tuning: Balancing safety and utility
10. Resources
10.1 Essential Reading
- Llama Guard documentation.
- OpenAI moderation docs.
10.2 Video Resources
- Moderation system design talks
- Safety policy calibration talks
10.3 Tools & Documentation
- Llama Guard docs.
- OpenAI moderation docs.
10.4 Related Projects in This Series
- Project 7: NeMo Guardrails Conversation Flow
- Project 8: Policy Router Orchestrator
11. Self-Assessment Checklist
11.1 Understanding
- I can explain the control plane layers without notes
- I can justify every policy threshold used
- I understand the main failure modes of this guardrail
11.2 Implementation
- All functional requirements are met
- All critical test cases pass
- Edge cases are handled
11.3 Growth
- I documented lessons learned
- I can explain this project in an interview
12. Submission / Completion Criteria
Minimum Viable Completion:
- Moderation integrated
- Policy thresholds defined
- Fallback responses implemented
Full Completion:
- All minimum criteria plus:
- Calibration complete
- Decision logs audited
Excellence (Going Above & Beyond):
- Dual-model moderation
- User feedback loop