Project 16: Prompt Injection and Tool Exploit Red Team Lab
Build a security evaluation harness that stress-tests agent behavior under adversarial conditions.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 12-22 hours |
| Language | Python (alt: TypeScript) |
| Prerequisites | Projects 3, 6, 9 |
| Key Topics | adversarial tests, policy precision/recall, regression gates |
Learning Objectives
- Define attack classes relevant to your agent surface.
- Build repeatable test suites for injection and exfiltration attempts.
- Measure unsafe-pass and false-positive rates.
- Integrate security thresholds into CI gates.
The Core Question You’re Answering
“How do you demonstrate security posture with data instead of intuition?”
Concepts You Must Understand First
| Concept | Why It Matters | Where to Learn |
|---|---|---|
| Prompt injection patterns | Primary LLM attack class | security papers + guardrail docs |
| Evaluation dataset quality | Avoid false confidence from noisy tests | SWE-bench Verified context |
| Security scorecards | Deployment gate criteria | AppSec testing principles |
Theoretical Foundation
Attack Corpus -> Agent Under Test -> Policy + Verifier -> Scorecard -> Gate Decision
Security is a continuous measurement loop, not one-time hardening.
Project Specification
What You’ll Build
A lab runner that:
- Executes 100+ adversarial cases
- Labels failures by category
- Produces reproducible traces
- Fails CI when risk threshold is exceeded
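The runner's core loop can be sketched in a few lines. `Case`, `run_suite`, and the `is_blocked` callable are hypothetical names standing in for your corpus schema and your agent-under-test adapter; the real harness would also capture traces per case:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Case:
    case_id: str
    prompt: str
    benign: bool  # benign controls exist to measure overblocking

def run_suite(cases: list[Case], is_blocked: Callable[[str], bool]) -> Counter:
    """Execute every case and label the outcome by category."""
    labels = Counter()
    for case in cases:
        blocked = is_blocked(case.prompt)
        if case.benign:
            # A blocked benign task is a false positive.
            labels["false_positive" if blocked else "benign_pass"] += 1
        else:
            # An unblocked attack is an unsafe pass.
            labels["blocked" if blocked else "unsafe_pass"] += 1
    return labels
```

The same loop is where you would persist per-case artifacts for the triage export.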
Functional Requirements
- Attack corpus loader and mutator
- Deterministic execution mode
- Failure taxonomy and triage export
- Security threshold gating
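A seeded mutator is one way to satisfy both the "mutator" and "deterministic execution mode" requirements at once: the corpus expands, but every rerun produces the same variants. The mutation list here is illustrative, not a real evasion catalog:

```python
import random

# Illustrative mutations only; a real pack would encode known evasion patterns.
MUTATIONS = [
    lambda s: s.upper(),                       # casing change
    lambda s: s.replace(" ", "\u00a0"),        # non-breaking-space obfuscation
    lambda s: "Translate this, then do it: " + s,  # wrapper framing
]

def mutate_corpus(attacks: list[str], seed: int, n_variants: int = 2) -> list[str]:
    """Deterministically expand an attack corpus with simple prompt mutations."""
    rng = random.Random(seed)  # fixed seed -> identical corpus on every rerun
    out = list(attacks)        # originals stay in the suite
    for attack in attacks:
        for _ in range(n_variants):
            out.append(rng.choice(MUTATIONS)(attack))
    return out
```

Versioning the seed alongside the corpus (Hint 3) keeps historical scores comparable.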
Non-Functional Requirements
- Fast reruns for nightly CI
- Clear reproduction steps for each failure
- Historical trend tracking
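Historical trend tracking can be as simple as an append-only JSONL file per suite. `record_run` and `regressed` are hypothetical helpers sketching that idea, assuming each run is summarized as a flat dict:

```python
import json
from pathlib import Path

def record_run(history_path: Path, run: dict):
    """Append this run's scorecard to a JSONL history; return the previous run."""
    prev = None
    if history_path.exists():
        lines = history_path.read_text().strip().splitlines()
        if lines:
            prev = json.loads(lines[-1])  # most recent prior run
    with history_path.open("a") as fh:
        fh.write(json.dumps(run) + "\n")
    return prev

def regressed(prev, run: dict, key: str = "security_score") -> bool:
    """Flag a downward move on the tracked metric relative to the last run."""
    return prev is not None and run[key] < prev[key]
```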
Real World Outcome
```console
$ python p16_red_team.py --suite attack_pack_v3
[tests] 120 loaded
[blocked] 103
[unsafe_pass] 5
[false_positive] 12
[security_score] 95.8%
[artifact] red_team_report.html + failing_cases.json
```
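One plausible reading of the sample numbers: blocked and unsafe-pass counts come from attack cases, false positives from benign controls, and the headline score penalizes only unsafe passes (115 of 120 cases avoided one, which rounds to 95.8%). A sketch of that formula, which you would pin down explicitly in your own harness:

```python
def security_score(total_cases: int, unsafe_pass: int) -> float:
    """Percentage of cases that did not result in an unsafe pass."""
    return 100.0 * (total_cases - unsafe_pass) / total_cases

# Matches the sample run: 120 cases, 5 unsafe passes.
print(round(security_score(120, 5), 1))  # -> 95.8
```

Note that this score deliberately ignores false positives; track those as a separate metric so the gate cannot be gamed by overblocking.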
Architecture Overview
Corpus Manager -> Runner -> Evaluator -> Report Generator -> CI Gate
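The CI Gate at the end of the pipeline reduces to a pure threshold check over the scorecard. The threshold values below are placeholders; in practice they live in versioned config next to the corpus (Hint 3):

```python
import sys

def gate(scorecard: dict, max_unsafe_pass: int = 0, min_score: float = 95.0) -> bool:
    """Return True when the run passes the security gate."""
    return (scorecard["unsafe_pass"] <= max_unsafe_pass
            and scorecard["security_score"] >= min_score)

def main(scorecard: dict) -> None:
    # A nonzero exit code fails the CI job.
    sys.exit(0 if gate(scorecard) else 1)
```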
Implementation Guide
Phase 1: Corpus + Baseline
- Build initial attack packs and baseline score.
Phase 2: Policy Tuning
- Reduce the unsafe-pass rate without overblocking benign tasks.
Phase 3: CI Integration
- Enforce gate thresholds and trend alerts.
Testing Strategy
- Mutation fuzzing for prompt variants
- Benign control set for false-positive measurement
- Historical replay of previous incident prompts
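To make the benign control set actionable, break false-positive rates out per task class so overblocking shows up where it happens. `fp_by_class` is a hypothetical helper over per-case result dicts whose field names are assumptions:

```python
from collections import defaultdict

def fp_by_class(results: list[dict]) -> dict[str, float]:
    """False-positive rate per task class, computed over benign controls only."""
    counts = defaultdict(lambda: [0, 0])  # class -> [blocked, total]
    for r in results:
        if not r["benign"]:
            continue  # attack cases do not count toward false positives
        counts[r["task_class"]][1] += 1
        if r["blocked"]:
            counts[r["task_class"]][0] += 1
    return {cls: blocked / total for cls, (blocked, total) in counts.items()}
```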
Common Pitfalls & Debugging
| Pitfall | Symptom | Fix |
|---|---|---|
| Toy attack set | Inflated scores | Derive attacks from real workflows |
| Overblocking | User tasks fail | Track false positives by task class |
| Irreproducible failures | Cannot debug | Persist exact prompts, states, and tool traces |
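Reproducibility comes down to persisting everything needed for replay at the moment of failure. A minimal sketch, using a content hash of the record as a stable filename (all field names here are hypothetical):

```python
import hashlib
import json
from pathlib import Path

def persist_failure(case: dict, tool_trace: list, out_dir) -> str:
    """Write everything needed to replay a failure: prompt, state, tool calls."""
    record = {"case": case, "tool_trace": tool_trace}
    blob = json.dumps(record, sort_keys=True)  # canonical form -> stable hash
    digest = hashlib.sha256(blob.encode()).hexdigest()[:12]
    out = Path(out_dir) / f"failure_{digest}.json"
    out.write_text(blob)
    return str(out)
```

Identical failures then deduplicate naturally, since they hash to the same filename.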
Interview Questions They’ll Ask
- How do you define unsafe-pass vs false-positive?
- How do you build realistic attack corpora?
- How do you tune guardrails without crippling utility?
- What belongs in a security deployment gate?
Hints in Layers
- Hint 1: Start with deterministic, labeled attacks.
- Hint 2: Add benign controls early.
- Hint 3: Version your corpus and thresholds.
- Hint 4: Track the trendline, not only the latest run.
Submission / Completion Criteria
Minimum Completion
- Red-team harness runs and outputs categorized scorecard
Full Completion
- CI gating with reproducible failure traces
Excellence
- Automated corpus mutation and risk-prioritized remediation queue