Project 16: Prompt Injection and Tool Exploit Red Team Lab
Build a security evaluation harness that stress-tests agent behavior under adversarial conditions.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 12-22 hours |
| Language | Python (alt: TypeScript) |
| Prerequisites | Projects 3, 6, 9 |
| Key Topics | adversarial tests, policy precision/recall, regression gates |
Learning Objectives
- Define attack classes relevant to your agent surface.
- Build repeatable test suites for injection and exfiltration attempts.
- Measure unsafe-pass and false-positive rates.
- Integrate security thresholds into CI gates.
The Core Question You’re Answering
“How do you demonstrate security posture with data instead of intuition?”
Concepts You Must Understand First
| Concept | Why It Matters | Where to Learn |
|---|---|---|
| Prompt injection patterns | Primary LLM attack class | security papers + guardrail docs |
| Evaluation dataset quality | Avoid false confidence from noisy tests | SWE-bench Verified context |
| Security scorecards | Deployment gate criteria | AppSec testing principles |
Theoretical Foundation
Attack Corpus -> Agent Under Test -> Policy + Verifier -> Scorecard -> Gate Decision
Security is a continuous measurement loop, not one-time hardening.
Project Specification
What You’ll Build
A lab runner that:
- Executes 100+ adversarial cases
- Labels failures by category
- Produces reproducible traces
- Fails CI when risk threshold is exceeded
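The runner's core loop can be sketched in a few lines. `Case`, `run_suite`, and the `is_blocked` callable are hypothetical names standing in for your corpus schema and your agent-under-test adapter; the real harness would also capture traces per case:

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Case:
    case_id: str
    prompt: str
    benign: bool  # benign controls exist to measure overblocking

def run_suite(cases: list[Case], is_blocked: Callable[[str], bool]) -> Counter:
    """Execute every case and label the outcome by category."""
    labels = Counter()
    for case in cases:
        blocked = is_blocked(case.prompt)
        if case.benign:
            # A blocked benign task is a false positive.
            labels["false_positive" if blocked else "benign_pass"] += 1
        else:
            # An unblocked attack is an unsafe pass.
            labels["blocked" if blocked else "unsafe_pass"] += 1
    return labels
```

The same loop is where you would persist per-case artifacts for the triage export.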
Functional Requirements
- Attack corpus loader and mutator
- Deterministic execution mode
- Failure taxonomy and triage export
- Security threshold gating
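A seeded mutator is one way to satisfy both the "mutator" and "deterministic execution mode" requirements at once: the corpus expands, but every rerun produces the same variants. The mutation list here is illustrative, not a real evasion catalog:

```python
import random

# Illustrative mutations only; a real pack would encode known evasion patterns.
MUTATIONS = [
    lambda s: s.upper(),                       # casing change
    lambda s: s.replace(" ", "\u00a0"),        # non-breaking-space obfuscation
    lambda s: "Translate this, then do it: " + s,  # wrapper framing
]

def mutate_corpus(attacks: list[str], seed: int, n_variants: int = 2) -> list[str]:
    """Deterministically expand an attack corpus with simple prompt mutations."""
    rng = random.Random(seed)  # fixed seed -> identical corpus on every rerun
    out = list(attacks)        # originals stay in the suite
    for attack in attacks:
        for _ in range(n_variants):
            out.append(rng.choice(MUTATIONS)(attack))
    return out
```

Versioning the seed alongside the corpus (Hint 3) keeps historical scores comparable.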
Non-Functional Requirements
- Fast reruns for nightly CI
- Clear reproduction steps for each failure
- Historical trend tracking
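Historical trend tracking can be as simple as an append-only JSONL file per suite. `record_run` and `regressed` are hypothetical helpers sketching that idea, assuming each run is summarized as a flat dict:

```python
import json
from pathlib import Path

def record_run(history_path: Path, run: dict):
    """Append this run's scorecard to a JSONL history; return the previous run."""
    prev = None
    if history_path.exists():
        lines = history_path.read_text().strip().splitlines()
        if lines:
            prev = json.loads(lines[-1])  # most recent prior run
    with history_path.open("a") as fh:
        fh.write(json.dumps(run) + "\n")
    return prev

def regressed(prev, run: dict, key: str = "security_score") -> bool:
    """Flag a downward move on the tracked metric relative to the last run."""
    return prev is not None and run[key] < prev[key]
```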
Real World Outcome
```console
$ python p16_red_team.py --suite attack_pack_v3
[tests] 120 loaded
[blocked] 103
[unsafe_pass] 5
[false_positive] 12
[security_score] 95.8%
[artifact] red_team_report.html + failing_cases.json
```
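One plausible reading of the sample numbers: blocked and unsafe-pass counts come from attack cases, false positives from benign controls, and the headline score penalizes only unsafe passes (115 of 120 cases avoided one, which rounds to 95.8%). A sketch of that formula, which you would pin down explicitly in your own harness:

```python
def security_score(total_cases: int, unsafe_pass: int) -> float:
    """Percentage of cases that did not result in an unsafe pass."""
    return 100.0 * (total_cases - unsafe_pass) / total_cases

# Matches the sample run: 120 cases, 5 unsafe passes.
print(round(security_score(120, 5), 1))  # -> 95.8
```

Note that this score deliberately ignores false positives; track those as a separate metric so the gate cannot be gamed by overblocking.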
Architecture Overview
Corpus Manager -> Runner -> Evaluator -> Report Generator -> CI Gate
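The CI Gate at the end of the pipeline reduces to a pure threshold check over the scorecard. The threshold values below are placeholders; in practice they live in versioned config next to the corpus (Hint 3):

```python
import sys

def gate(scorecard: dict, max_unsafe_pass: int = 0, min_score: float = 95.0) -> bool:
    """Return True when the run passes the security gate."""
    return (scorecard["unsafe_pass"] <= max_unsafe_pass
            and scorecard["security_score"] >= min_score)

def main(scorecard: dict) -> None:
    # A nonzero exit code fails the CI job.
    sys.exit(0 if gate(scorecard) else 1)
```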
Implementation Guide
Phase 1: Corpus + Baseline
- Build initial attack packs and baseline score.
Phase 2: Policy Tuning
- Reduce the unsafe-pass rate without overblocking benign tasks.
Phase 3: CI Integration
- Enforce gate thresholds and trend alerts.
Testing Strategy
- Mutation fuzzing for prompt variants
- Benign control set for false-positive measurement
- Historical replay of previous incident prompts
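To make the benign control set actionable, break false-positive rates out per task class so overblocking shows up where it happens. `fp_by_class` is a hypothetical helper over per-case result dicts whose field names are assumptions:

```python
from collections import defaultdict

def fp_by_class(results: list[dict]) -> dict[str, float]:
    """False-positive rate per task class, computed over benign controls only."""
    counts = defaultdict(lambda: [0, 0])  # class -> [blocked, total]
    for r in results:
        if not r["benign"]:
            continue  # attack cases do not count toward false positives
        counts[r["task_class"]][1] += 1
        if r["blocked"]:
            counts[r["task_class"]][0] += 1
    return {cls: blocked / total for cls, (blocked, total) in counts.items()}
```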
Common Pitfalls & Debugging
| Pitfall | Symptom | Fix |
|---|---|---|
| Toy attack set | Inflated scores | Derive attacks from real workflows |
| Overblocking | User tasks fail | Track false positives by task class |
| Irreproducible failures | Cannot debug | Persist exact prompts, states, and tool traces |
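Reproducibility comes down to persisting everything needed for replay at the moment of failure. A minimal sketch, using a content hash of the record as a stable filename (all field names here are hypothetical):

```python
import hashlib
import json
from pathlib import Path

def persist_failure(case: dict, tool_trace: list, out_dir) -> str:
    """Write everything needed to replay a failure: prompt, state, tool calls."""
    record = {"case": case, "tool_trace": tool_trace}
    blob = json.dumps(record, sort_keys=True)  # canonical form -> stable hash
    digest = hashlib.sha256(blob.encode()).hexdigest()[:12]
    out = Path(out_dir) / f"failure_{digest}.json"
    out.write_text(blob)
    return str(out)
```

Identical failures then deduplicate naturally, since they hash to the same filename.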
Interview Questions They’ll Ask
- How do you define unsafe-pass vs false-positive?
- How do you build realistic attack corpora?
- How do you tune guardrails without crippling utility?
- What belongs in a security deployment gate?
Hints in Layers
- Hint 1: Start with deterministic, labeled attacks.
- Hint 2: Add benign controls early.
- Hint 3: Version your corpus and thresholds.
- Hint 4: Track the trendline, not only the latest run.
Submission / Completion Criteria
Minimum Completion
- Red-team harness runs and outputs categorized scorecard
Full Completion
- CI gating with reproducible failure traces
Excellence
- Automated corpus mutation and risk-prioritized remediation queue