Project 16: Prompt Injection and Tool Exploit Red Team Lab

Build a security evaluation harness that stress-tests agent behavior under adversarial conditions.


Quick Reference

Attribute Value
Difficulty Level 4: Expert
Time Estimate 12-22 hours
Language Python (alt: TypeScript)
Prerequisites Projects 3, 6, 9
Key Topics adversarial tests, policy precision/recall, regression gates

Learning Objectives

  1. Define attack classes relevant to your agent surface.
  2. Build repeatable test suites for injection and exfiltration attempts.
  3. Measure unsafe-pass and false-positive rates.
  4. Integrate security thresholds into CI gates.

The Core Question You’re Answering

“How do you demonstrate security posture with data instead of intuition?”


Concepts You Must Understand First

Concept Why It Matters Where to Learn
Prompt injection patterns Primary LLM attack class security papers + guardrail docs
Evaluation dataset quality Avoid false confidence from noisy tests SWE-bench Verified context
Security scorecards Deployment gate criteria AppSec testing principles

Theoretical Foundation

Attack Corpus -> Agent Under Test -> Policy + Verifier -> Scorecard -> Gate Decision

Security is a continuous measurement loop, not one-time hardening.


Project Specification

What You’ll Build

A lab runner that:

  • Executes 100+ adversarial cases
  • Labels failures by category
  • Produces reproducible traces
  • Fails CI when risk threshold is exceeded

Functional Requirements

  1. Attack corpus loader and mutator
  2. Deterministic execution mode
  3. Failure taxonomy and triage export
  4. Security threshold gating

Non-Functional Requirements

  • Fast reruns for nightly CI
  • Clear reproduction steps for each failure
  • Historical trend tracking

Real World Outcome

$ python p16_red_team.py --suite attack_pack_v3
[tests] 120 loaded
[blocked] 103
[unsafe_pass] 5
[false_positive] 12
[security_score] 95.8%
[artifact] red_team_report.html + failing_cases.json

Architecture Overview

Corpus Manager -> Runner -> Evaluator -> Report Generator -> CI Gate

Implementation Guide

Phase 1: Corpus + Baseline

  • Build initial attack packs and baseline score.

Phase 2: Policy Tuning

  • Reduce unsafe-pass without overblocking benign tasks.

Phase 3: CI Integration

  • Enforce gate thresholds and trend alerts.

Testing Strategy

  • Mutation fuzzing for prompt variants
  • Benign control set for false-positive measurement
  • Historical replay of previous incident prompts

Common Pitfalls & Debugging

Pitfall Symptom Fix
Toy attack set inflated scores derive attacks from real workflows
Overblocking user tasks fail track false-positive by task class
Irreproducible failures cannot debug persist exact prompts + states + tool traces

Interview Questions They’ll Ask

  1. How do you define unsafe-pass vs false-positive?
  2. How do you build realistic attack corpora?
  3. How do you tune guardrails without crippling utility?
  4. What belongs in a security deployment gate?

Hints in Layers

  • Hint 1: Start with deterministic, labeled attacks.
  • Hint 2: Add benign controls early.
  • Hint 3: Version your corpus and thresholds.
  • Hint 4: Track trendline, not only latest run.

Submission / Completion Criteria

Minimum Completion

  • Red-team harness runs and outputs categorized scorecard

Full Completion

  • CI gating with reproducible failure traces

Excellence

  • Automated corpus mutation and risk-prioritized remediation queue