Project 28: Headless Testing Framework - Automated Test Generation
Build an automated test generation system using headless Claude: analyze code to generate tests, run tests and fix failures iteratively, achieve coverage targets, and integrate with CI.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Expert |
| Time Estimate | 3 weeks |
| Language | Python (Alternatives: TypeScript, Go) |
| Prerequisites | Projects 24-27 completed, testing experience |
| Key Topics | TDD, test generation, coverage analysis, iterative refinement, CI integration |
| Main Book | "Test-Driven Development" by Kent Beck |
1. Learning Objectives
By completing this project, you will:
- Automate test generation: Use Claude to create comprehensive test suites
- Implement the TDD loop: Generate, run, fix, repeat until targets met
- Parse test output: Extract failures and feed back to Claude for fixes
- Track coverage metrics: Measure and optimize test coverage
- Handle iteration limits: Know when to stop and escalate
- Integrate with CI: Run automated test generation in pipelines
- Design quality prompts: Craft prompts that produce good tests
2. Real World Outcome
When complete, you'll have a system that automatically generates and refines tests:
$ python auto_test.py --source ./src/auth --target-coverage 80
Automated Test Generation
Phase 1: Analyze code
----- Files: 5
----- Functions: 23
----- Current coverage: 45%
----- Gap to 80%: 35%
Phase 2: Generate tests
----- Generating tests for login()
----- Generating tests for logout()
----- Generating tests for refresh_token()
----- Generating tests for validate_session()
...
Phase 3: Run tests
----- Tests run: 42
----- Passed: 38
----- Failed: 4
----- Coverage: 72%
Phase 4: Fix failing tests (iteration 1)
----- Fixing test_login_invalid_password
----- Fixing test_token_expiry
----- Fixing test_session_validation_edge_case
----- Fixing test_concurrent_logout
Phase 5: Run tests (iteration 2)
----- Tests run: 42
----- Passed: 42
----- Failed: 0
----- Coverage: 83%
Target coverage achieved!
Generated files:
- tests/test_login.py
- tests/test_logout.py
- tests/test_token.py
- tests/test_session.py
The TDD Automation Loop
START -> Analyze -> Generate -> Run -> Coverage met?
                                 ^          |
                                 |     YES: done
                                 |     NO:
                                 +---- Fix failures
3. The Core Question Youโre Answering
"How do I use Claude to automatically generate, run, and iterate on tests until coverage targets are met?"
TDD requires iterative refinement. This project automates the entire cycle: generate tests, run them, fix failures, and repeat until quality targets are achieved. The result is a testing assistant that handles the mechanical work while you focus on design.
Why This Matters
Manual test writing is:
- Time-consuming (often 50%+ of development time)
- Tedious (especially for edge cases)
- Inconsistent (quality varies by developer)
- Often skipped (when deadlines loom)
Automated test generation provides:
- Consistent baseline coverage
- Edge cases you might miss
- Faster initial test suite creation
- Learning opportunity (study generated tests)
4. Concepts You Must Understand First
Stop and research these before coding:
4.1 Test Generation Fundamentals
What makes a good test?
# GOOD TEST: Clear, focused, independent
def test_login_with_valid_credentials():
"""User can log in with correct username and password."""
user = create_test_user(password="secret123")
result = login(user.email, "secret123")
assert result.success is True
assert result.token is not None
# BAD TEST: Unclear, testing too much
def test_login():
# Multiple assertions, unclear what's being tested
login("a@b.com", "123")
login("c@d.com", "456")
# What exactly is this testing?
Key qualities:
- Focused: One test = one behavior
- Named clearly: Describes what's tested
- Independent: No test-order dependencies
- Deterministic: Same result every run
Reference: "Test-Driven Development" by Kent Beck, Chapters 1-5
4.2 Coverage Analysis
# Run pytest with coverage
pytest --cov=src --cov-report=json tests/
# coverage.json structure
{
"totals": {
"covered_lines": 450,
"num_statements": 600,
"percent_covered": 75.0
},
"files": {
"src/auth/login.py": {
"covered_lines": [1,2,5,6,7],
"missing_lines": [10,11,15],
"percent_covered": 70.0
}
}
}
Coverage types:
- Line coverage: Which lines executed
- Branch coverage: Which if/else branches taken
- Function coverage: Which functions called
- Path coverage: Which execution paths taken
Reference: pytest-cov documentation
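To make the coverage numbers concrete, here is a minimal sketch that reads the coverage.json produced by pytest-cov (structured as shown above) and computes the gap to a target; the helper name and the "top five files" cutoff are just illustrative choices:
import json
from pathlib import Path

def coverage_gap(report_path: str = "coverage.json", target: float = 80.0) -> dict:
    """Summarize a pytest-cov JSON report: current coverage, gap to target, worst files."""
    data = json.loads(Path(report_path).read_text())
    current = data["totals"]["percent_covered"]
    # Sort files by how many uncovered lines they have, largest gaps first.
    worst = sorted(
        data.get("files", {}).items(),
        key=lambda item: len(item[1].get("missing_lines", [])),
        reverse=True,
    )
    return {
        "current": current,
        "gap": max(0.0, target - current),
        "worst_files": [name for name, _ in worst[:5]],
    }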
4.3 Iterative Refinement
Reusing the session maintains context across iterations. Capture the session_id from the JSON output, then resume that session when fixing failures:
# Initial generation
claude -p "Generate tests for login.py" --output-format json
# Returns session_id: "abc123"
# Fix failures with context
claude -p "Fix this failing test: $ERROR" --resume abc123
# Claude remembers the code and previous tests
Key Questions:
- How many iterations before giving up?
- When to start fresh vs continue?
- How to provide useful error context? (see the sketch below)
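One answer to the last question: keep the error context small and specific. A minimal sketch; the helper name and size limits are illustrative assumptions, not part of the project:
def format_error_context(failure_text: str, test_source: str, max_chars: int = 2000) -> str:
    """Pair a pytest failure with the source of the failing test, trimmed for the prompt."""
    trimmed_failure = failure_text[-max_chars:]   # keep the tail: assertion message and traceback summary
    trimmed_source = test_source[:max_chars]      # keep the start: imports, fixtures, test body
    return (
        "Failing test source:\n"
        f"{trimmed_source}\n\n"
        "pytest output (short traceback):\n"
        f"{trimmed_failure}\n\n"
        "Fix the test so it passes without changing the code under test."
    )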
5. Questions to Guide Your Design
5.1 What Tests to Generate?
| Test Type | Complexity | Claude's Strength |
|---|---|---|
| Unit tests | Low | Excellent |
| Integration tests | Medium | Good |
| Edge case tests | Medium | Excellent |
| Error handling tests | Medium | Good |
| Performance tests | High | Limited |
| Security tests | High | Good |
5.2 How to Handle Failures?
| Failure Type | Action |
|---|---|
| Syntax error | Fix immediately (obvious issue) |
| Import error | Add missing imports |
| Assertion failure | Analyze; may be a test bug or a code bug |
| Runtime error | Check mocking and dependencies |
| Timeout | Simplify or skip the test |
| Flaky (intermittent) | Add retries or mark as flaky |
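One possible way to encode the matrix above in code is a small classifier over pytest's error text; the keyword checks below are heuristics, not an exhaustive mapping:
def classify_failure(error_text: str) -> str:
    """Map a pytest failure message to a suggested action (heuristic keyword matching)."""
    text = error_text.lower()
    if "syntaxerror" in text:
        return "fix_syntax"          # obvious issue: regenerate or patch the test file
    if "importerror" in text or "modulenotfounderror" in text:
        return "add_imports"
    if "assertionerror" in text or "assert " in text:
        return "analyze_assertion"   # may be a test bug or a real code bug
    if "timeout" in text:
        return "simplify_or_skip"
    return "check_runtime"           # mocking, fixtures, external dependencies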
5.3 When to Stop?
Stopping conditions (a minimal check is sketched after this list):
- Coverage target reached
- All tests pass
- Maximum iterations exceeded
- No progress (same failures repeating)
- Human intervention required
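A minimal sketch of these checks; the field names on the state object are illustrative assumptions:
from dataclasses import dataclass, field

@dataclass
class LoopState:
    iteration: int = 0
    max_iterations: int = 5
    coverage: float = 0.0
    target_coverage: float = 80.0
    all_passed: bool = False
    failure_signatures: list = field(default_factory=list)  # one summary of failures per iteration

def should_stop(state: LoopState) -> tuple[bool, str]:
    """Return (stop, reason) based on the stopping conditions listed above."""
    if state.all_passed and state.coverage >= state.target_coverage:
        return True, "target reached"
    if state.iteration >= state.max_iterations:
        return True, "max iterations exceeded"
    if len(state.failure_signatures) >= 2 and state.failure_signatures[-1] == state.failure_signatures[-2]:
        return True, "no progress: same failures repeating"
    return False, "continue"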
6. Thinking Exercise
Design the TDD Automation Loop
Start: source code + a coverage target (e.g., 80%)
1. ANALYZE CODE - parse the AST, list public functions/methods, get current coverage, identify gaps
2. GENERATE TESTS - Claude creates tests for each gap (edge cases, error handling), using a session for context
3. WRITE FILES - save generated tests to tests/test_<module>.py
4. RUN TESTS - pytest --cov --tb=short, capture the output; if all tests pass and coverage >= target, DONE
5. PARSE OUTPUT - on failures or low coverage, extract failure messages and format them for Claude
6. FIX FAILURES - Claude fixes tests based on the errors, resuming the same session
7. CHECK LIMITS - increment the iteration counter; if iteration < max, loop back to step 4
8. REPORT - after max iterations, report partial results and remaining failures; human intervention needed
Design Questions
- How many iterations before giving up?
- Typically 3-5 for most failures
- More iterations = diminishing returns
- Track if same error repeats
- Should you fix one test at a time or all at once?
- One at a time: Clearer feedback
- All at once: Faster, but harder to debug
- Hybrid: Batch similar errors
- How do you handle flaky tests? (see the sketch after this list)
- Retry 2-3 times before marking as flaky
- Add @pytest.mark.flaky if the failure persists
- Consider test isolation issues
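One way to act on the flaky-test answer is to re-run just the failing test a few times and see whether the outcome changes. A minimal sketch; invoking pytest by test node ID is standard, and the retry count is an arbitrary choice:
import subprocess

def is_flaky(test_node_id: str, reruns: int = 3) -> bool:
    """Re-run a single failing test; mixed pass/fail outcomes suggest it is flaky."""
    outcomes = []
    for _ in range(reruns):
        result = subprocess.run(
            ["pytest", test_node_id, "--tb=no", "-q"],
            capture_output=True,
            text=True,
        )
        outcomes.append(result.returncode == 0)
    # A deterministic failure fails every time; a flaky test passes at least once.
    return any(outcomes) and not all(outcomes)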
7. The Interview Questions They'll Ask
7.1 "How would you automate test generation for a codebase?"
Good answer structure:
- Analyze code to identify functions/methods
- Generate tests with Claude using structured prompts
- Run tests and capture output
- Parse failures and iterate
- Stop when coverage target met or max iterations reached
7.2 "What's the TDD cycle and how would you automate it?"
TDD cycle:
- Red: Write failing test
- Green: Make it pass
- Refactor: Clean up
Automation approach:
- Generate tests (Red is implicit - tests define expected behavior)
- Fix failures (Green)
- Improve tests in next iteration (Refactor)
7.3 "How do you measure test quality beyond coverage?"
Quality metrics:
- Mutation testing: Do tests catch injected bugs?
- Assertion density: A test with no assertions verifies nothing; assertion counts are a rough proxy for rigor (see the sketch below)
- Test isolation: Independent tests are more reliable
- Flakiness rate: How often do tests fail randomly?
- Execution time: Fast tests run more often
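Assertion density is easy to approximate with the ast module. A rough sketch; it only counts bare assert statements, so pytest.raises blocks and unittest-style assertions are missed:
import ast
from pathlib import Path

def assertion_density(test_file: str) -> dict:
    """Count assert statements per test function in a pytest module."""
    tree = ast.parse(Path(test_file).read_text())
    counts = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
            counts[node.name] = sum(isinstance(child, ast.Assert) for child in ast.walk(node))
    return counts  # tests with zero asserts deserve a closer look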
7.4 "How do you handle test failures in an automated pipeline?"
Strategies:
- Parse error messages for context
- Feed errors back to Claude with session context
- Retry with exponential backoff
- After max retries, mark as needing human review
- Generate report of remaining failures
7.5 "What are the limits of AI-generated tests?"
Limitations:
- May miss domain-specific edge cases
- Can generate superficial tests (just calling functions)
- May not understand business logic
- Can create brittle tests tied to implementation
- Requires human review for quality
Best used for:
- Initial test scaffolding
- Edge case suggestions
- Boilerplate test code
- Coverage gap identification
8. Hints in Layers
Hint 1: Start with One File
Generate tests for a single file first:
def generate_tests_for_file(filepath: str) -> str:
prompt = f"""Generate pytest tests for this file:
{read_file(filepath)}
Include:
- Tests for each public function
- Edge cases (empty input, None, boundaries)
- Error handling tests
Output as valid Python code."""
return run_claude(prompt)
Hint 2: Parse pytest Output
Use --tb=short for concise errors:
def run_tests():
result = subprocess.run(
["pytest", "--tb=short", "-v", "--cov=src", "--cov-report=json"],
capture_output=True,
text=True
)
return {
"passed": result.returncode == 0,
"output": result.stdout,
"errors": result.stderr
}
Hint 3: Resume the Session for Context
Maintain session across iterations:
class TDDAutomator:
def __init__(self):
self.session_id = None
def generate_initial(self, prompt):
result = run_claude(prompt, output_format="json")
self.session_id = result["session_id"]
return result
def fix_failures(self, errors):
return run_claude(
f"Fix these test failures:\n{errors}",
continue_session=self.session_id
)
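The hints above call a run_claude helper that isn't shown. A minimal sketch, assuming the claude CLI is on PATH, that --resume accepts a session ID, and that the JSON output contains result and session_id fields as the earlier snippets expect:
import json
import subprocess

def run_claude(prompt: str, output_format: str = "json", continue_session: str | None = None) -> dict:
    """Invoke headless Claude and return its parsed JSON response."""
    cmd = ["claude", "-p", prompt, "--output-format", output_format]
    if continue_session:
        cmd.extend(["--resume", continue_session])  # reuse the earlier session's context
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
    if result.returncode != 0:
        raise RuntimeError(f"claude exited with {result.returncode}: {result.stderr}")
    return json.loads(result.stdout) if output_format == "json" else {"result": result.stdout}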
Hint 4: Set Hard Limits
Prevent infinite loops:
MAX_ITERATIONS = 5
MIN_PROGRESS = 0.1  # require at least 0.1 percentage points of coverage improvement per iteration
for i in range(MAX_ITERATIONS):
result = run_tests()
current_coverage = result["coverage"]
if current_coverage >= target:
break
if i > 0:
improvement = current_coverage - last_coverage
if improvement < MIN_PROGRESS:
print("No progress, stopping")
break
last_coverage = current_coverage
fix_failures(result["errors"])
9. Books That Will Help
| Topic | Book | Chapter/Section |
|---|---|---|
| TDD | "Test-Driven Development" by Kent Beck | All (foundational) |
| Python testing | "Python Testing with pytest" by Brian Okken | Ch. 2-5: Writing tests |
| Test design | "xUnit Test Patterns" by Gerard Meszaros | Ch. 4-6: Test patterns |
| Refactoring | "Refactoring" by Martin Fowler | Ch. 4: Building tests |
| Clean tests | "Clean Code" by Robert Martin | Ch. 9: Unit tests |
10. Implementation Guide
10.1 Complete TDD Automator
import subprocess
import json
from pathlib import Path
from dataclasses import dataclass, field
from typing import Optional
@dataclass
class TestRunResult:
passed: bool
total_tests: int
passed_tests: int
failed_tests: int
coverage: float
errors: list
output: str
@dataclass
class IterationResult:
iteration: int
tests_fixed: int
new_coverage: float
remaining_failures: list
class TDDAutomator:
def __init__(
self,
source_dir: str,
test_dir: str = "tests",
target_coverage: float = 80.0,
max_iterations: int = 5
):
self.source_dir = Path(source_dir)
self.test_dir = Path(test_dir)
self.target_coverage = target_coverage
self.max_iterations = max_iterations
self.session_id: Optional[str] = None
def run(self) -> dict:
"""Run the full TDD automation loop."""
print("\n" + "=" * 60)
print("AUTOMATED TEST GENERATION")
print("=" * 60)
# Phase 1: Analyze
print("\nPhase 1: Analyzing code...")
analysis = self.analyze_code()
print(f" Files: {analysis['file_count']}")
print(f" Functions: {analysis['function_count']}")
print(f" Current coverage: {analysis['coverage']:.1f}%")
if analysis['coverage'] >= self.target_coverage:
print(f"\nCoverage target already met!")
return {"success": True, "coverage": analysis['coverage']}
# Phase 2: Generate initial tests
print("\nPhase 2: Generating tests...")
self.generate_initial_tests(analysis)
# Phase 3-6: Run and iterate
for iteration in range(self.max_iterations):
print(f"\nIteration {iteration + 1}/{self.max_iterations}")
# Run tests
result = self.run_tests()
print(f" Tests: {result.passed_tests}/{result.total_tests} passed")
print(f" Coverage: {result.coverage:.1f}%")
# Check if done
if result.passed and result.coverage >= self.target_coverage:
print(f"\nTarget coverage achieved!")
return {"success": True, "coverage": result.coverage}
# Check if no failures to fix
if result.passed and result.coverage < self.target_coverage:
print("\nAll tests pass but coverage too low.")
print("Generating additional tests...")
self.generate_additional_tests(result.coverage)
continue
# Fix failures
print(f" Fixing {result.failed_tests} failures...")
self.fix_failures(result.errors)
# Max iterations reached
print(f"\nMax iterations reached.")
final_result = self.run_tests()
return {
"success": False,
"coverage": final_result.coverage,
"remaining_failures": final_result.errors
}
def analyze_code(self) -> dict:
"""Analyze source code for testing opportunities."""
files = list(self.source_dir.glob("**/*.py"))
files = [f for f in files if not f.name.startswith("test_")]
# Get current coverage
coverage = self._get_coverage()
# Count functions (simplified)
function_count = 0
for f in files:
content = f.read_text()
function_count += content.count("def ")
return {
"file_count": len(files),
"function_count": function_count,
"coverage": coverage,
"files": [str(f) for f in files]
}
def generate_initial_tests(self, analysis: dict):
"""Generate initial test suite."""
prompt = f"""Generate a comprehensive pytest test suite for these files:
{chr(10).join(analysis['files'])}
For each file, create tests covering:
1. Normal functionality (happy path)
2. Edge cases (empty input, None, boundaries)
3. Error handling (invalid input, exceptions)
4. Integration between functions
Output as complete, runnable Python test files.
Use pytest conventions.
Include necessary imports and fixtures.
"""
result = self._run_claude(prompt)
self.session_id = result.get("session_id")
# Save generated tests
self._save_tests(result.get("result", ""))
def generate_additional_tests(self, current_coverage: float):
"""Generate more tests to increase coverage."""
# Get coverage report to find gaps
coverage_data = self._get_coverage_details()
prompt = f"""Current coverage is {current_coverage:.1f}%, target is {self.target_coverage}%.
These lines are not covered:
{json.dumps(coverage_data.get('missing_lines', {}), indent=2)}
Generate additional tests to cover these specific lines.
Focus on the uncovered code paths."""
result = self._run_claude(prompt, continue_session=True)
self._save_tests(result.get("result", ""), append=True)
def run_tests(self) -> TestRunResult:
"""Run pytest and capture results."""
result = subprocess.run(
[
"pytest",
str(self.test_dir),
"-v",
"--tb=short",
f"--cov={self.source_dir}",
"--cov-report=json"
],
capture_output=True,
text=True
)
# Parse results
coverage = self._get_coverage()
errors = self._parse_failures(result.stdout + result.stderr)
# Count tests (parse from output)
passed = result.stdout.count(" PASSED")
failed = result.stdout.count(" FAILED")
return TestRunResult(
passed=result.returncode == 0,
total_tests=passed + failed,
passed_tests=passed,
failed_tests=failed,
coverage=coverage,
errors=errors,
output=result.stdout
)
def fix_failures(self, errors: list):
"""Ask Claude to fix failing tests."""
error_text = "\n\n".join(errors[:5]) # Limit to 5 errors
prompt = f"""These tests are failing:
{error_text}
Fix each failing test. The issue might be:
1. Test logic is wrong
2. Missing imports or fixtures
3. Incorrect assertions
4. Need for mocking
Provide the corrected test code."""
result = self._run_claude(prompt, continue_session=True)
self._update_tests(result.get("result", ""))
def _run_claude(self, prompt: str, continue_session: bool = False) -> dict:
"""Run Claude with optional session continuation."""
cmd = ["claude", "-p", prompt, "--output-format", "json"]
if continue_session and self.session_id:
            cmd.extend(["--resume", self.session_id])
result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
if result.returncode != 0:
raise Exception(f"Claude error: {result.stderr}")
return json.loads(result.stdout)
def _get_coverage(self) -> float:
"""Get current test coverage percentage."""
try:
with open("coverage.json") as f:
data = json.load(f)
return data.get("totals", {}).get("percent_covered", 0)
except FileNotFoundError:
return 0.0
def _get_coverage_details(self) -> dict:
"""Get detailed coverage information."""
try:
with open("coverage.json") as f:
data = json.load(f)
missing = {}
for file, info in data.get("files", {}).items():
if info.get("missing_lines"):
missing[file] = info["missing_lines"]
return {"missing_lines": missing}
except FileNotFoundError:
return {}
def _parse_failures(self, output: str) -> list:
"""Extract failure messages from pytest output."""
failures = []
lines = output.split("\n")
in_failure = False
current_failure = []
for line in lines:
if "FAILED" in line:
in_failure = True
current_failure = [line]
elif in_failure:
if line.startswith("===") or line.startswith("---"):
if current_failure:
failures.append("\n".join(current_failure))
in_failure = False
current_failure = []
else:
current_failure.append(line)
return failures
def _save_tests(self, content: str, append: bool = False):
"""Save generated test content to files."""
self.test_dir.mkdir(exist_ok=True)
# Extract test code from Claude's response
test_code = self._extract_code(content)
# Write to test file
test_file = self.test_dir / "test_generated.py"
mode = "a" if append else "w"
with open(test_file, mode) as f:
f.write(test_code)
f.write("\n\n")
def _update_tests(self, content: str):
"""Update test file with fixes."""
# In practice, you'd parse which tests to update
# For simplicity, append fixes
self._save_tests(content, append=True)
def _extract_code(self, content: str) -> str:
"""Extract Python code from Claude's response."""
# Look for code blocks
if "```python" in content:
start = content.find("```python") + 9
end = content.find("```", start)
return content[start:end].strip()
elif "```" in content:
start = content.find("```") + 3
end = content.find("```", start)
return content[start:end].strip()
return content
def main():
import argparse
parser = argparse.ArgumentParser(description="Automated TDD with Claude")
parser.add_argument("--source", required=True, help="Source directory")
parser.add_argument("--target-coverage", type=float, default=80.0)
parser.add_argument("--max-iterations", type=int, default=5)
args = parser.parse_args()
automator = TDDAutomator(
source_dir=args.source,
target_coverage=args.target_coverage,
max_iterations=args.max_iterations
)
result = automator.run()
print("\n" + "=" * 60)
print("FINAL RESULT")
print("=" * 60)
print(f"Success: {result['success']}")
print(f"Coverage: {result['coverage']:.1f}%")
if not result['success']:
print(f"Remaining failures: {len(result.get('remaining_failures', []))}")
if __name__ == "__main__":
main()
10.2 GitHub Actions Integration
name: Auto Test Generation
on:
push:
branches: [main]
workflow_dispatch:
inputs:
target_coverage:
description: 'Target coverage percentage'
required: false
default: '80'
jobs:
generate-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Setup Python
uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Install dependencies
run: |
pip install pytest pytest-cov
npm install -g @anthropic-ai/claude-code
- name: Run TDD Automator
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
run: |
python auto_test.py \
--source ./src \
--target-coverage ${{ github.event.inputs.target_coverage || '80' }}
      - name: Export coverage percentage
        if: success()
        run: |
          # Expose the percentage from pytest-cov's coverage.json so later steps can reference env.COVERAGE
          echo "COVERAGE=$(python -c 'import json; print(round(json.load(open("coverage.json"))["totals"]["percent_covered"], 1))')" >> "$GITHUB_ENV"
      - name: Upload coverage report
uses: codecov/codecov-action@v3
with:
file: coverage.json
- name: Create PR with generated tests
if: success()
uses: peter-evans/create-pull-request@v5
with:
title: "Auto-generated tests (coverage: ${{ env.COVERAGE }}%)"
commit-message: "Add auto-generated tests"
branch: auto-tests
body: |
## Automated Test Generation
This PR adds auto-generated tests achieving ${{ env.COVERAGE }}% coverage.
Please review the generated tests for:
- Correctness of assertions
- Appropriate edge cases
- Test clarity and naming
10.3 Prompt Engineering for Better Tests
GENERATION_PROMPT_TEMPLATE = """You are an expert test engineer. Generate comprehensive pytest tests for this code:

```python
{source_code}
```

Requirements:

Test Coverage
- Test every public function/method
- Include at least 3 tests per function
- Cover happy path, edge cases, and error handling

Test Quality
- Use descriptive test names: test_<function>_<scenario>_<expected_result>
- Prefer one assertion per test
- Use fixtures for setup/teardown
- Add docstrings explaining what each test verifies

Edge Cases to Consider
- Empty inputs ([], "", None, 0)
- Boundary values (min, max, off-by-one)
- Invalid types
- Large inputs
- Concurrent access (if applicable)

Mocking
- Mock external dependencies (APIs, databases)
- Use @pytest.fixture for reusable mocks
- Use unittest.mock.patch for patching

Output Format:

import pytest
from unittest.mock import Mock, patch
from {module} import {functions}

# Fixtures
@pytest.fixture
def sample_data():
    return {{...}}

# Tests
def test_function_happy_path():
    \"\"\"Verify function works with valid input.\"\"\"
    result = function(valid_input)
    assert result == expected

def test_function_empty_input():
    \"\"\"Verify function handles empty input gracefully.\"\"\"
    result = function([])
    assert result == {{}}

# ... more tests
"""

FIX_FAILURES_PROMPT = """These tests are failing. Fix them:

{failures}

Analysis

For each failure:
1. Identify whether the issue is in the test or in the code under test
2. If it is a test issue: fix the test logic, assertions, or setup
3. If it is a code issue: document it but don't change the source code

Output

Provide the corrected test code. Explain what was wrong and how you fixed it.
"""
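For reference, here is a hypothetical way these templates could be filled in; the file path, module name, function list, and failure text below are placeholders, not part of the project:
from pathlib import Path

# Hypothetical usage: all names here are placeholders.
source_path = Path("src/auth/login.py")
prompt = GENERATION_PROMPT_TEMPLATE.format(
    source_code=source_path.read_text(),
    module="src.auth.login",
    functions="login, logout",
)

# After a failing run, feed parsed failures back through the fix template.
failures = ["<pytest failure text for one test>"]
fix_prompt = FIX_FAILURES_PROMPT.format(failures="\n\n".join(failures))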
11. Learning Milestones
| Milestone | Description | Verification |
|-----------|-------------|--------------|
| 1 | Tests generate and save | test_generated.py exists |
| 2 | Tests can be run | pytest runs without crash |
| 3 | Coverage is measured | coverage.json has data |
| 4 | Failures are parsed | Error list extracted |
| 5 | Fixes are applied | Fewer failures next run |
| 6 | Loop terminates correctly | Stops at target or max iterations |
| 7 | CI integration works | GitHub Action runs successfully |
12. Common Pitfalls
12.1 Infinite Loops
# WRONG: No exit condition
while coverage < target:
fix_failures()
coverage = run_tests() # May never improve
# RIGHT: Limit iterations and check progress
for i in range(MAX_ITERATIONS):
if coverage >= target:
break
if coverage == last_coverage:
print("No progress, stopping")
break
last_coverage = coverage
12.2 Context Loss
# WRONG: New session each time (loses context)
claude("-p", "Fix these tests")
claude("-p", "Now fix these other tests")
# RIGHT: Continue session
result = claude("-p", "Generate tests", "--output-format", "json")
session_id = result["session_id"]
claude("-p", "Fix tests", "--continue", session_id)
12.3 Overwhelming Error Context
# WRONG: Send all 50 errors
prompt = f"Fix all these errors:\n{all_errors}" # Too much!
# RIGHT: Batch errors
for batch in chunk_list(errors, size=5):
prompt = f"Fix these errors:\n{batch}"
fix_batch(prompt)
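The chunk_list helper used above isn't defined; a minimal version:
def chunk_list(items: list, size: int = 5):
    """Yield successive fixed-size batches from a list."""
    for start in range(0, len(items), size):
        yield items[start:start + size]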
12.4 Ignoring Test Quality
# WRONG: Only care about coverage
if coverage >= 80:
print("Done!")
# RIGHT: Also check test quality
if coverage >= 80:
if all_tests_meaningful(): # Check assertions exist
if no_flaky_tests():
print("Done!")
13. Extension Ideas
- Mutation testing: Verify tests catch injected bugs
- Property-based testing: Generate hypothesis tests
- Visual diff: Show before/after test coverage
- Test prioritization: Run most likely to fail first
- Multi-language: Support JavaScript (Jest), Go, Rust
- Team metrics: Track test generation effectiveness over time
14. Summary
This project teaches you to:
- Automate the TDD cycle with Claude
- Generate, run, and iterate on tests
- Parse test output for feedback
- Track and improve coverage metrics
- Integrate with CI/CD pipelines
The key insight is that test generation is an iterative process. The first pass rarely achieves perfect coverage. By maintaining session context and feeding failures back to Claude, you create a learning loop that progressively improves the test suite.
Key takeaway: Automated test generation is a force multiplier, not a replacement for human judgment. Use it to create the initial scaffolding, then review and refine. The goal is faster time-to-coverage, not perfect tests on first try.