Project 28: Headless Testing Framework - Automated Test Generation

Build an automated test generation system using headless Claude: analyze code to generate tests, run tests and fix failures iteratively, achieve coverage targets, and integrate with CI.

Quick Reference

| Attribute | Value |
|-----------|-------|
| Difficulty | Expert |
| Time Estimate | 3 weeks |
| Language | Python (Alternatives: TypeScript, Go) |
| Prerequisites | Projects 24-27 completed, testing experience |
| Key Topics | TDD, test generation, coverage analysis, iterative refinement, CI integration |
| Main Book | "Test Driven Development" by Kent Beck |

1. Learning Objectives

By completing this project, you will:

  1. Automate test generation: Use Claude to create comprehensive test suites
  2. Implement the TDD loop: Generate, run, fix, repeat until targets met
  3. Parse test output: Extract failures and feed back to Claude for fixes
  4. Track coverage metrics: Measure and optimize test coverage
  5. Handle iteration limits: Know when to stop and escalate
  6. Integrate with CI: Run automated test generation in pipelines
  7. Design quality prompts: Craft prompts that produce good tests

2. Real World Outcome

When complete, you'll have a system that automatically generates and refines tests:

$ python auto_test.py --source ./src/auth --target-coverage 80

Automated Test Generation

Phase 1: Analyze code
----- Files: 5
----- Functions: 23
----- Current coverage: 45%
----- Gap to 80%: 35%

Phase 2: Generate tests
----- Generating tests for login()
----- Generating tests for logout()
----- Generating tests for refresh_token()
----- Generating tests for validate_session()
...

Phase 3: Run tests
----- Tests run: 42
----- Passed: 38
----- Failed: 4
----- Coverage: 72%

Phase 4: Fix failing tests (iteration 1)
----- Fixing test_login_invalid_password
----- Fixing test_token_expiry
----- Fixing test_session_validation_edge_case
----- Fixing test_concurrent_logout

Phase 5: Run tests (iteration 2)
----- Tests run: 42
----- Passed: 42
----- Failed: 0
----- Coverage: 83%

Target coverage achieved!

Generated files:
- tests/test_login.py
- tests/test_logout.py
- tests/test_token.py
- tests/test_session.py

The TDD Automation Loop

START ──► Analyze ──► Generate ──► Run ──┬──► Coverage met?
                                         │         │
             ▲                           │         ├── YES ──► Done!
             │                           │         └── NO ───┐
             │                           │                   │
             └──── Fix failures ◄────────┴───────────────────┘

3. The Core Question Youโ€™re Answering

"How do I use Claude to automatically generate, run, and iterate on tests until coverage targets are met?"

TDD requires iterative refinement. This project automates the entire cycle: generate tests, run them, fix failures, and repeat until quality targets are achieved. The result is a testing assistant that handles the mechanical work while you focus on design.

Why This Matters

Manual test writing is:

  • Time-consuming (often 50%+ of development time)
  • Tedious (especially for edge cases)
  • Inconsistent (quality varies by developer)
  • Often skipped (when deadlines loom)

Automated test generation provides:

  • Consistent baseline coverage
  • Edge cases you might miss
  • Faster initial test suite creation
  • Learning opportunity (study generated tests)

4. Concepts You Must Understand First

Stop and research these before coding:

4.1 Test Generation Fundamentals

What makes a good test?

# GOOD TEST: Clear, focused, independent
def test_login_with_valid_credentials():
    """User can log in with correct username and password."""
    user = create_test_user(password="secret123")
    result = login(user.email, "secret123")
    assert result.success is True
    assert result.token is not None

# BAD TEST: Unclear, testing too much
def test_login():
    # Multiple assertions, unclear what's being tested
    login("a@b.com", "123")
    login("c@d.com", "456")
    # What exactly is this testing?

Key qualities:

  • Focused: One test = one behavior
  • Named clearly: Describes what's tested
  • Independent: No test-order dependencies
  • Deterministic: Same result every run

Reference: "Test Driven Development" by Kent Beck, Chapters 1-5

4.2 Coverage Analysis

# Run pytest with coverage
pytest --cov=src --cov-report=json tests/

# coverage.json structure (abridged)
{
  "totals": {
    "covered_lines": 450,
    "num_statements": 600,
    "percent_covered": 75.0
  },
  "files": {
    "src/auth/login.py": {
      "executed_lines": [1, 2, 5, 6, 7],
      "missing_lines": [10, 11, 15],
      "summary": {"percent_covered": 70.0}
    }
  }
}
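
A small helper sketch that reads this report and lists the files still below a target percentage (it assumes the JSON layout shown above):

import json

def files_below_target(report_path: str = "coverage.json", target: float = 80.0) -> dict:
    """Return {file: percent_covered} for files under the coverage target."""
    with open(report_path) as f:
        report = json.load(f)
    gaps = {}
    for path, info in report.get("files", {}).items():
        percent = info.get("summary", {}).get("percent_covered", 0.0)
        if percent < target:
            gaps[path] = percent
    return gaps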

Coverage types:

  • Line coverage: Which lines executed
  • Branch coverage: Which if/else branches taken
  • Function coverage: Which functions called
  • Path coverage: Which execution paths taken

Reference: pytest-cov documentation

4.3 Iterative Refinement

Resuming a session maintains context across iterations. In headless mode, --output-format json returns a session_id; pass it back with --resume (plain --continue reattaches to the most recent conversation instead):

# Initial generation
claude -p "Generate tests for login.py" --output-format json
# Returns session_id: "abc123"

# Fix failures with context
claude -p "Fix this failing test: $ERROR" --resume abc123
# Claude remembers the code and previous tests
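
A minimal Python wrapper for the same pattern (a sketch; it assumes the JSON payload carries the result and session_id fields shown above):

import json
import subprocess
from typing import Optional

def ask_claude(prompt: str, session_id: Optional[str] = None) -> dict:
    """Run claude -p headlessly and return the parsed JSON payload."""
    cmd = ["claude", "-p", prompt, "--output-format", "json"]
    if session_id:
        cmd.extend(["--resume", session_id])  # reattach to a specific session
    proc = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return json.loads(proc.stdout)

# First call creates the session; later calls reuse it for context
first = ask_claude("Generate tests for login.py")
fix = ask_claude("Fix this failing test: ...", session_id=first.get("session_id"))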

Key Questions:

  • How many iterations before giving up?
  • When to start fresh vs continue?
  • How to provide useful error context?

5. Questions to Guide Your Design

5.1 What Tests to Generate?

| Test Type | Complexity | Claude's Strength |
|-----------|------------|-------------------|
| Unit tests | Low | Excellent |
| Integration tests | Medium | Good |
| Edge case tests | Medium | Excellent |
| Error handling tests | Medium | Good |
| Performance tests | High | Limited |
| Security tests | High | Good |

5.2 How to Handle Failures?

| Failure Type | Action |
|--------------|--------|
| Syntax error | Fix immediately (obvious issue) |
| Import error | Add missing imports |
| Assertion failure | Analyze; may be a test or code bug |
| Runtime error | Check mocking, dependencies |
| Timeout | Simplify or skip the test |
| Flaky (intermittent) | Add retries or mark as flaky |
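
A rough classifier sketch for routing failures according to this matrix (the string checks are heuristics, not a full pytest parser):

def classify_failure(failure_text: str) -> str:
    """Map a pytest failure block to one of the actions in the matrix above."""
    text = failure_text.lower()
    if "syntaxerror" in text:
        return "fix_syntax"
    if "importerror" in text or "modulenotfounderror" in text:
        return "add_missing_imports"
    if "assertionerror" in text:
        return "analyze_assertion"
    if "timeout" in text:
        return "simplify_or_skip"
    return "check_mocking_and_dependencies"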

5.3 When to Stop?

Stopping conditions (combined into a single check in the sketch after this list):

  • Coverage target reached
  • All tests pass
  • Maximum iterations exceeded
  • No progress (same failures repeating)
  • Human intervention required
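
A sketch of that combined check (the LoopState fields are illustrative, not part of any library):

from dataclasses import dataclass

@dataclass
class LoopState:
    coverage: float
    target_coverage: float
    failed_tests: int
    iteration: int
    max_iterations: int
    failing_names: frozenset = frozenset()
    previous_failing_names: frozenset = frozenset()
    needs_human: bool = False

def should_stop(state: LoopState) -> bool:
    """True when any terminal condition from the list above holds."""
    if state.failed_tests == 0 and state.coverage >= state.target_coverage:
        return True                                   # target reached, all tests pass
    if state.iteration >= state.max_iterations:
        return True                                   # iteration budget exhausted
    if state.failing_names and state.failing_names == state.previous_failing_names:
        return True                                   # same failures repeating: no progress
    return state.needs_human                          # explicit escalation flag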

6. Thinking Exercise

Design the TDD Automation Loop

Start: source code + coverage target (e.g., 80%)
    │
    ▼
1. ANALYZE CODE ────── parse the AST, list public functions/methods,
    │                  read current coverage, identify gaps
    ▼
2. GENERATE TESTS ──── Claude creates tests for each gap: edge cases,
    │                  error handling; keep the session for context
    ▼
3. WRITE FILES ─────── save generated tests to tests/test_<module>.py
    │
    ▼
4. RUN TESTS ───────── pytest --cov --tb=short, capture the output
    │
    ├── all pass and coverage >= target ──► DONE!
    │
    ▼  (failures or low coverage)
5. PARSE OUTPUT ────── extract failure messages, format them for Claude
    │
    ▼
6. FIX FAILURES ────── Claude fixes the tests in the resumed session
    │                  (iteration++)
    ▼
7. CHECK LIMITS ────── iteration < max?  ── yes ──► loop back to step 4
    │  no (max iterations reached)
    ▼
8. REPORT ──────────── partial results + remaining failures;
                       human intervention needed

Design Questions

  1. How many iterations before giving up?
    • Typically 3-5 for most failures
    • More iterations = diminishing returns
    • Track if the same error repeats (see the fingerprint sketch after this list)
  2. Should you fix one test at a time or all at once?
    • One at a time: Clearer feedback
    • All at once: Faster, but harder to debug
    • Hybrid: Batch similar errors
  3. How do you handle flaky tests?
    • Retry 2-3 times before marking as flaky
    • Add @pytest.mark.flaky if persistent
    • Consider test isolation issues
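
A small sketch of the "same error repeats" check from question 1: fingerprint the failing test names each iteration and stop when the set stops changing (run_and_collect_failures is a hypothetical helper returning pytest failure blocks):

def failure_fingerprint(failures: list) -> frozenset:
    """Reduce pytest failure blocks to the set of failing test names."""
    names = set()
    for block in failures:
        first_line = block.splitlines()[0] if block.strip() else ""
        if "::" in first_line:
            names.add(first_line.split("::")[-1].split()[0])
    return frozenset(names)

previous = None
for iteration in range(5):
    failures = run_and_collect_failures()
    current = failure_fingerprint(failures)
    if current and current == previous:
        print("Same failures as last iteration; escalating to a human")
        break
    previous = current
    # otherwise feed the failures back to Claude and try again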

7. The Interview Questions They'll Ask

7.1 "How would you automate test generation for a codebase?"

Good answer structure:

  1. Analyze code to identify functions/methods
  2. Generate tests with Claude using structured prompts
  3. Run tests and capture output
  4. Parse failures and iterate
  5. Stop when coverage target met or max iterations reached

7.2 "What's the TDD cycle and how would you automate it?"

TDD cycle:

  1. Red: Write failing test
  2. Green: Make it pass
  3. Refactor: Clean up

Automation approach:

  • Generate tests (Red is implicit - tests define expected behavior)
  • Fix failures (Green)
  • Improve tests in next iteration (Refactor)

7.3 "How do you measure test quality beyond coverage?"

Quality metrics:

  • Mutation testing: Do tests catch injected bugs?
  • Assertion density: Tests with more assertions find more bugs (sketched after this list)
  • Test isolation: Independent tests are more reliable
  • Flakiness rate: How often do tests fail randomly?
  • Execution time: Fast tests run more often
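
Assertion density, for example, is easy to approximate with the standard ast module; a sketch that counts assert statements per test function:

import ast
from pathlib import Path

def assertion_density(test_file: str) -> float:
    """Average number of assert statements per test_* function.
    Heuristic only: ignores pytest.raises and unittest-style assertions."""
    tree = ast.parse(Path(test_file).read_text())
    counts = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name.startswith("test_"):
            counts.append(sum(isinstance(n, ast.Assert) for n in ast.walk(node)))
    return sum(counts) / len(counts) if counts else 0.0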

7.4 "How do you handle test failures in an automated pipeline?"

Strategies:

  • Parse error messages for context
  • Feed errors back to Claude with session context
  • Retry with exponential backoff (sketched after this list)
  • After max retries, mark as needing human review
  • Generate report of remaining failures
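
For the retry strategy, a sketch of exponential backoff around a headless call (run_claude is assumed to raise on failure, as in the hints below):

import time

def run_claude_with_retries(prompt: str, max_retries: int = 3) -> str:
    """Retry a flaky headless call, doubling the wait between attempts."""
    delay = 2.0
    for attempt in range(1, max_retries + 1):
        try:
            return run_claude(prompt)  # assumed helper that raises on failure
        except Exception as error:
            if attempt == max_retries:
                raise  # hand off to human review / failure report
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.0f}s")
            time.sleep(delay)
            delay *= 2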

7.5 "What are the limits of AI-generated tests?"

Limitations:

  • May miss domain-specific edge cases
  • Can generate superficial tests (just calling functions)
  • May not understand business logic
  • Can create brittle tests tied to implementation
  • Requires human review for quality

Best used for:

  • Initial test scaffolding
  • Edge case suggestions
  • Boilerplate test code
  • Coverage gap identification

8. Hints in Layers

Hint 1: Start with One File

Generate tests for a single file first:

def generate_tests_for_file(filepath: str) -> str:
    prompt = f"""Generate pytest tests for this file:

{read_file(filepath)}

Include:
- Tests for each public function
- Edge cases (empty input, None, boundaries)
- Error handling tests

Output as valid Python code."""

    return run_claude(prompt)
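
Hint 1 assumes two small helpers, read_file and run_claude; a minimal sketch of both (headless claude prints plain text to stdout by default):

import subprocess
from pathlib import Path

def read_file(filepath: str) -> str:
    return Path(filepath).read_text()

def run_claude(prompt: str) -> str:
    """Invoke headless Claude and return its stdout."""
    result = subprocess.run(
        ["claude", "-p", prompt],
        capture_output=True, text=True, timeout=300
    )
    if result.returncode != 0:
        raise RuntimeError(f"claude failed: {result.stderr}")
    return result.stdout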

Hint 2: Parse pytest Output

Use --tb=short for concise errors:

import subprocess

def run_tests():
    result = subprocess.run(
        ["pytest", "--tb=short", "-v", "--cov=src", "--cov-report=json"],
        capture_output=True,
        text=True
    )

    return {
        "passed": result.returncode == 0,
        "output": result.stdout,
        "errors": result.stderr
    }

Hint 3: Use --resume for Session Context

Maintain session across iterations:

class TDDAutomator:
    def __init__(self):
        self.session_id = None

    def generate_initial(self, prompt):
        result = run_claude(prompt, output_format="json")
        self.session_id = result["session_id"]
        return result

    def fix_failures(self, errors):
        return run_claude(
            f"Fix these test failures:\n{errors}",
            continue_session=self.session_id
        )

Hint 4: Set Hard Limits

Prevent infinite loops:

MAX_ITERATIONS = 5
MIN_PROGRESS = 1.0  # Require at least one percentage point of improvement per iteration

for i in range(MAX_ITERATIONS):
    result = run_tests()
    current_coverage = result["coverage"]

    if current_coverage >= target:
        break

    if i > 0:
        improvement = current_coverage - last_coverage
        if improvement < MIN_PROGRESS:
            print("No progress, stopping")
            break

    last_coverage = current_coverage
    fix_failures(result["errors"])

9. Books That Will Help

| Topic | Book | Chapter/Section |
|-------|------|-----------------|
| TDD | "Test Driven Development" by Kent Beck | All (foundational) |
| Python testing | "Python Testing with pytest" by Brian Okken | Ch. 2-5: Writing tests |
| Test design | "xUnit Test Patterns" by Gerard Meszaros | Ch. 4-6: Test patterns |
| Refactoring | "Refactoring" by Martin Fowler | Ch. 4: Building tests |
| Clean tests | "Clean Code" by Robert Martin | Ch. 9: Unit tests |

10. Implementation Guide

10.1 Complete TDD Automator

import subprocess
import json
from pathlib import Path
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TestRunResult:
    passed: bool
    total_tests: int
    passed_tests: int
    failed_tests: int
    coverage: float
    errors: list
    output: str

@dataclass
class IterationResult:
    iteration: int
    tests_fixed: int
    new_coverage: float
    remaining_failures: list

class TDDAutomator:
    def __init__(
        self,
        source_dir: str,
        test_dir: str = "tests",
        target_coverage: float = 80.0,
        max_iterations: int = 5
    ):
        self.source_dir = Path(source_dir)
        self.test_dir = Path(test_dir)
        self.target_coverage = target_coverage
        self.max_iterations = max_iterations
        self.session_id: Optional[str] = None

    def run(self) -> dict:
        """Run the full TDD automation loop."""
        print("\n" + "=" * 60)
        print("AUTOMATED TEST GENERATION")
        print("=" * 60)

        # Phase 1: Analyze
        print("\nPhase 1: Analyzing code...")
        analysis = self.analyze_code()
        print(f"  Files: {analysis['file_count']}")
        print(f"  Functions: {analysis['function_count']}")
        print(f"  Current coverage: {analysis['coverage']:.1f}%")

        if analysis['coverage'] >= self.target_coverage:
            print(f"\nCoverage target already met!")
            return {"success": True, "coverage": analysis['coverage']}

        # Phase 2: Generate initial tests
        print("\nPhase 2: Generating tests...")
        self.generate_initial_tests(analysis)

        # Phase 3-6: Run and iterate
        for iteration in range(self.max_iterations):
            print(f"\nIteration {iteration + 1}/{self.max_iterations}")

            # Run tests
            result = self.run_tests()
            print(f"  Tests: {result.passed_tests}/{result.total_tests} passed")
            print(f"  Coverage: {result.coverage:.1f}%")

            # Check if done
            if result.passed and result.coverage >= self.target_coverage:
                print(f"\nTarget coverage achieved!")
                return {"success": True, "coverage": result.coverage}

            # Check if no failures to fix
            if result.passed and result.coverage < self.target_coverage:
                print("\nAll tests pass but coverage too low.")
                print("Generating additional tests...")
                self.generate_additional_tests(result.coverage)
                continue

            # Fix failures
            print(f"  Fixing {result.failed_tests} failures...")
            self.fix_failures(result.errors)

        # Max iterations reached
        print(f"\nMax iterations reached.")
        final_result = self.run_tests()
        return {
            "success": False,
            "coverage": final_result.coverage,
            "remaining_failures": final_result.errors
        }

    def analyze_code(self) -> dict:
        """Analyze source code for testing opportunities."""
        files = list(self.source_dir.glob("**/*.py"))
        files = [f for f in files if not f.name.startswith("test_")]

        # Get current coverage
        coverage = self._get_coverage()

        # Count functions (simplified)
        function_count = 0
        for f in files:
            content = f.read_text()
            function_count += content.count("def ")

        return {
            "file_count": len(files),
            "function_count": function_count,
            "coverage": coverage,
            "files": [str(f) for f in files]
        }

    def generate_initial_tests(self, analysis: dict):
        """Generate initial test suite."""
        prompt = f"""Generate a comprehensive pytest test suite for these files:

{chr(10).join(analysis['files'])}

For each file, create tests covering:
1. Normal functionality (happy path)
2. Edge cases (empty input, None, boundaries)
3. Error handling (invalid input, exceptions)
4. Integration between functions

Output as complete, runnable Python test files.
Use pytest conventions.
Include necessary imports and fixtures.
"""

        result = self._run_claude(prompt)
        self.session_id = result.get("session_id")

        # Save generated tests
        self._save_tests(result.get("result", ""))

    def generate_additional_tests(self, current_coverage: float):
        """Generate more tests to increase coverage."""
        # Get coverage report to find gaps
        coverage_data = self._get_coverage_details()

        prompt = f"""Current coverage is {current_coverage:.1f}%, target is {self.target_coverage}%.

These lines are not covered:
{json.dumps(coverage_data.get('missing_lines', {}), indent=2)}

Generate additional tests to cover these specific lines.
Focus on the uncovered code paths."""

        result = self._run_claude(prompt, continue_session=True)
        self._save_tests(result.get("result", ""), append=True)

    def run_tests(self) -> TestRunResult:
        """Run pytest and capture results."""
        result = subprocess.run(
            [
                "pytest",
                str(self.test_dir),
                "-v",
                "--tb=short",
                f"--cov={self.source_dir}",
                "--cov-report=json"
            ],
            capture_output=True,
            text=True
        )

        # Parse results
        coverage = self._get_coverage()
        errors = self._parse_failures(result.stdout + result.stderr)

        # Count tests (parse from output)
        passed = result.stdout.count(" PASSED")
        failed = result.stdout.count(" FAILED")

        return TestRunResult(
            passed=result.returncode == 0,
            total_tests=passed + failed,
            passed_tests=passed,
            failed_tests=failed,
            coverage=coverage,
            errors=errors,
            output=result.stdout
        )

    def fix_failures(self, errors: list):
        """Ask Claude to fix failing tests."""
        error_text = "\n\n".join(errors[:5])  # Limit to 5 errors

        prompt = f"""These tests are failing:

{error_text}

Fix each failing test. The issue might be:
1. Test logic is wrong
2. Missing imports or fixtures
3. Incorrect assertions
4. Need for mocking

Provide the corrected test code."""

        result = self._run_claude(prompt, continue_session=True)
        self._update_tests(result.get("result", ""))

    def _run_claude(self, prompt: str, continue_session: bool = False) -> dict:
        """Run Claude headlessly, optionally resuming the stored session."""
        cmd = ["claude", "-p", prompt, "--output-format", "json"]

        if continue_session and self.session_id:
            # Resuming a specific session id requires --resume; --continue takes no id
            cmd.extend(["--resume", self.session_id])

        result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)

        if result.returncode != 0:
            raise Exception(f"Claude error: {result.stderr}")

        return json.loads(result.stdout)

    def _get_coverage(self) -> float:
        """Get current test coverage percentage."""
        try:
            with open("coverage.json") as f:
                data = json.load(f)
                return data.get("totals", {}).get("percent_covered", 0)
        except FileNotFoundError:
            return 0.0

    def _get_coverage_details(self) -> dict:
        """Get detailed coverage information."""
        try:
            with open("coverage.json") as f:
                data = json.load(f)
                missing = {}
                for file, info in data.get("files", {}).items():
                    if info.get("missing_lines"):
                        missing[file] = info["missing_lines"]
                return {"missing_lines": missing}
        except FileNotFoundError:
            return {}

    def _parse_failures(self, output: str) -> list:
        """Extract failure messages from pytest output."""
        failures = []
        lines = output.split("\n")

        in_failure = False
        current_failure = []

        for line in lines:
            if "FAILED" in line:
                in_failure = True
                current_failure = [line]
            elif in_failure:
                if line.startswith("===") or line.startswith("---"):
                    if current_failure:
                        failures.append("\n".join(current_failure))
                    in_failure = False
                    current_failure = []
                else:
                    current_failure.append(line)

        return failures

    def _save_tests(self, content: str, append: bool = False):
        """Save generated test content to files."""
        self.test_dir.mkdir(exist_ok=True)

        # Extract test code from Claude's response
        test_code = self._extract_code(content)

        # Write to test file
        test_file = self.test_dir / "test_generated.py"
        mode = "a" if append else "w"
        with open(test_file, mode) as f:
            f.write(test_code)
            f.write("\n\n")

    def _update_tests(self, content: str):
        """Update test file with fixes."""
        # In practice, you'd parse which tests to update
        # For simplicity, append fixes
        self._save_tests(content, append=True)

    def _extract_code(self, content: str) -> str:
        """Extract Python code from Claude's response."""
        # Look for code blocks
        if "```python" in content:
            start = content.find("```python") + 9
            end = content.find("```", start)
            return content[start:end].strip()
        elif "```" in content:
            start = content.find("```") + 3
            end = content.find("```", start)
            return content[start:end].strip()
        return content


def main():
    import argparse

    parser = argparse.ArgumentParser(description="Automated TDD with Claude")
    parser.add_argument("--source", required=True, help="Source directory")
    parser.add_argument("--target-coverage", type=float, default=80.0)
    parser.add_argument("--max-iterations", type=int, default=5)
    args = parser.parse_args()

    automator = TDDAutomator(
        source_dir=args.source,
        target_coverage=args.target_coverage,
        max_iterations=args.max_iterations
    )

    result = automator.run()

    print("\n" + "=" * 60)
    print("FINAL RESULT")
    print("=" * 60)
    print(f"Success: {result['success']}")
    print(f"Coverage: {result['coverage']:.1f}%")

    if not result['success']:
        print(f"Remaining failures: {len(result.get('remaining_failures', []))}")


if __name__ == "__main__":
    main()

10.2 GitHub Actions Integration

name: Auto Test Generation

on:
  push:
    branches: [main]
  workflow_dispatch:
    inputs:
      target_coverage:
        description: 'Target coverage percentage'
        required: false
        default: '80'

jobs:
  generate-tests:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install pytest pytest-cov
          npm install -g @anthropic-ai/claude-code

      - name: Run TDD Automator
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          python auto_test.py \
            --source ./src \
            --target-coverage ${{ github.event.inputs.target_coverage || '80' }}

      - name: Read coverage percentage
        run: |
          echo "COVERAGE=$(python -c 'import json; print(round(json.load(open("coverage.json"))["totals"]["percent_covered"], 1))')" >> "$GITHUB_ENV"

      - name: Upload coverage report
        uses: codecov/codecov-action@v3
        with:
          file: coverage.json

      - name: Create PR with generated tests
        if: success()
        uses: peter-evans/create-pull-request@v5
        with:
          title: "Auto-generated tests (coverage: ${{ env.COVERAGE }}%)"
          commit-message: "Add auto-generated tests"
          branch: auto-tests
          body: |
            ## Automated Test Generation

            This PR adds auto-generated tests achieving ${{ env.COVERAGE }}% coverage.

            Please review the generated tests for:
            - Correctness of assertions
            - Appropriate edge cases
            - Test clarity and naming

10.3 Prompt Engineering for Better Tests

GENERATION_PROMPT_TEMPLATE = """You are an expert test engineer. Generate comprehensive pytest tests for this code:

```python
{source_code}
```

Requirements:

1. Test Coverage
   - Test every public function/method
   - Include at least 3 tests per function
   - Cover happy path, edge cases, and error handling

2. Test Quality
   - Use descriptive test names: test_<function>_<scenario>_<expected_result>
   - Prefer one assertion per test
   - Use fixtures for setup/teardown
   - Add docstrings explaining what each test verifies

3. Edge Cases to Consider
   - Empty inputs ([], "", None, 0)
   - Boundary values (min, max, off-by-one)
   - Invalid types
   - Large inputs
   - Concurrent access (if applicable)

4. Mocking
   - Mock external dependencies (APIs, databases)
   - Use @pytest.fixture for reusable mocks
   - Use unittest.mock.patch for patching

Output Format:

import pytest
from unittest.mock import Mock, patch
from {module} import {functions}

# Fixtures
@pytest.fixture
def sample_data():
    return {{...}}

# Tests
def test_function_happy_path():
    \"\"\"Verify function works with valid input.\"\"\"
    result = function(valid_input)
    assert result == expected

def test_function_empty_input():
    \"\"\"Verify function handles empty input gracefully.\"\"\"
    result = function([])
    assert result == {{}}

# ... more tests
"""

FIX_FAILURES_PROMPT = """These tests are failing. Fix them:

{failures}

Analysis:

For each failure:

1. Identify if the issue is in the test or the code under test
2. If test issue: fix the test logic, assertions, or setup
3. If code issue: document it but don't change the source code

Output:

Provide the corrected test code. Explain what was wrong and how you fixed it."""
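
Filling the template is then a plain str.format call; a hypothetical usage sketch (paths and names are illustrative):

source = open("src/auth/login.py").read()
prompt = GENERATION_PROMPT_TEMPLATE.format(
    source_code=source,
    module="src.auth.login",
    functions="login, logout, refresh_token",
)
# pass the filled prompt to the headless Claude call from section 10.1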



11. Learning Milestones

| Milestone | Description | Verification |
|-----------|-------------|--------------|
| 1 | Tests generate and save | test_generated.py exists |
| 2 | Tests can be run | pytest runs without crash |
| 3 | Coverage is measured | coverage.json has data |
| 4 | Failures are parsed | Error list extracted |
| 5 | Fixes are applied | Fewer failures next run |
| 6 | Loop terminates correctly | Stops at target or max iterations |
| 7 | CI integration works | GitHub Action runs successfully |


12. Common Pitfalls

12.1 Infinite Loops

# WRONG: No exit condition
while coverage < target:
    fix_failures()
    coverage = run_tests()  # May never improve

# RIGHT: Limit iterations and require progress
last_coverage = -1.0
for i in range(MAX_ITERATIONS):
    coverage = run_tests()
    if coverage >= target:
        break
    if coverage <= last_coverage:
        print("No progress, stopping")
        break
    last_coverage = coverage
    fix_failures()

12.2 Context Loss

# WRONG: New session each time (loses context)
claude("-p", "Fix these tests")
claude("-p", "Now fix these other tests")

# RIGHT: Resume the same session
result = claude("-p", "Generate tests", "--output-format", "json")
session_id = result["session_id"]
claude("-p", "Fix tests", "--resume", session_id)

12.3 Overwhelming Error Context

# WRONG: Send all 50 errors
prompt = f"Fix all these errors:\n{all_errors}"  # Too much!

# RIGHT: Batch errors into small chunks
def chunk_list(items, size):  # split a list into slices of at most `size` items
    return [items[i:i + size] for i in range(0, len(items), size)]

for batch in chunk_list(errors, size=5):
    prompt = f"Fix these errors:\n{batch}"
    fix_batch(prompt)

12.4 Ignoring Test Quality

# WRONG: Only care about coverage
if coverage >= 80:
    print("Done!")

# RIGHT: Also check test quality
if coverage >= 80:
    if all_tests_meaningful():  # Check assertions exist
        if no_flaky_tests():
            print("Done!")

13. Extension Ideas

  1. Mutation testing: Verify tests catch injected bugs
  2. Property-based testing: Generate hypothesis tests (see the Hypothesis sketch after this list)
  3. Visual diff: Show before/after test coverage
  4. Test prioritization: Run most likely to fail first
  5. Multi-language: Support JavaScript (Jest), Go, Rust
  6. Team metrics: Track test generation effectiveness over time
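
For the property-based testing idea, a minimal Hypothesis example (the function under test, normalize_email, is hypothetical):

from hypothesis import given, strategies as st

@given(st.emails())
def test_normalize_email_is_idempotent(address):
    """Normalizing twice should give the same result as normalizing once."""
    assert normalize_email(normalize_email(address)) == normalize_email(address)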

14. Summary

This project teaches you to:

  • Automate the TDD cycle with Claude
  • Generate, run, and iterate on tests
  • Parse test output for feedback
  • Track and improve coverage metrics
  • Integrate with CI/CD pipelines

The key insight is that test generation is an iterative process. The first pass rarely achieves perfect coverage. By maintaining session context and feeding failures back to Claude, you create a learning loop that progressively improves the test suite.

Key takeaway: Automated test generation is a force multiplier, not a replacement for human judgment. Use it to create the initial scaffolding, then review and refine. The goal is faster time-to-coverage, not perfect tests on first try.