Project 12: The Self-Improving Assistant (Agentic Tool-Maker)

Build an assistant that can create new tools for itself: write code, run it in a sandbox, validate output, and register the tool for future use—safely.

Quick Reference

Difficulty: Level 5 (Master)
Time Estimate: 40–60 hours
Language: Python
Prerequisites: Strong tool/agent fundamentals, sandboxing/security mindset, debugging experience
Key Topics: sandboxed code execution, capability gating, self-correction, persistence, security constraints

1. Learning Objectives

By completing this project, you will:

  1. Build a “tool creation loop”: decide need → write code → test → use.
  2. Execute untrusted code in a sandbox with strict resource limits.
  3. Design validation and scoring so the agent doesn’t register broken tools.
  4. Implement persistence for discovered tools with safe metadata.
  5. Prevent common security failures: exfiltration, fork bombs, filesystem abuse.

2. Theoretical Foundation

2.1 Core Concepts

  • Tool-making vs tool-using: Tool-using chooses among known actions; tool-making expands the action space.
  • Sandboxing: Untrusted code must run in isolation (containers/VMs) with limits on CPU, memory, disk, and network.
  • Capability gating: Not all tools should be creatable; define allowed domains (data processing, parsing) and forbidden ones (network scanners).
  • Validation: The agent must test tools against cases and prove outputs meet constraints before registration.
  • Security as product: Recursive agency without guardrails becomes an attack surface.
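To make the sandboxing concept concrete, here is a minimal sketch of running untrusted code in a child process with CPU and memory caps. This is an illustration only, not a real sandbox: it is Unix-only and adds no filesystem or network isolation (use a container/VM for that, as the rest of this guide assumes); the limit values are arbitrary.

```python
import resource
import subprocess
import sys

def run_limited(code: str, timeout_s: int = 5, mem_mb: int = 512) -> subprocess.CompletedProcess:
    """Run untrusted Python code in a child process with CPU/memory caps.

    Unix-only sketch: resource limits stop runaway loops and allocations,
    but provide no filesystem or network isolation.
    """
    def set_limits():
        # Cap CPU seconds and total address space before exec'ing the child.
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))
        resource.setrlimit(resource.RLIMIT_AS, (mem_mb * 2**20, mem_mb * 2**20))

    return subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode (ignores env/site)
        capture_output=True,
        text=True,
        timeout=timeout_s + 1,  # wall-clock backstop on top of the CPU limit
        preexec_fn=set_limits,
    )

result = run_limited("print(sum(range(10)))")
print(result.stdout.strip())
```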

2.2 Why This Matters

This is the frontier of “autonomy”: systems that can extend themselves. Even if you never ship self-writing tools, the sandboxing and safety engineering skills are directly relevant to any tool-using agent.

2.3 Common Misconceptions

  • “Just run code with exec().” That is not a sandbox.
  • “We can trust the model.” Treat model outputs as untrusted input.
  • “If tests pass once, it’s safe.” You need ongoing constraints and monitoring.

3. Project Specification

3.1 What You Will Build

An assistant that:

  • Receives a user task (e.g., “Summarize sentiment in these logs”)
  • Detects missing capability
  • Proposes and generates a new tool (Python module/function)
  • Runs it in a sandbox with tests
  • Registers the tool and uses it to complete the task

3.2 Functional Requirements

  1. Tool registry: existing tools + newly created ones, with metadata.
  2. Tool authoring: generate code with a strict template (inputs/outputs).
  3. Sandbox execution: run tool code with resource limits and captured stdout/stderr.
  4. Validation: auto-generate test cases and run them before registration.
  5. Self-correction: on failure, feed back errors and attempt a bounded fix.
  6. Persistence: store tools on disk with versions and audit trail.

3.3 Non-Functional Requirements

  • Security: no arbitrary network access; filesystem write scope limited.
  • Reliability: avoid infinite retries; timeouts for tool creation and runs.
  • Auditability: store “why tool was created” and “which tests passed”.
  • Governance: a human approval step before saving a new tool (recommended).

3.4 Example Usage / Output

User: Analyze sentiment of these 500 JSON logs.

Assistant: I don’t have a sentiment tool. I will create one.
Plan:
  1) Write parse_jsonl() + compute_sentiment()
  2) Run tests on sample logs
  3) Register tool "sentiment_analyzer_v1"
  4) Run tool on your dataset

4. Solution Architecture

4.1 High-Level Design

┌───────────────┐   task    ┌───────────────────┐
│ User/CLI      │──────────▶│ Tool-Maker Agent  │
└───────────────┘           │ (plan + generate) │
                            └─────────┬─────────┘
                                      │ writes candidate tool
                                      ▼
                            ┌───────────────────┐
                            │ Tool Workspace    │
                            │ (staging)         │
                            └─────────┬─────────┘
                                      │ run
                                      ▼
                            ┌───────────────────┐
                            │ Sandbox Runner    │
                            │ (limits)          │
                            └─────────┬─────────┘
                                      │ results
                                      ▼
                            ┌───────────────────┐
                            │ Validator         │
                            │ (tests + rules)   │
                            └─────────┬─────────┘
                                      │ approve / register
                                      ▼
                            ┌───────────────────┐
                            │ Tool Registry     │
                            └───────────────────┘

4.2 Key Components

  • Tool-maker agent: produces candidate code. Key decision: strict templates + allowed APIs.
  • Sandbox runner: executes untrusted code. Key decision: Docker or E2B, with resource + network limits.
  • Validator: decides whether a tool is acceptable. Key decision: test suite + static checks.
  • Registry: persists tools safely. Key decisions: versioning, metadata, approval flow.

4.3 Data Structures

from dataclasses import dataclass

@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    input_schema: dict
    output_schema: dict
    version: str

@dataclass(frozen=True)
class SandboxResult:
    exit_code: int
    stdout: str
    stderr: str
    runtime_ms: int

4.4 Algorithm Overview

Key Algorithm: create-and-use loop

  1. Detect that current tools can’t solve the task.
  2. Generate a tool spec and implementation skeleton.
  3. Generate tests (unit tests plus property-style checks where possible).
  4. Execute tests in sandbox with strict limits.
  5. If tests pass and policy allows, register tool; otherwise iterate (bounded).
  6. Use the new tool to solve the user task; record trace.

Complexity Analysis:

  • Time: O(tool_gen_attempts × sandbox_runs)
  • Space: O(staged_code + logs + tool registry)
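The create-and-use loop above can be sketched as a bounded retry around generation, sandboxed testing, and policy checks. The callables `generate`, `run_tests`, and `policy_allows` are stand-ins for the LLM, sandbox runner, and policy layer; their signatures and the attempt limit are assumptions for this sketch:

```python
MAX_ATTEMPTS = 3  # bounded iteration (step 5); the exact limit is a design choice

def create_and_use(task, registry, generate, run_tests, policy_allows):
    """Sketch of the create-and-use loop.

    generate(task, feedback) -> candidate tool dict   (steps 2-3)
    run_tests(candidate)     -> {"passed", "stderr"}  (step 4, sandboxed)
    policy_allows(candidate) -> bool                  (step 5, gating)
    """
    feedback = None
    for _ in range(MAX_ATTEMPTS):
        candidate = generate(task, feedback)            # write spec + code + tests
        result = run_tests(candidate)                   # execute in the sandbox
        if result["passed"] and policy_allows(candidate):
            registry[candidate["name"]] = candidate     # step 5: register
            return candidate                            # step 6: caller uses it
        feedback = result["stderr"]                     # feed errors back and retry
    raise RuntimeError(f"gave up after {MAX_ATTEMPTS} attempts")

# Demo with stub callables standing in for the real components.
registry = {}
tool = create_and_use(
    "compute CSV column stats",
    registry,
    generate=lambda task, feedback: {"name": "csv_stats", "code": "# ..."},
    run_tests=lambda cand: {"passed": True, "stderr": ""},
    policy_allows=lambda cand: True,
)
print(tool["name"])
```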

5. Implementation Guide

5.1 Development Environment Setup

pip install pydantic rich

5.2 Project Structure

self-improving-agent/
├── src/
│   ├── cli.py
│   ├── agent.py
│   ├── registry.py
│   ├── templates/
│   ├── sandbox.py
│   ├── validate.py
│   └── policy.py
└── tools/
    └── generated/

5.3 Implementation Phases

Phase 1: Strict tool templates + registry (8–12h)

Goals:

  • Create and load tools from disk (without execution yet).

Tasks:

  1. Define ToolSpec and a file layout for tools (code + spec + tests).
  2. Implement registry load/list and versioning.

Checkpoint: The assistant can list tools and their schemas.
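Registry listing for this checkpoint can be as simple as scanning a directory tree. The `<root>/<name>/<version>/spec.json` layout below is one possible convention, not a requirement:

```python
import json
import tempfile
from pathlib import Path

def list_tools(root: Path) -> list[tuple[str, str]]:
    """List (name, version) pairs by scanning <root>/<name>/<version>/spec.json.

    The directory layout is an assumption for this sketch.
    """
    tools = []
    for spec_file in sorted(root.glob("*/*/spec.json")):
        spec = json.loads(spec_file.read_text())
        tools.append((spec["name"], spec["version"]))
    return tools

# Demo against a throwaway directory.
root = Path(tempfile.mkdtemp())
spec_dir = root / "csv_stats" / "1.0.0"
spec_dir.mkdir(parents=True)
(spec_dir / "spec.json").write_text(json.dumps({"name": "csv_stats", "version": "1.0.0"}))
print(list_tools(root))
```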

Phase 2: Sandbox runner + validation (12–18h)

Goals:

  • Execute tools in a sandbox and validate outputs.

Tasks:

  1. Implement sandbox execution with timeouts and no network.
  2. Run tool tests in sandbox and parse results.
  3. Implement policy checks (disallow imports, file writes, network).

Checkpoint: A manually written tool can be tested and run safely.
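If Docker is the sandbox, the no-network and resource-limit requirements map directly onto standard `docker run` flags. This sketch only builds the command; the image name, mount point, and entry script are assumptions:

```python
import shlex

def docker_cmd(tool_dir: str, timeout_s: int = 10, mem: str = "256m") -> list[str]:
    """Build a `docker run` command for one sandboxed test run.

    Flags are standard Docker options; the image, /work mount point,
    and run_tests.py entry script are assumptions for this sketch.
    """
    return [
        "docker", "run", "--rm",
        "--network", "none",       # no network access at all
        "--memory", mem,           # hard memory cap
        "--cpus", "1",             # CPU cap
        "--read-only",             # root filesystem is read-only
        "--mount", f"type=bind,src={tool_dir},dst=/work",  # only staging dir is writable
        "--workdir", "/work",
        "python:3.12-slim",
        "timeout", str(timeout_s), "python", "run_tests.py",
    ]

print(shlex.join(docker_cmd("/tmp/staging")))
```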

Phase 3: Tool-making loop + bounded self-correction (20–30h)

Goals:

  • Have the agent generate new tools and iterate on failures.

Tasks:

  1. Create prompts that output code + tests in strict templates.
  2. Feed sandbox stderr back into the agent for one or two repair attempts.
  3. Add human approval before registration (recommended default).

Checkpoint: The system can generate a small tool (e.g., CSV stats) and use it.
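A strict template (Task 1) is what makes generated code checkable: the validator can assume one entry point with a known signature. The template below is one possible convention, not a required interface:

```python
# One possible strict template: a single `run(payload) -> dict` entry point
# that the validator can call uniformly. The shape is an assumption.
TOOL_TEMPLATE = '''\
"""Tool: {name}

{description}
"""

def run(payload: dict) -> dict:
    # Single entry point; inputs/outputs are validated against the ToolSpec schemas.
    raise NotImplementedError
'''

def render(name: str, description: str) -> str:
    """Fill the template so the generator only writes the body of run()."""
    return TOOL_TEMPLATE.format(name=name, description=description)

print(render("csv_stats", "Compute column statistics for a CSV file."))
```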

5.4 Key Implementation Decisions

  • Sandbox (Docker vs E2B): use whichever you can run reliably; isolation is mandatory.
  • Approval (auto-register vs manual approve): manual approve, for safety and governance.
  • Validation (tests only vs tests + static rules): both, because tests alone can miss malicious behavior.
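One cheap static rule that complements tests is an AST-based import check. This is defense in depth, not isolation: it is bypassable (e.g. via `__import__` or `importlib`), so the sandbox must still enforce the real boundary. The denylist here is illustrative:

```python
import ast

FORBIDDEN = {"socket", "subprocess", "ctypes", "os"}  # illustrative denylist

def forbidden_imports(source: str) -> set[str]:
    """Static policy check: report denylisted modules imported by tool code.

    Complements tests, which can miss behavior that never runs under them.
    Bypassable by dynamic imports, so the sandbox remains the real boundary.
    """
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found |= {alias.name.split(".")[0] for alias in node.names}
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found & FORBIDDEN

print(sorted(forbidden_imports("import socket\nfrom os import path\nimport json")))
```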

6. Testing Strategy

6.1 Test Categories

  • Unit (registry/policy): tool spec parsing, forbidden imports.
  • Integration (sandbox): no-network enforcement, timeout behavior.
  • Scenario (tool-making): generate a tool, tests pass, registry updates.

6.2 Critical Test Cases

  1. Resource limits: infinite loop tool gets killed by timeout.
  2. Network block: tool tries to fetch a URL and fails.
  3. Filesystem scope: tool can’t write outside staging directory.
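Critical test case 1 can be written directly against the runner. In this sketch the "sandbox" is just `subprocess` with a wall-clock timeout, standing in for whatever runner you build:

```python
import subprocess
import sys

def test_infinite_loop_is_killed() -> bool:
    """Critical test case 1: an infinite loop must be stopped by the timeout.

    Here plain subprocess + timeout stands in for the real sandbox runner.
    """
    try:
        subprocess.run(
            [sys.executable, "-c", "while True: pass"],
            timeout=1,               # runner's wall-clock limit
            capture_output=True,
        )
    except subprocess.TimeoutExpired:
        return True                  # child was killed, as required
    return False                     # loop somehow finished: limit not enforced

print(test_infinite_loop_is_killed())
```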

7. Common Pitfalls & Debugging

  • “Sandbox” isn’t isolated (symptom: tool accesses host files): use real container/VM isolation.
  • Unbounded retries (symptom: runaway cost/time): cap attempts and define clear stop criteria.
  • Weak validation (symptom: tool passes tests but produces wrong output): add golden tests and schema validation.
  • Tool sprawl (symptom: too many similar tools): versioning, consolidation, and deprecation.

Debugging strategies:

  • Keep every attempt artifact (code, tests, stderr) and diff attempts.
  • Start with a narrow allowlist of tool types (parsers, formatters).

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a “tool gallery” UI that previews specs and test status.
  • Add automatic documentation for new tools.

8.2 Intermediate Extensions

  • Add tool usage analytics (which tools are most helpful?).
  • Add “tool refactoring”: merge duplicates under one interface.

8.3 Advanced Extensions

  • Add formal verification-ish checks (static analysis, import restrictions).
  • Add multi-agent tool-making (separate generator, tester, security reviewer).

9. Real-World Connections

9.1 Industry Applications

  • Secure code execution for agent workflows (data transformation, automation).
  • Internal copilots that generate scripts and validate them before use.

9.2 Interview Relevance

  • Sandboxing, capability gating, safe tool execution, and governance.

10. Resources

10.1 Essential Reading

  • AI Engineering (Chip Huyen) — agentic workflows and safety (Ch. 6)
  • Sandbox platform docs (Docker/E2B) and secure execution patterns

10.2 Tools & Documentation

  • Docker security best practices (no privileged containers, seccomp)
  • Python subprocess and resource limits patterns
  • Previous: Project 8 (multi-agent) — separate roles for generation/testing/security
  • Previous: Project 10 (monitoring) — observe and govern a self-extending system

11. Self-Assessment Checklist

  • I can explain why sandboxing is mandatory and what threat model I used.
  • I can show that tools can’t access network or host filesystem.
  • I can validate tool outputs and reject broken/malicious tools.
  • I can govern tool registration with approvals and versioning.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Tool registry + sandbox runner with strict limits
  • Agent can generate a small tool and run tests in sandbox
  • Manual approval gate before tool registration

Full Completion:

  • Bounded self-correction loop using sandbox stderr
  • Persistent tool versioning and audit trail

Excellence (Going Above & Beyond):

  • Multi-agent generation/testing/security review and automated eval suite

This guide was generated from project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY.md. For the complete sprint overview, see project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY/README.md.