Project 12: The Self-Improving Assistant (Agentic Tool-Maker)

Build an assistant that can create new tools for itself: write code, run it in a sandbox, validate output, and register the tool for future use, safely.

Quick Reference

  • Difficulty: Level 5 (Master)
  • Time Estimate: 40–60 hours
  • Language: Python
  • Prerequisites: Strong tool/agent fundamentals, a sandboxing/security mindset, debugging experience
  • Key Topics: sandboxed code execution, capability gating, self-correction, persistence, security constraints

1. Learning Objectives

By completing this project, you will:

  1. Build a "tool creation loop": decide need → write code → test → use.
  2. Execute untrusted code in a sandbox with strict resource limits.
  3. Design validation and scoring so the agent doesn't register broken tools.
  4. Implement persistence for discovered tools with safe metadata.
  5. Prevent common security failures: exfiltration, fork bombs, filesystem abuse.

2. Theoretical Foundation

2.1 Core Concepts

  • Tool-making vs tool-using: Tool-using chooses among known actions; tool-making expands the action space.
  • Sandboxing: Untrusted code must run in isolation (containers/VMs) with limits on CPU, memory, disk, and network.
  • Capability gating: Not all tools should be creatable; define allowed domains (data processing, parsing) and forbidden ones (network scanners). A minimal static-check sketch follows this list.
  • Validation: The agent must test tools against cases and prove outputs meet constraints before registration.
  • Security as product: Recursive agency without guardrails becomes an attack surface.
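
To make capability gating concrete, here is a minimal static-check sketch in Python. The module allowlist and the rejection of dunder attribute access are illustrative assumptions; a real policy would be broader and would still rely on the sandbox as the actual enforcement boundary.

import ast

# Illustrative allowlist; anything else (os, socket, subprocess, ...) is rejected.
ALLOWED_MODULES = {"json", "csv", "re", "math", "statistics", "datetime"}

def check_candidate_code(source: str) -> list[str]:
    """Return a list of policy violations found by static inspection of candidate tool code."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_MODULES:
                    violations.append(f"forbidden import: {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] not in ALLOWED_MODULES:
                violations.append(f"forbidden import: {node.module}")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            # Dunder access is a common route around naive restrictions.
            violations.append(f"suspicious attribute access: {node.attr}")
    return violations

Checks like this catch obvious violations early; they complement the sandbox rather than replace it.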

2.2 Why This Matters

This is the frontier of "autonomy": systems that can extend themselves. Even if you never ship self-writing tools, the sandboxing and safety engineering skills are directly relevant to any tool-using agent.

2.3 Common Misconceptions

  • "Just run code with exec()." That is not a sandbox.
  • "We can trust the model." Treat model outputs as untrusted input.
  • "If tests pass once, it's safe." You need ongoing constraints and monitoring.

3. Project Specification

3.1 What You Will Build

An assistant that:

  • Receives a user task (e.g., "Summarize sentiment in these logs")
  • Detects missing capability
  • Proposes and generates a new tool (Python module/function)
  • Runs it in a sandbox with tests
  • Registers the tool and uses it to complete the task

3.2 Functional Requirements

  1. Tool registry: existing tools + newly created ones, with metadata.
  2. Tool authoring: generate code with a strict template (inputs/outputs); a template sketch follows this list.
  3. Sandbox execution: run tool code with resource limits and captured stdout/stderr.
  4. Validation: auto-generate test cases and run them before registration.
  5. Self-correction: on failure, feed back errors and attempt a bounded fix.
  6. Persistence: store tools on disk with versions and audit trail.
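
To make requirement 2 concrete, one possible template convention (an assumption, not a prescribed format) gives every generated tool a module-level TOOL_SPEC and a single run(payload: dict) -> dict entry point, so the registry can read metadata without executing the code:

# tools/generated/word_count_v1/tool.py -- a hypothetical generated tool following the template
TOOL_SPEC = {
    "name": "word_count",
    "version": "1.0.0",
    "description": "Count words in each input text.",
    "input_schema": {"type": "object", "properties": {"texts": {"type": "array"}}},
    "output_schema": {"type": "object", "properties": {"counts": {"type": "array"}}},
}

def run(payload: dict) -> dict:
    """Single entry point; no network access, no writes outside the staging directory."""
    texts = payload.get("texts", [])
    return {"counts": [len(text.split()) for text in texts]}

The validator can then check that run() output conforms to output_schema before registration.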

3.3 Non-Functional Requirements

  • Security: no arbitrary network access; filesystem write scope limited.
  • Reliability: avoid infinite retries; timeouts for tool creation and runs.
  • Auditability: store "why the tool was created" and "which tests passed".
  • Governance: a human approval step before saving a new tool (recommended).

3.4 Example Usage / Output

User: Analyze sentiment of these 500 JSON logs.

Assistant: I don't have a sentiment tool. I will create one.
Plan:
  1) Write parse_jsonl() + compute_sentiment()
  2) Run tests on sample logs
  3) Register tool "sentiment_analyzer_v1"
  4) Run tool on your dataset

4. Solution Architecture

4.1 High-Level Design

┌──────────────┐   task   ┌───────────────────┐
│ User/CLI     │─────────▶│ Tool-Maker Agent  │
└──────────────┘          │ (plan + generate) │
                          └─────────┬─────────┘
                                    │ writes candidate tool
                                    ▼
                           ┌────────────────┐
                           │ Tool Workspace │
                           │ (staging)      │
                           └────────┬───────┘
                                    │ run
                                    ▼
                           ┌────────────────┐
                           │ Sandbox Runner │
                           │ (limits)       │
                           └────────┬───────┘
                                    │ results
                                    ▼
                           ┌────────────────┐
                           │ Validator      │
                           │ (tests+rules)  │
                           └────────┬───────┘
                                    │ approve/register
                                    ▼
                           ┌────────────────┐
                           │ Tool Registry  │
                           └────────────────┘

4.2 Key Components

  • Tool-maker agent: produces candidate code. Key decisions: strict templates and an allowlist of APIs.
  • Sandbox runner: executes untrusted code. Key decisions: Docker or E2B; resource and network limits.
  • Validator: decides whether a tool is acceptable. Key decisions: test suite plus static checks.
  • Registry: persists tools safely. Key decisions: versioning, metadata, approval flow.

4.3 Data Structures

from dataclasses import dataclass

@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    input_schema: dict
    output_schema: dict
    version: str

@dataclass(frozen=True)
class SandboxResult:
    exit_code: int
    stdout: str
    stderr: str
    runtime_ms: int
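
A brief usage example, continuing from the definitions above (the field values are illustrative): because both dataclasses are frozen, instances are immutable and serialize cleanly for the registry's metadata and audit trail.

import json
from dataclasses import asdict

spec = ToolSpec(
    name="word_count",
    description="Count words in each input text.",
    input_schema={"type": "object"},
    output_schema={"type": "object"},
    version="1.0.0",
)
result = SandboxResult(exit_code=0, stdout="3 passed", stderr="", runtime_ms=412)

# asdict() turns both records into plain dicts, ready for the registry's JSON metadata.
print(json.dumps({"spec": asdict(spec), "last_run": asdict(result)}, indent=2))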

4.4 Algorithm Overview

Key Algorithm: create-and-use loop (a minimal Python sketch follows the numbered steps)

  1. Detect that current tools can't solve the task.
  2. Generate a tool spec and implementation skeleton.
  3. Generate tests (unit tests + property-ish checks where possible).
  4. Execute tests in sandbox with strict limits.
  5. If tests pass and policy allows, register tool; otherwise iterate (bounded).
  6. Use the new tool to solve the user task; record trace.
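
A minimal sketch of this loop is below. The collaborators (llm, sandbox, policy, registry, approve) are hypothetical interfaces standing in for the components in Section 4.1; the point is the bounded iteration and the ordering of checks.

def create_and_use(task, llm, sandbox, policy, registry, approve, max_attempts=3):
    """Bounded create-and-use loop; all collaborators are hypothetical interfaces."""
    feedback = ""
    for _ in range(max_attempts):
        candidate = llm.generate_tool(task, feedback)     # code + spec in the strict template
        tests = llm.generate_tests(task, candidate)       # unit/property-style checks
        problems = policy.violations(candidate.code)      # static checks before any execution
        if problems:
            feedback = "; ".join(problems)
            continue
        result = sandbox.run_tests(candidate, tests)      # isolated run with resource limits
        if result.exit_code != 0:
            feedback = result.stderr                      # feed errors into the next attempt
            continue
        if not approve(candidate):                        # human approval gate (recommended)
            raise PermissionError("candidate tool rejected by reviewer")
        registry.register(candidate, tests, result)       # persist code, spec, tests, audit trail
        return sandbox.run_tool(candidate, task)
    raise RuntimeError(f"no working tool after {max_attempts} attempts")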

Complexity Analysis:

  • Time: O(tool_gen_attempts × sandbox_runs)
  • Space: O(staged_code + logs + tool registry)

5. Implementation Guide

5.1 Development Environment Setup

pip install pydantic rich

5.2 Project Structure

self-improving-agent/
โ”œโ”€โ”€ src/
โ”‚   โ”œโ”€โ”€ cli.py
โ”‚   โ”œโ”€โ”€ agent.py
โ”‚   โ”œโ”€โ”€ registry.py
โ”‚   โ”œโ”€โ”€ templates/
โ”‚   โ”œโ”€โ”€ sandbox.py
โ”‚   โ”œโ”€โ”€ validate.py
โ”‚   โ””โ”€โ”€ policy.py
โ””โ”€โ”€ tools/
    โ””โ”€โ”€ generated/

5.3 Implementation Phases

Phase 1: Strict tool templates + registry (8–12h)

Goals:

  • Create and load tools from disk (without execution yet).

Tasks:

  1. Define ToolSpec and a file layout for tools (code + spec + tests).
  2. Implement registry load/list and versioning; a loader sketch follows the checkpoint.

Checkpoint: The assistant can list tools and their schemas.
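
A sketch of the Phase 1 loader, assuming each tool lives in its own directory under tools/generated/ with a spec.json beside tool.py and its tests (a layout consistent with Section 5.2, but still an assumption):

import json
from pathlib import Path

TOOLS_DIR = Path("tools/generated")  # assumed location, matching the project structure above

def load_specs(tools_dir: Path = TOOLS_DIR) -> list[dict]:
    """Read every tool's spec.json without importing or executing any tool code."""
    specs = []
    for spec_file in sorted(tools_dir.glob("*/spec.json")):
        spec = json.loads(spec_file.read_text())
        spec["path"] = str(spec_file.parent)  # directory holding tool.py and its tests
        specs.append(spec)
    return specs

def list_tools(tools_dir: Path = TOOLS_DIR) -> None:
    """Print name, version, and description for each registered tool."""
    for spec in load_specs(tools_dir):
        print(f"{spec['name']} v{spec['version']}: {spec['description']}")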

Phase 2: Sandbox runner + validation (12–18h)

Goals:

  • Execute tools in a sandbox and validate outputs.

Tasks:

  1. Implement sandbox execution with timeouts and no network; a runner sketch follows the checkpoint below.
  2. Run tool tests in sandbox and parse results.
  3. Implement policy checks (disallow imports, file writes, network).

Checkpoint: A manually written tool can be tested and run safely.
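
A sketch of a runner using subprocess with a wall-clock timeout and POSIX resource limits. This is not a substitute for container/VM isolation, which is also what provides the no-network and filesystem-scope guarantees (e.g. by running the container without a network); it only shows the limit-and-capture pattern.

import resource  # POSIX only
import subprocess
import sys
import time

def _limit_resources():
    """Runs in the child process just before exec: cap CPU time and address space."""
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))                          # 5 s of CPU
    resource.setrlimit(resource.RLIMIT_AS, (512 * 1024**2, 512 * 1024**2))   # 512 MB of memory

def run_in_sandbox(script_path: str, timeout_s: int = 10):
    """Run a tool script in a separate process; returns (exit_code, stdout, stderr, runtime_ms),
    which maps onto SandboxResult from Section 4.3."""
    start = time.monotonic()
    try:
        proc = subprocess.run(
            [sys.executable, "-I", script_path],  # -I: isolated mode (ignores env vars, user site)
            capture_output=True,
            text=True,
            timeout=timeout_s,
            preexec_fn=_limit_resources,
        )
        exit_code, stdout, stderr = proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        exit_code, stdout, stderr = -1, "", "killed: wall-clock timeout exceeded"
    runtime_ms = int((time.monotonic() - start) * 1000)
    return exit_code, stdout, stderr, runtime_ms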

Phase 3: Tool-making loop + bounded self-correction (20–30h)

Goals:

  • Have the agent generate new tools and iterate on failures.

Tasks:

  1. Create prompts that output code + tests in strict templates.
  2. Feed sandbox stderr back into the agent for one or two repair attempts (a repair-loop sketch follows the checkpoint).
  3. Add human approval before registration (recommended default).

Checkpoint: The system can generate a small tool (e.g., CSV stats) and use it.
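
One way to implement task 2: feed the captured stderr back into a repair prompt and retry a bounded number of times. The prompt wording and the call_llm / run_tests hooks are assumptions.

REPAIR_PROMPT = """The tool you wrote failed its tests in the sandbox.

--- tool code ---
{code}

--- test stderr ---
{stderr}

Return a corrected version of the tool, keeping the same TOOL_SPEC and run() signature.
"""

def repair_tool(code: str, stderr: str, call_llm, run_tests, max_repairs: int = 2):
    """Bounded repair loop; call_llm and run_tests are hypothetical hooks
    (run_tests re-runs the generated tests in the sandbox and returns (ok, stderr))."""
    for _ in range(max_repairs):
        code = call_llm(REPAIR_PROMPT.format(code=code, stderr=stderr))
        ok, stderr = run_tests(code)
        if ok:
            return code
    return None  # give up and surface the failure instead of retrying forever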

5.4 Key Implementation Decisions

  • Sandbox: Docker vs E2B. Recommendation: whichever you can run reliably; isolation is mandatory.
  • Approval: auto-register vs manual approve. Recommendation: manual approve, for safety and governance.
  • Validation: tests only vs tests plus static rules. Recommendation: both, because tests alone miss malicious behavior.

6. Testing Strategy

6.1 Test Categories

  • Unit (registry, policy): tool spec parsing, forbidden-import detection
  • Integration (sandbox): no-network enforcement, timeout behavior
  • Scenario (tool-making): generate a tool, its tests pass, the registry updates

6.2 Critical Test Cases

  1. Resource limits: an infinite-loop tool gets killed by the timeout (see the test sketch after this list).
  2. Network block: a tool that tries to fetch a URL fails.
  3. Filesystem scope: a tool can't write outside the staging directory.
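
A test for the first case, assuming the run_in_sandbox helper sketched in Phase 2 and pytest's tmp_path fixture:

# tests/test_sandbox_limits.py (assumed path)
import textwrap

from src.sandbox import run_in_sandbox  # assumed import path

def test_infinite_loop_is_killed(tmp_path):
    # A tool that never terminates must be stopped by the wall-clock timeout.
    script = tmp_path / "spin.py"
    script.write_text(textwrap.dedent("""
        while True:
            pass
    """))
    exit_code, _, stderr, runtime_ms = run_in_sandbox(str(script), timeout_s=2)
    assert exit_code != 0
    assert "timeout" in stderr.lower()
    assert runtime_ms < 10_000  # the run was cut short rather than left to spin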

7. Common Pitfalls & Debugging

  • "Sandbox" isn't isolated: the tool can access host files. Fix: use real container/VM isolation.
  • Unbounded retries: runaway cost and time. Fix: cap attempts and define clear stop criteria.
  • Weak validation: the tool passes tests but gives wrong results. Fix: add golden tests and schema validation.
  • Tool sprawl: too many near-duplicate tools. Fix: versioning, consolidation, and deprecation.

Debugging strategies:

  • Keep every attempt artifact (code, tests, stderr) and diff attempts.
  • Start with a narrow allowlist of tool types (parsers, formatters).

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a "tool gallery" UI that previews specs and test status.
  • Add automatic documentation for new tools.

8.2 Intermediate Extensions

  • Add tool usage analytics (which tools are most helpful?).
  • Add "tool refactoring": merge duplicates under one interface.

8.3 Advanced Extensions

  • Add formal verification-ish checks (static analysis, import restrictions).
  • Add multi-agent tool-making (separate generator, tester, security reviewer).

9. Real-World Connections

9.1 Industry Applications

  • Secure code execution for agent workflows (data transformation, automation).
  • Internal copilots that generate scripts and validate them before use.

9.3 Interview Relevance

  • Sandboxing, capability gating, safe tool execution, and governance.

10. Resources

10.1 Essential Reading

  • AI Engineering (Chip Huyen): agentic workflows and safety (Ch. 6)
  • Sandbox platform docs (Docker/E2B) and secure execution patterns

10.3 Tools & Documentation

  • Docker security best practices (no privileged containers, seccomp)
  • Python subprocess and resource limits patterns
  • Previous: Project 8 (multi-agent): separate roles for generation/testing/security
  • Previous: Project 10 (monitoring): observe and govern a self-extending system

11. Self-Assessment Checklist

  • I can explain why sandboxing is mandatory and what threat model I used.
  • I can show that tools can't access the network or the host filesystem.
  • I can validate tool outputs and reject broken/malicious tools.
  • I can govern tool registration with approvals and versioning.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Tool registry + sandbox runner with strict limits
  • Agent can generate a small tool and run tests in sandbox
  • Manual approval gate before tool registration

Full Completion:

  • Bounded self-correction loop using sandbox stderr
  • Persistent tool versioning and audit trail

Excellence (Going Above & Beyond):

  • Multi-agent generation/testing/security review and automated eval suite

This guide was generated from project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY.md. For the complete sprint overview, see project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY/README.md.