Project 12: The Self-Improving Assistant (Agentic Tool-Maker)
Build an assistant that can create new tools for itself: write code, run it in a sandbox, validate output, and register the tool for future use—safely.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 5: Master |
| Time Estimate | 40–60 hours |
| Language | Python |
| Prerequisites | Strong tool/agent fundamentals, sandboxing/security mindset, debugging experience |
| Key Topics | sandboxed code execution, capability gating, self-correction, persistence, security constraints |
1. Learning Objectives
By completing this project, you will:
- Build a “tool creation loop”: decide need → write code → test → use.
- Execute untrusted code in a sandbox with strict resource limits.
- Design validation and scoring so the agent doesn’t register broken tools.
- Implement persistence for discovered tools with safe metadata.
- Prevent common security failures: exfiltration, fork bombs, filesystem abuse.
2. Theoretical Foundation
2.1 Core Concepts
- Tool-making vs tool-using: Tool-using chooses among known actions; tool-making expands the action space.
- Sandboxing: Untrusted code must run in isolation (containers/VMs) with limits on CPU, memory, disk, and network.
- Capability gating: Not all tools should be creatable; define allowed domains (data processing, parsing) and forbidden ones (network scanners).
- Validation: The agent must test tools against cases and prove outputs meet constraints before registration.
- Security as product: Recursive agency without guardrails becomes an attack surface.
2.2 Why This Matters
This is the frontier of “autonomy”: systems that can extend themselves. Even if you never ship self-writing tools, the sandboxing and safety engineering skills are directly relevant to any tool-using agent.
2.3 Common Misconceptions
- “Just run code with exec().” That is not a sandbox: exec() runs untrusted code inside your own process, with full host access.
- “We can trust the model.” Treat model outputs as untrusted input.
- “If tests pass once, it’s safe.” You need ongoing constraints and monitoring.
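To make the first misconception concrete, here is a minimal sketch (variable names are illustrative) showing that exec() gives candidate code the full power of the host process:

```python
# exec() runs untrusted code in-process: it can read files, environment
# variables, and import anything. Nothing here is a security boundary.
import os

captured = {}
untrusted_code = "import os; secrets = dict(os.environ)"
exec(untrusted_code, captured)  # nothing stops the environment read

assert "secrets" in captured              # the "sandboxed" code saw host env
assert isinstance(captured["secrets"], dict)
```

This is why real isolation (a container or VM, with resource and network limits) is non-negotiable.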
3. Project Specification
3.1 What You Will Build
An assistant that:
- Receives a user task (e.g., “Summarize sentiment in these logs”)
- Detects missing capability
- Proposes and generates a new tool (Python module/function)
- Runs it in a sandbox with tests
- Registers the tool and uses it to complete the task
3.2 Functional Requirements
- Tool registry: existing tools + newly created ones, with metadata.
- Tool authoring: generate code with a strict template (inputs/outputs).
- Sandbox execution: run tool code with resource limits and captured stdout/stderr.
- Validation: auto-generate test cases and run them before registration.
- Self-correction: on failure, feed back errors and attempt a bounded fix.
- Persistence: store tools on disk with versions and audit trail.
3.3 Non-Functional Requirements
- Security: no arbitrary network access; filesystem write scope limited.
- Reliability: avoid infinite retries; timeouts for tool creation and runs.
- Auditability: store “why tool was created” and “which tests passed”.
- Governance: a human approval step before saving a new tool (recommended).
3.4 Example Usage / Output
```
User: Analyze sentiment of these 500 JSON logs.
Assistant: I don’t have a sentiment tool. I will create one.
Plan:
  1) Write parse_jsonl() + compute_sentiment()
  2) Run tests on sample logs
  3) Register tool "sentiment_analyzer_v1"
  4) Run tool on your dataset
```
4. Solution Architecture
4.1 High-Level Design
```
┌──────────────┐   task   ┌───────────────────┐
│   User/CLI   │─────────▶│ Tool-Maker Agent  │
└──────────────┘          │ (plan + generate) │
                          └─────────┬─────────┘
                                    │ writes candidate tool
                                    ▼
                          ┌───────────────────┐
                          │  Tool Workspace   │
                          │     (staging)     │
                          └─────────┬─────────┘
                                    │ run
                                    ▼
                          ┌───────────────────┐
                          │  Sandbox Runner   │
                          │     (limits)      │
                          └─────────┬─────────┘
                                    │ results
                                    ▼
                          ┌───────────────────┐
                          │     Validator     │
                          │   (tests+rules)   │
                          └─────────┬─────────┘
                                    │ approve/register
                                    ▼
                          ┌───────────────────┐
                          │   Tool Registry   │
                          └───────────────────┘
```
4.2 Key Components
| Component | Responsibility | Key Decisions |
|---|---|---|
| Tool-maker agent | produce candidate code | strict templates + allowed APIs |
| Sandbox runner | execute untrusted code | Docker/E2B; resource + network limits |
| Validator | decide if tool is acceptable | test suite + static checks |
| Registry | persist tools safely | versioning, metadata, approval flow |
4.3 Data Structures
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolSpec:
    name: str
    description: str
    input_schema: dict
    output_schema: dict
    version: str

@dataclass(frozen=True)
class SandboxResult:
    exit_code: int
    stdout: str
    stderr: str
    runtime_ms: int
```
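One way to persist a ToolSpec on disk is as a spec.json next to the tool’s code, matching the staged file layout sketched in Phase 1. The directory and file names below are illustrative assumptions:

```python
# Persist a ToolSpec as spec.json inside a per-tool directory.
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path

@dataclass(frozen=True)
class ToolSpec:  # same shape as defined above
    name: str
    description: str
    input_schema: dict
    output_schema: dict
    version: str

spec = ToolSpec(
    name="sentiment_analyzer",
    description="Score sentiment of log lines",
    input_schema={"type": "array", "items": {"type": "string"}},
    output_schema={"type": "object"},
    version="1.0.0",
)

tool_dir = Path(tempfile.mkdtemp()) / f"{spec.name}_v{spec.version}"
tool_dir.mkdir(parents=True)
(tool_dir / "spec.json").write_text(json.dumps(asdict(spec), indent=2))

loaded = json.loads((tool_dir / "spec.json").read_text())
assert loaded["name"] == "sentiment_analyzer"
```

Storing the spec separately from the code lets the registry list tools without ever importing or executing them.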
4.4 Algorithm Overview
Key Algorithm: create-and-use loop
- Detect that current tools can’t solve the task.
- Generate a tool spec and implementation skeleton.
- Generate tests (unit tests + property-ish checks where possible).
- Execute tests in sandbox with strict limits.
- If tests pass and policy allows, register tool; otherwise iterate (bounded).
- Use the new tool to solve the user task; record trace.
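The loop above can be sketched as a bounded retry skeleton. The three helpers here are stand-in stubs; a real system would call the model and the sandbox runner:

```python
# Bounded create-and-use loop. generate_tool, run_tests_in_sandbox,
# and register_tool are stubs standing in for real components.
MAX_ATTEMPTS = 3

def generate_tool(task, feedback=None):
    return {"code": "def run(x): return x", "feedback_seen": feedback}

def run_tests_in_sandbox(candidate):
    return {"passed": True, "stderr": ""}

def register_tool(candidate):
    return "tool_v1"

def create_and_use(task):
    feedback = None
    for attempt in range(MAX_ATTEMPTS):
        candidate = generate_tool(task, feedback)
        result = run_tests_in_sandbox(candidate)
        if result["passed"]:
            return register_tool(candidate)
        feedback = result["stderr"]  # feed errors back for repair
    raise RuntimeError("tool creation failed after bounded retries")

assert create_and_use("summarize sentiment") == "tool_v1"
```

The hard stop after MAX_ATTEMPTS is what keeps cost and time bounded when generation keeps failing.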
Complexity Analysis:
- Time: O(tool_gen_attempts × sandbox_runs)
- Space: O(staged_code + logs + tool registry)
5. Implementation Guide
5.1 Development Environment Setup
```bash
pip install pydantic rich
```
5.2 Project Structure
```
self-improving-agent/
├── src/
│   ├── cli.py
│   ├── agent.py
│   ├── registry.py
│   ├── templates/
│   ├── sandbox.py
│   ├── validate.py
│   └── policy.py
└── tools/
    └── generated/
```
5.3 Implementation Phases
Phase 1: Strict tool templates + registry (8–12h)
Goals:
- Create and load tools from disk (without execution yet).
Tasks:
- Define ToolSpec and a file layout for tools (code + spec + tests).
- Implement registry load/list and versioning.
Checkpoint: The assistant can list tools and their schemas.
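The Phase 1 checkpoint might look like the sketch below: a registry that lists tools by scanning a directory of per-tool folders, each holding a spec.json, without executing any tool code. The layout is an assumption:

```python
# List tools from disk by reading spec.json files only - no execution.
import json
import tempfile
from pathlib import Path

def list_tools(root: Path) -> list[dict]:
    return [json.loads(f.read_text()) for f in sorted(root.glob("*/spec.json"))]

# Stage two fake tools to demonstrate the checkpoint behaviour.
root = Path(tempfile.mkdtemp())
for name in ("csv_stats", "jsonl_parse"):
    d = root / f"{name}_v1"
    d.mkdir()
    (d / "spec.json").write_text(json.dumps({"name": name, "version": "1"}))

names = [s["name"] for s in list_tools(root)]
assert names == ["csv_stats", "jsonl_parse"]
```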
Phase 2: Sandbox runner + validation (12–18h)
Goals:
- Execute tools in a sandbox and validate outputs.
Tasks:
- Implement sandbox execution with timeouts and no network.
- Run tool tests in sandbox and parse results.
- Implement policy checks (disallow imports, file writes, network).
Checkpoint: A manually written tool can be tested and run safely.
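A minimal subprocess-based runner (Unix-only, since it uses the resource module) can enforce a wall-clock timeout plus CPU and memory rlimits. This is defence in depth, not a full sandbox: real isolation still needs a container or VM, and network blocking is done at that layer (e.g. Docker’s --network=none):

```python
# Run candidate code in a child Python process with rlimits + timeout.
import resource
import subprocess
import sys
import time

def run_sandboxed(code: str, timeout_s: float = 5.0) -> dict:
    def limits():
        resource.setrlimit(resource.RLIMIT_CPU, (5, 5))                 # 5s CPU
        resource.setrlimit(resource.RLIMIT_AS, (256 << 20, 256 << 20))  # 256 MiB
    start = time.monotonic()
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            capture_output=True, text=True,
            timeout=timeout_s, preexec_fn=limits,
        )
        exit_code, out, err = proc.returncode, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        exit_code, out, err = -1, "", "timeout"
    return {"exit_code": exit_code, "stdout": out, "stderr": err,
            "runtime_ms": int((time.monotonic() - start) * 1000)}

ok = run_sandboxed("print(2 + 2)")
assert ok["exit_code"] == 0 and ok["stdout"].strip() == "4"
```

Captured stdout/stderr and runtime map directly onto the SandboxResult structure from section 4.3.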
Phase 3: Tool-making loop + bounded self-correction (20–30h)
Goals:
- Have the agent generate new tools and iterate on failures.
Tasks:
- Create prompts that output code + tests in strict templates.
- Feed sandbox stderr back into the agent for one or two repair attempts.
- Add human approval before registration (recommended default).
Checkpoint: The system can generate a small tool (e.g., CSV stats) and use it.
5.4 Key Implementation Decisions
| Decision | Options | Recommendation | Rationale |
|---|---|---|---|
| Sandbox | Docker vs E2B | whichever you can run reliably | isolation is mandatory |
| Approval | auto-register vs manual approve | manual approve | safety and governance |
| Validation | tests only vs tests + static rules | both | tests miss malicious behavior |
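One simple static rule from the table above, sketched with the stdlib ast module: walk the candidate’s syntax tree and reject forbidden imports before anything runs. The blocklist is illustrative; a production policy would likely use an allowlist instead:

```python
# Static policy check: find forbidden imports without executing the code.
import ast

FORBIDDEN = {"socket", "subprocess", "ctypes", "urllib", "requests"}

def forbidden_imports(code: str) -> set[str]:
    found = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            found |= {a.name.split(".")[0] for a in node.names}
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found & FORBIDDEN

assert forbidden_imports("import json\nimport socket") == {"socket"}
assert forbidden_imports("from urllib.request import urlopen") == {"urllib"}
assert forbidden_imports("import csv") == set()
```

Static checks like this complement tests: a malicious tool can pass its own test suite while still trying to open a socket.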
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | registry/policy | tool spec parsing, forbidden imports |
| Integration | sandbox | no-network enforcement, timeout behavior |
| Scenario | tool-making | generate tool, tests pass, registry update |
6.2 Critical Test Cases
- Resource limits: infinite loop tool gets killed by timeout.
- Network block: tool tries to fetch a URL and fails.
- Filesystem scope: tool can’t write outside staging directory.
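The first critical test can be expressed as a concrete check: a tool that loops forever must be killed by the wall-clock timeout. The runner is inlined here so the test stands alone:

```python
# Verify a runaway tool is terminated by the subprocess timeout.
import subprocess
import sys
import time

def check_runaway_killed(timeout_s: float = 1.0) -> tuple[bool, float]:
    start = time.monotonic()
    try:
        subprocess.run([sys.executable, "-c", "while True: pass"],
                       capture_output=True, timeout=timeout_s)
        killed = False
    except subprocess.TimeoutExpired:
        killed = True
    return killed, time.monotonic() - start

killed, elapsed = check_runaway_killed()
assert killed, "runaway tool was not terminated"
assert elapsed < 5.0, "timeout took too long to fire"
```

The network and filesystem cases need the real sandbox in the loop, so they belong in the integration suite against your container configuration.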
7. Common Pitfalls & Debugging
| Pitfall | Symptom | Solution |
|---|---|---|
| “Sandbox” isn’t isolated | tool accesses host files | use real container/VM isolation |
| Unbounded retries | runaway cost/time | max attempts + clear stop criteria |
| Weak validation | tool passes but wrong | add golden tests and schema validation |
| Tool sprawl | too many similar tools | versioning + consolidation + deprecation |
Debugging strategies:
- Keep every attempt artifact (code, tests, stderr) and diff attempts.
- Start with a narrow allowlist of tool types (parsers, formatters).
8. Extensions & Challenges
8.1 Beginner Extensions
- Add a “tool gallery” UI that previews specs and test status.
- Add automatic documentation for new tools.
8.2 Intermediate Extensions
- Add tool usage analytics (which tools are most helpful?).
- Add “tool refactoring”: merge duplicates under one interface.
8.3 Advanced Extensions
- Add formal verification-ish checks (static analysis, import restrictions).
- Add multi-agent tool-making (separate generator, tester, security reviewer).
9. Real-World Connections
9.1 Industry Applications
- Secure code execution for agent workflows (data transformation, automation).
- Internal copilots that generate scripts and validate them before use.
9.2 Interview Relevance
- Sandboxing, capability gating, safe tool execution, and governance.
10. Resources
10.1 Essential Reading
- AI Engineering (Chip Huyen) — agentic workflows and safety (Ch. 6)
- Sandbox platform docs (Docker/E2B) and secure execution patterns
10.2 Tools & Documentation
- Docker security best practices (no privileged containers, seccomp)
- Python subprocess and resource limits patterns
10.3 Related Projects in This Series
- Previous: Project 8 (multi-agent) — separate roles for generation/testing/security
- Previous: Project 10 (monitoring) — observe and govern a self-extending system
11. Self-Assessment Checklist
- I can explain why sandboxing is mandatory and what threat model I used.
- I can show that tools can’t access network or host filesystem.
- I can validate tool outputs and reject broken/malicious tools.
- I can govern tool registration with approvals and versioning.
12. Submission / Completion Criteria
Minimum Viable Completion:
- Tool registry + sandbox runner with strict limits
- Agent can generate a small tool and run tests in sandbox
- Manual approval gate before tool registration
Full Completion:
- Bounded self-correction loop using sandbox stderr
- Persistent tool versioning and audit trail
Excellence (Going Above & Beyond):
- Multi-agent generation/testing/security review and automated eval suite
This guide was generated from project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY.md. For the complete sprint overview, see project_based_ideas/AI_PERSONAL_ASSISTANTS_MASTERY/README.md.