Project 14: Production Memory Service
Build a production-ready memory service with multi-tenancy, rate limiting, observability, and deployment configuration for real-world AI applications.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 5: Master |
| Time Estimate | 3-4 weeks (40-50 hours) |
| Language | Python |
| Prerequisites | Projects 1-13, DevOps, Docker, API design |
| Key Topics | Multi-tenancy, rate limiting, observability, caching, deployment, API versioning, security |
1. Learning Objectives
By completing this project, you will:
- Design multi-tenant memory architecture with data isolation.
- Implement rate limiting and quota management.
- Build comprehensive observability (metrics, logs, traces).
- Create production deployment with Docker/K8s.
- Handle security, authentication, and API versioning.
2. Theoretical Foundation
2.1 Core Concepts
- Multi-Tenancy: Serving multiple customers from a single deployment with data isolation.
- Rate Limiting: Protecting the service from abuse and ensuring fair resource allocation.
- Observability: Metrics (numbers), logs (events), traces (request flows) for understanding system behavior.
- Caching: Reducing latency and load by storing frequently accessed data.
- API Versioning: Evolving the API without breaking existing clients.
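The caching concept above can be sketched with the cache-aside pattern: check the cache first, and only on a miss call the expensive backend. This is a minimal in-process sketch; in this project's architecture the cache would be Redis and the backend a ChromaDB query. The `expensive_vector_search` function is a hypothetical stand-in.

```python
import time

# Hypothetical in-process cache standing in for Redis: key -> (value, expiry).
_cache: dict[str, tuple[object, float]] = {}


def expensive_vector_search(query: str) -> list[str]:
    # Stand-in for the real ChromaDB query.
    return [f"result for {query}"]


def cached_search(query: str, ttl_seconds: float = 60.0):
    """Cache-aside: serve from cache on a fresh hit, else compute and store."""
    entry = _cache.get(query)
    if entry is not None and entry[1] > time.monotonic():
        return entry[0]  # hit: skip the expensive backend call
    result = expensive_vector_search(query)
    _cache[query] = (result, time.monotonic() + ttl_seconds)
    return result


hit = cached_search("python")    # miss: computed and stored
again = cached_search("python")  # hit: served from cache
```

The TTL bounds staleness: after `ttl_seconds`, the entry expires and the next call recomputes.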
2.2 Why This Matters
A memory system that works in development often fails in production:
- One customer’s traffic shouldn’t affect others
- Runaway requests shouldn’t crash the system
- When something breaks, you need to know why
- Deployment should be repeatable and safe
2.3 Common Misconceptions
- “Multi-tenancy is just adding user_id.” True isolation requires careful architecture.
- “Rate limiting is optional.” Without it, one bad actor can take down your service.
- “Logs are enough.” You need metrics for trends and traces for debugging.
2.4 ASCII Diagram: Production Architecture
PRODUCTION MEMORY SERVICE ARCHITECTURE
══════════════════════════════════════════════════════════════
CLIENTS
│
▼
┌─────────────────────────────────────────────────────────────┐
│ LOAD BALANCER │
│ (nginx / ALB) │
└───────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ API GATEWAY │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Auth │ │ Rate │ │ Request │ │
│ │ Middleware │─▶│ Limiter │─▶│ Router │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ OBSERVABILITY LAYER │ │
│ │ • Request logging (structured JSON) │ │
│ │ • Metrics emission (latency, errors, throughput) │ │
│ │ • Trace propagation (request IDs) │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└───────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ MEMORY SERVICE │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ API VERSIONS │ │
│ │ /v1/memories → V1Handler (stable) │ │
│ │ /v2/memories → V2Handler (beta) │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ TENANT ISOLATION │ │
│ │ │ │
│ │ Request: tenant_id = "acme_corp" │ │
│ │ → Route to tenant-specific: │ │
│ │ • Database namespace │ │
│ │ • Cache partition │ │
│ │ • Rate limit bucket │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────┴─────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Cache │ │ Memory │ │
│ │ Layer │ │ Core │ │
│ │ (Redis) │◀────────────▶│ (Projects │ │
│ └─────────────┘ │ 1-13) │ │
│ └─────────────┘ │
│ │ │
└─────────────────────────────────────────┼────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ DATA STORES │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Neo4j │ │ ChromaDB │ │ PostgreSQL │ │
│ │ (Graph) │ │ (Vectors) │ │ (Metadata) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Data Isolation: │
│ • Tenant "acme": graph_acme, chroma_acme, schema acme │
│ • Tenant "beta": graph_beta, chroma_beta, schema beta │
│ │
└─────────────────────────────────────────────────────────────┘
OBSERVABILITY STACK
═══════════════════
┌─────────────────────────────────────────────────────────────┐
│ │
│ METRICS (Prometheus + Grafana) │
│ ───────────────────────────── │
│ • memory_requests_total{tenant, method, status} │
│ • memory_request_duration_seconds{tenant, method} │
│ • memory_entities_count{tenant} │
│ • memory_cache_hit_ratio{tenant} │
│ │
│ LOGS (ELK or Loki) │
│ ───────────────── │
│ { │
│ "timestamp": "2024-12-15T10:30:00Z", │
│ "level": "info", │
│ "tenant_id": "acme", │
│ "request_id": "req_abc123", │
│ "method": "search", │
│ "latency_ms": 45, │
│ "entities_returned": 5 │
│ } │
│ │
│ TRACES (Jaeger) │
│ ────────────── │
│ request_id: req_abc123 │
│ ├── api_gateway (5ms) │
│ ├── rate_limiter (1ms) │
│ ├── memory_service (35ms) │
│ │ ├── cache_lookup (2ms) → MISS │
│ │ ├── vector_search (20ms) │
│ │ └── graph_traverse (10ms) │
│ └── response (2ms) │
│ │
└─────────────────────────────────────────────────────────────┘
RATE LIMITING STRATEGY
═════════════════════
┌─────────────────────────────────────────────────────────────┐
│ Tier: Free │
│ ───────── │
│ • 100 requests/minute per tenant │
│ • 1,000 memories maximum │
│ • No priority queue │
├─────────────────────────────────────────────────────────────┤
│ Tier: Pro │
│ ───────── │
│ • 1,000 requests/minute per tenant │
│ • 100,000 memories maximum │
│ • Priority queue access │
├─────────────────────────────────────────────────────────────┤
│ Tier: Enterprise │
│ ────────────── │
│ • Custom limits │
│ • Dedicated resources │
│ • SLA guarantees │
└─────────────────────────────────────────────────────────────┘
Algorithm: Token Bucket
- Each tenant has a bucket with capacity C
- Tokens replenish at rate R per second
- Request consumes 1 token
- If bucket empty → 429 Too Many Requests
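The four rules above translate almost directly into code. This is a minimal single-process sketch; in this project the bucket state would live in Redis so that limits are shared across service instances.

```python
import time


class TokenBucket:
    """Token bucket: capacity C, refilled at R tokens per second."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; False means respond with 429."""
        now = time.monotonic()
        # Replenish tokens accrued since the last check, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Toy limits for illustration: burst of 5, refilling 1 token/second.
bucket = TokenBucket(capacity=5, refill_rate=1.0)
results = [bucket.allow() for _ in range(7)]
# The first 5 requests pass; the rest are rejected until tokens refill.
```

Because the bucket starts full, tenants can burst up to the capacity, then settle to the sustained rate R — which is why token buckets are friendlier to real traffic than fixed windows.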
3. Project Specification
3.1 What You Will Build
A production-ready memory service with:
- Multi-tenant data isolation
- Rate limiting by tier
- Comprehensive observability
- Docker/K8s deployment
3.2 Functional Requirements
- Tenant management: POST /admin/tenants → Create tenant
- API key auth: Authorization: Bearer sk-xxx
- Rate limiting: Return 429 when exceeded
- Memory CRUD: Standard memory API with tenant isolation
- Metrics endpoint: GET /metrics (Prometheus format)
- Health check: GET /health → Service status
3.3 Non-Functional Requirements
- Latency: P99 < 200ms for search
- Availability: 99.9% uptime target
- Throughput: 10,000 requests/second
- Security: API key auth, tenant isolation
3.4 Example Usage / Output
# Create tenant (admin only)
curl -X POST https://api.memory.example/admin/tenants \
-H "Authorization: Bearer admin_key" \
-d '{"name": "acme_corp", "tier": "pro"}'
# Response
{
"tenant_id": "tenant_abc123",
"api_key": "sk-acme-xxxxx",
"tier": "pro",
"limits": {
"requests_per_minute": 1000,
"max_memories": 100000
}
}
# Use memory API with tenant key
curl -X POST https://api.memory.example/v1/memories \
-H "Authorization: Bearer sk-acme-xxxxx" \
-d '{"content": "User prefers Python"}'
# Response
{
"id": "mem_xyz789",
"content": "User prefers Python",
"created_at": "2024-12-15T10:30:00Z"
}
# Rate limit exceeded
curl -X POST https://api.memory.example/v1/memories \
-H "Authorization: Bearer sk-acme-xxxxx" \
-d '{"content": "Another memory"}'
# Response (429)
{
"error": "rate_limit_exceeded",
"retry_after": 30,
"limit": 1000,
"remaining": 0,
"reset_at": "2024-12-15T10:31:00Z"
}
# Health check
curl https://api.memory.example/health
# Response
{
"status": "healthy",
"components": {
"neo4j": "healthy",
"chromadb": "healthy",
"redis": "healthy"
},
"version": "1.2.0"
}
# Metrics (Prometheus format)
curl https://api.memory.example/metrics
# memory_requests_total{tenant="acme",method="search",status="200"} 1234
# memory_request_duration_seconds_bucket{tenant="acme",le="0.1"} 1100
# memory_entities_count{tenant="acme"} 50000
4. Solution Architecture
4.1 High-Level Design
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Client │────▶│ Load Balancer │────▶│ API Gateway │
└───────────────┘ └───────────────┘ └───────┬───────┘
│
┌─────────┴─────────┐
│ │
▼ ▼
┌───────────┐ ┌───────────┐
│ Service │ │ Service │
│ Instance 1│ │ Instance 2│
└─────┬─────┘ └─────┬─────┘
│ │
└─────────┬─────────┘
│
┌─────────────────────────┼─────────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Neo4j │ │ ChromaDB │ │ Redis │
│ Cluster │ │ │ │ Cluster │
└───────────┘ └───────────┘ └───────────┘
4.2 Key Components
| Component | Responsibility | Technology |
|---|---|---|
| Load Balancer | Distribute traffic | nginx / AWS ALB |
| API Gateway | Auth, rate limit, routing | FastAPI + Redis |
| Memory Service | Core business logic | Python (Projects 1-13) |
| Cache | Response caching | Redis |
| Metrics | Observability | Prometheus |
| Logs | Event logging | Loki / ELK |
| Traces | Request tracing | Jaeger / OpenTelemetry |
4.3 Data Models
from pydantic import BaseModel
from datetime import datetime
from typing import Literal
class Tenant(BaseModel):
id: str
name: str
api_key_hash: str
tier: Literal["free", "pro", "enterprise"]
created_at: datetime
class TenantLimits(BaseModel):
requests_per_minute: int
max_memories: int
max_graph_nodes: int
class RateLimitInfo(BaseModel):
limit: int
remaining: int
reset_at: datetime
class HealthStatus(BaseModel):
status: Literal["healthy", "degraded", "unhealthy"]
components: dict[str, str]
version: str
5. Implementation Guide
5.1 Development Environment Setup
mkdir production-memory && cd production-memory
python -m venv .venv && source .venv/bin/activate
pip install fastapi uvicorn redis prometheus-client structlog opentelemetry-api opentelemetry-sdk
5.2 Project Structure
production-memory/
├── src/
│ ├── main.py # FastAPI app
│ ├── config.py # Configuration
│ ├── middleware/
│ │ ├── auth.py # Authentication
│ │ ├── rate_limit.py # Rate limiting
│ │ └── observability.py # Logging, metrics, traces
│ ├── routers/
│ │ ├── v1/ # API v1
│ │ └── admin/ # Admin endpoints
│ ├── services/
│ │ ├── tenant.py # Tenant management
│ │ └── memory.py # Memory operations
│ └── models.py # Data models
├── deploy/
│ ├── Dockerfile
│ ├── docker-compose.yml
│ └── kubernetes/
│ ├── deployment.yaml
│ └── service.yaml
├── tests/
│ ├── test_auth.py
│ ├── test_rate_limit.py
│ └── load_test.py
└── README.md
5.3 Implementation Phases
Phase 1: Multi-Tenancy (10-12h)
Goals:
- Tenant management working
- Data isolation enforced
Tasks:
- Build tenant registration endpoint
- Implement API key authentication
- Add tenant context to all requests
- Create tenant-isolated data stores
Checkpoint: Each tenant sees only their data.
Phase 2: Rate Limiting (8-10h)
Goals:
- Token bucket rate limiting
- Per-tenant limits by tier
Tasks:
- Implement token bucket with Redis
- Add rate limit headers to responses
- Create tier-based limit configuration
- Handle rate limit exceeded gracefully
Checkpoint: Rate limits enforced correctly.
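Two of the tasks above — tier-based limit configuration and rate limit headers — can be sketched together. The `X-RateLimit-*` header names are a common convention rather than a standard; the tier values come from the rate limiting strategy in section 2.4 (enterprise is per-contract, so it is omitted here):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class TierLimits:
    requests_per_minute: int
    max_memories: int


# Tier table from the rate limiting strategy; loaded from config in production.
TIER_LIMITS = {
    "free": TierLimits(requests_per_minute=100, max_memories=1_000),
    "pro": TierLimits(requests_per_minute=1_000, max_memories=100_000),
}


def rate_limit_headers(tier: str, remaining: int, window_start: datetime) -> dict[str, str]:
    """Conventional X-RateLimit-* headers for a one-minute window."""
    limits = TIER_LIMITS[tier]
    reset_at = window_start + timedelta(minutes=1)
    return {
        "X-RateLimit-Limit": str(limits.requests_per_minute),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": reset_at.replace(microsecond=0).isoformat(),
    }


headers = rate_limit_headers("pro", 0, datetime(2024, 12, 15, 10, 30, tzinfo=timezone.utc))
```

In the real service the `remaining` count would come from the Redis-backed token bucket, keyed per tenant, so all instances agree on the same budget.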
Phase 3: Observability (10-12h)
Goals:
- Prometheus metrics
- Structured logging
- Distributed tracing
Tasks:
- Add Prometheus metrics middleware
- Implement structured JSON logging
- Add OpenTelemetry tracing
- Create health check endpoint
Checkpoint: Full observability stack working.
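The structured JSON logging task can be sketched with only the standard library (the project's `structlog` dependency offers the same idea with less boilerplate). The output shape matches the log example in section 2.4:

```python
import json
import logging
import time
import uuid


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Merge structured fields passed via `extra={"fields": {...}}`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


logger = logging.getLogger("memory_service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

request_id = f"req_{uuid.uuid4().hex[:8]}"
start = time.perf_counter()
# ... handle the request ...
logger.info(
    "search completed",
    extra={"fields": {
        "tenant_id": "acme",
        "request_id": request_id,
        "method": "search",
        "latency_ms": round((time.perf_counter() - start) * 1000, 1),
    }},
)
```

Attaching `tenant_id` and `request_id` to every line is what makes Loki/ELK queries like "all logs for req_abc123" possible — the same IDs should be propagated into trace spans.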
Phase 4: Deployment (10-12h)
Goals:
- Docker containerization
- Kubernetes manifests
- Production configuration
Tasks:
- Create Dockerfile
- Build docker-compose for local dev
- Create K8s deployment manifests
- Document production checklist
Checkpoint: Service deployable to production.
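A minimal Dockerfile sketch for the first task, assuming the project structure from 5.2 (with `src/main.py` exposing `app`) and a `requirements.txt` you would generate from the dependencies in 5.1:

```dockerfile
# Sketch only — pin versions and add a healthcheck for real production use.
FROM python:3.12-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/ ./src/

# Run as a non-root user; K8s liveness/readiness probes hit GET /health.
RUN useradd --create-home runner
USER runner

EXPOSE 8000
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Copying `requirements.txt` before the source keeps the dependency layer cached across code-only rebuilds.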
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | Test components | Rate limiter logic |
| Integration | Test full stack | Auth → rate limit → memory |
| Load | Test performance | 1000 concurrent requests |
| Security | Test isolation | Cross-tenant access attempts |
6.2 Critical Test Cases
- Auth required: Unauthenticated requests rejected
- Tenant isolation: Tenant A can’t see Tenant B’s data
- Rate limiting: Limits enforced per tenant
- Graceful degradation: Service handles failures
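The tenant isolation test case above can be expressed against a namespaced stub store (a stand-in for the real Neo4j/ChromaDB-backed service) — the key property is that no unscoped query path exists:

```python
class MemoryStore:
    """Stub store where every read and write is scoped by tenant_id."""

    def __init__(self):
        self._data: dict[str, list[str]] = {}  # tenant_id -> memories

    def add(self, tenant_id: str, content: str) -> None:
        self._data.setdefault(tenant_id, []).append(content)

    def list(self, tenant_id: str) -> list[str]:
        # There is deliberately no method that returns data across tenants.
        return list(self._data.get(tenant_id, []))


store = MemoryStore()
store.add("acme", "User prefers Python")
store.add("beta", "User prefers Go")

# Tenant A must never see Tenant B's data.
assert "User prefers Go" not in store.list("acme")
assert store.list("beta") == ["User prefers Go"]
```

The integration version of this test would hit the real API with two tenants' API keys and assert the same invariant end to end, including attempts to pass a forged `tenant_id` in the request body.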
7. Common Pitfalls & Debugging
| Pitfall | Symptom | Solution |
|---|---|---|
| Shared state | Cross-tenant data leak | Strict namespace isolation |
| Redis bottleneck | Rate limiting slow | Use Redis cluster |
| Missing context | Can’t debug requests | Add request_id everywhere |
| Memory leaks | Service crashes over time | Profile with py-spy |
8. Extensions & Challenges
8.1 Beginner Extensions
- Add usage dashboard per tenant
- Implement webhook notifications
8.2 Intermediate Extensions
- Add billing integration (Stripe)
- Implement tenant SSO
8.3 Advanced Extensions
- Add multi-region deployment
- Implement tenant resource quotas
9. Real-World Connections
9.1 Industry Applications
- OpenAI API: Multi-tenant LLM service
- Pinecone: Multi-tenant vector service
- Zep Cloud: Production memory service
9.2 Interview Relevance
- Explain multi-tenancy strategies
- Discuss rate limiting algorithms
- Describe observability best practices
10. Resources
10.1 Essential Reading
- “Designing Data-Intensive Applications” by Kleppmann — Ch. on Reliability
- “Site Reliability Engineering” by Google — Production practices
- 12-Factor App — Cloud-native principles
10.2 Related Projects
- Previous: Project 13 (Multi-Agent Shared Memory)
- Next: Project 15 (Memory Benchmark Suite)
11. Self-Assessment Checklist
- I can design multi-tenant data isolation
- I understand token bucket rate limiting
- I can implement production observability
- I know deployment best practices
12. Submission / Completion Criteria
Minimum Viable Completion:
- Multi-tenant isolation working
- Rate limiting implemented
- Basic health checks
Full Completion:
- Prometheus metrics
- Structured logging
- Docker deployment
Excellence:
- Kubernetes manifests
- Load testing results
- Production runbook