Project 14: Production Memory Service

Build a production-ready memory service with multi-tenancy, rate limiting, observability, and deployment configuration for real-world AI applications.

Quick Reference

Attribute       Value
Difficulty      Level 5: Master
Time Estimate   3-4 weeks (40-50 hours)
Language        Python
Prerequisites   Projects 1-13, DevOps, Docker, API design
Key Topics      Multi-tenancy, rate limiting, observability, caching, deployment, API versioning, security

1. Learning Objectives

By completing this project, you will:

  1. Design multi-tenant memory architecture with data isolation.
  2. Implement rate limiting and quota management.
  3. Build comprehensive observability (metrics, logs, traces).
  4. Create production deployment with Docker/K8s.
  5. Handle security, authentication, and API versioning.

2. Theoretical Foundation

2.1 Core Concepts

  • Multi-Tenancy: Serving multiple customers from a single deployment with data isolation.

  • Rate Limiting: Protecting the service from abuse and ensuring fair resource allocation.

  • Observability: Metrics (numbers), logs (events), traces (request flows) for understanding system behavior.

  • Caching: Reducing latency and load by storing frequently accessed data.

  • API Versioning: Evolving the API without breaking existing clients.
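
To make the caching and multi-tenancy ideas concrete, here is a minimal sketch of a cache-aside lookup whose keys are partitioned by tenant. It uses an in-process dict as a stand-in for Redis. The names `TenantCache` and `get_or_load` and the key layout `tenant:{id}:{key}` are illustrative assumptions, not part of the project specification.

```python
import time
from typing import Callable, Dict, Tuple


class TenantCache:
    """In-process stand-in for a Redis cache, partitioned by tenant (hypothetical)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (expires_at, value)

    @staticmethod
    def _key(tenant_id: str, key: str) -> str:
        # Prefixing every key with the tenant ID keeps partitions disjoint.
        # Assumption: tenant IDs never contain ":".
        return f"tenant:{tenant_id}:{key}"

    def get_or_load(self, tenant_id: str, key: str, loader: Callable[[], str]) -> str:
        """Cache-aside: return a fresh cached value, else call loader and store it."""
        full_key = self._key(tenant_id, key)
        now = self._clock()
        entry = self._store.get(full_key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader()
        self._store[full_key] = (now + self._ttl, value)
        return value
```

Because the tenant ID is part of every key, a cache hit for one tenant can never return another tenant's value, even when both issue the same query.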

2.2 Why This Matters

A memory system that works in development often fails in production, where stricter requirements apply:

  • One customer’s traffic shouldn’t affect others
  • Runaway requests shouldn’t crash the system
  • When something breaks, you need to know why
  • Deployment should be repeatable and safe

2.3 Common Misconceptions

  • “Multi-tenancy is just adding user_id.” True isolation requires careful architecture.
  • “Rate limiting is optional.” Without it, one bad actor can take down your service.
  • “Logs are enough.” You need metrics for trends and traces for debugging.

2.4 ASCII Diagram: Production Architecture

PRODUCTION MEMORY SERVICE ARCHITECTURE
══════════════════════════════════════════════════════════════

                         CLIENTS
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                      LOAD BALANCER                           │
│                     (nginx / ALB)                            │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                       API GATEWAY                            │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │    Auth     │  │    Rate     │  │   Request   │          │
│  │  Middleware │─▶│   Limiter   │─▶│   Router    │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
│         │                │                │                 │
│         ▼                ▼                ▼                 │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              OBSERVABILITY LAYER                     │    │
│  │  • Request logging (structured JSON)                │    │
│  │  • Metrics emission (latency, errors, throughput)   │    │
│  │  • Trace propagation (request IDs)                  │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                              │
└───────────────────────────┬─────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                    MEMORY SERVICE                            │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                 API VERSIONS                          │  │
│  │  /v1/memories  →  V1Handler (stable)                 │  │
│  │  /v2/memories  →  V2Handler (beta)                   │  │
│  └───────────────────────────────────────────────────────┘  │
│                            │                                 │
│                            ▼                                 │
│  ┌───────────────────────────────────────────────────────┐  │
│  │               TENANT ISOLATION                        │  │
│  │                                                       │  │
│  │  Request: tenant_id = "acme_corp"                    │  │
│  │    → Route to tenant-specific:                       │  │
│  │      • Database namespace                            │  │
│  │      • Cache partition                               │  │
│  │      • Rate limit bucket                             │  │
│  └───────────────────────────────────────────────────────┘  │
│                            │                                 │
│              ┌─────────────┴─────────────┐                  │
│              │                           │                   │
│              ▼                           ▼                   │
│     ┌─────────────┐              ┌─────────────┐            │
│     │   Cache     │              │   Memory    │            │
│     │   Layer     │              │   Core      │            │
│     │   (Redis)   │◀────────────▶│  (Projects  │            │
│     └─────────────┘              │   1-13)     │            │
│                                  └─────────────┘            │
│                                         │                    │
└─────────────────────────────────────────┼────────────────────┘
                                          │
                                          ▼
┌─────────────────────────────────────────────────────────────┐
│                     DATA STORES                              │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │   Neo4j     │  │  ChromaDB   │  │  PostgreSQL │         │
│  │   (Graph)   │  │  (Vectors)  │  │  (Metadata) │         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
│                                                              │
│  Data Isolation:                                             │
│  • Tenant "acme": graph_acme, chroma_acme, schema acme     │
│  • Tenant "beta": graph_beta, chroma_beta, schema beta     │
│                                                              │
└─────────────────────────────────────────────────────────────┘


OBSERVABILITY STACK
═══════════════════

┌─────────────────────────────────────────────────────────────┐
│                                                              │
│  METRICS (Prometheus + Grafana)                             │
│  ─────────────────────────────                              │
│  • memory_requests_total{tenant, method, status}            │
│  • memory_request_duration_seconds{tenant, method}          │
│  • memory_entities_count{tenant}                            │
│  • memory_cache_hit_ratio{tenant}                           │
│                                                              │
│  LOGS (ELK or Loki)                                         │
│  ─────────────────                                          │
│  {                                                          │
│    "timestamp": "2024-12-15T10:30:00Z",                     │
│    "level": "info",                                         │
│    "tenant_id": "acme",                                     │
│    "request_id": "req_abc123",                              │
│    "method": "search",                                      │
│    "latency_ms": 45,                                        │
│    "entities_returned": 5                                   │
│  }                                                          │
│                                                              │
│  TRACES (Jaeger)                                            │
│  ──────────────                                             │
│  request_id: req_abc123                                     │
│  ├── api_gateway (5ms)                                      │
│  ├── rate_limiter (1ms)                                     │
│  ├── memory_service (35ms)                                  │
│  │   ├── cache_lookup (2ms) → MISS                         │
│  │   ├── vector_search (20ms)                              │
│  │   └── graph_traverse (10ms)                             │
│  └── response (2ms)                                         │
│                                                              │
└─────────────────────────────────────────────────────────────┘


RATE LIMITING STRATEGY
═════════════════════

┌─────────────────────────────────────────────────────────────┐
│  Tier: Free                                                  │
│  ─────────                                                  │
│  • 100 requests/minute per tenant                           │
│  • 1,000 memories maximum                                   │
│  • No priority queue                                        │
├─────────────────────────────────────────────────────────────┤
│  Tier: Pro                                                   │
│  ─────────                                                  │
│  • 1,000 requests/minute per tenant                         │
│  • 100,000 memories maximum                                 │
│  • Priority queue access                                    │
├─────────────────────────────────────────────────────────────┤
│  Tier: Enterprise                                           │
│  ──────────────                                             │
│  • Custom limits                                            │
│  • Dedicated resources                                      │
│  • SLA guarantees                                           │
└─────────────────────────────────────────────────────────────┘

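Algorithm: Token Bucket
- Each tenant has a bucket with capacity C (the maximum burst size)
- Tokens replenish at rate R per second, never exceeding C
- Each request consumes 1 token
- If the bucket is empty → reject with 429 Too Many Requests
- Example: Free tier at 100 requests/minute → R = 100/60 ≈ 1.67 tokens/second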
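
The token bucket rules above can be sketched in a few lines of Python. This is a single-process, in-memory illustration only; the project itself calls for a Redis-backed limiter shared across service instances. The class name `TokenBucket` and the injectable `clock` parameter are assumptions made for the example.

```python
import time
from typing import Callable


class TokenBucket:
    """In-memory token bucket for one tenant (illustrative, not thread-safe)."""

    def __init__(self, capacity: float, refill_rate: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity        # C: maximum number of stored tokens
        self.refill_rate = refill_rate  # R: tokens added per second
        self._clock = clock
        self._tokens = capacity         # start with a full bucket
        self._last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill based on elapsed time, then try to spend `cost` tokens."""
        now = self._clock()
        elapsed = now - self._last_refill
        self._last_refill = now
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_rate)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False  # caller should respond with 429 Too Many Requests


# Free tier from the table above: 100 requests/minute per tenant.
free_tier_bucket = TokenBucket(capacity=100, refill_rate=100 / 60)
```

Refilling lazily on each call avoids a background timer: the bucket only needs the timestamp of the last call and the current token count, which is also the state a Redis-backed version would have to store per tenant.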

3. Project Specification

3.1 What You Will Build

A production-ready memory service with:

  • Multi-tenant data isolation
  • Rate limiting by tier
  • Comprehensive observability
  • Docker/K8s deployment

3.2 Functional Requirements

  1. Tenant management: POST /admin/tenants → Create tenant
  2. API key auth: Authorization: Bearer sk-xxx
  3. Rate limiting: Return 429 when exceeded
  4. Memory CRUD: Standard memory API with tenant isolation
  5. Metrics endpoint: GET /metrics (Prometheus format)
  6. Health check: GET /health → Service status
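
For the health check requirement, one possible way to turn per-component statuses into an overall status is sketched below. The rule used here (any failing critical component means "unhealthy", any other failure means "degraded") and the choice of which components count as critical are assumptions for illustration; the specification does not define them.

```python
from typing import Dict, Iterable


def aggregate_health(components: Dict[str, str], critical: Iterable[str]) -> str:
    """Collapse component statuses into 'healthy', 'degraded', or 'unhealthy'.

    components: component name -> status string, e.g. {"redis": "healthy"}
    critical:   names of components the service cannot run without (assumed)
    """
    failing = {name for name, status in components.items() if status != "healthy"}
    if not failing:
        return "healthy"
    if failing & set(critical):
        return "unhealthy"
    return "degraded"  # only non-critical components are failing


# Hypothetical policy: the data stores are critical, the cache is not.
CRITICAL_COMPONENTS = ("neo4j", "chromadb")
```

With this policy, losing Redis leaves the service reachable but slower, which is reported as "degraded" rather than "unhealthy".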

3.3 Non-Functional Requirements

  • Latency: P99 < 200ms for search
  • Availability: 99.9% uptime target
  • Throughput: 10,000 requests/second
  • Security: API key auth, tenant isolation

3.4 Example Usage / Output

# Create tenant (admin only)
curl -X POST https://api.memory.example/admin/tenants \
  -H "Authorization: Bearer admin_key" \
  -H "Content-Type: application/json" \
  -d '{"name": "acme_corp", "tier": "pro"}'

# Response
{
  "tenant_id": "tenant_abc123",
  "api_key": "sk-acme-xxxxx",
  "tier": "pro",
  "limits": {
    "requests_per_minute": 1000,
    "max_memories": 100000
  }
}

# Use memory API with tenant key
curl -X POST https://api.memory.example/v1/memories \
  -H "Authorization: Bearer sk-acme-xxxxx" \
  -H "Content-Type: application/json" \
  -d '{"content": "User prefers Python"}'

# Response
{
  "id": "mem_xyz789",
  "content": "User prefers Python",
  "created_at": "2024-12-15T10:30:00Z"
}

# Rate limit exceeded
curl -X POST https://api.memory.example/v1/memories \
  -H "Authorization: Bearer sk-acme-xxxxx" \
  -H "Content-Type: application/json" \
  -d '{"content": "Another memory"}'

# Response (429)
{
  "error": "rate_limit_exceeded",
  "retry_after": 30,
  "limit": 1000,
  "remaining": 0,
  "reset_at": "2024-12-15T10:31:00Z"
}

# Health check
curl https://api.memory.example/health

# Response
{
  "status": "healthy",
  "components": {
    "neo4j": "healthy",
    "chromadb": "healthy",
    "redis": "healthy"
  },
  "version": "1.2.0"
}

# Metrics (Prometheus format)
curl https://api.memory.example/metrics

# memory_requests_total{tenant="acme",method="search",status="200"} 1234
# memory_request_duration_seconds_bucket{tenant="acme",method="search",le="0.1"} 1100
# memory_entities_count{tenant="acme"} 50000

4. Solution Architecture

4.1 High-Level Design

┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│    Client     │────▶│ Load Balancer │────▶│  API Gateway  │
└───────────────┘     └───────────────┘     └───────┬───────┘
                                                    │
                                          ┌─────────┴─────────┐
                                          │                   │
                                          ▼                   ▼
                                   ┌───────────┐       ┌───────────┐
                                   │  Service  │       │  Service  │
                                   │ Instance 1│       │ Instance 2│
                                   └─────┬─────┘       └─────┬─────┘
                                         │                   │
                                         └─────────┬─────────┘
                                                   │
                         ┌─────────────────────────┼─────────────────────────┐
                         │                         │                         │
                         ▼                         ▼                         ▼
                  ┌───────────┐             ┌───────────┐             ┌───────────┐
                  │   Neo4j   │             │  ChromaDB │             │   Redis   │
                  │  Cluster  │             │           │             │  Cluster  │
                  └───────────┘             └───────────┘             └───────────┘

4.2 Key Components

Component        Responsibility              Technology
Load Balancer    Distribute traffic          nginx / AWS ALB
API Gateway      Auth, rate limit, routing   FastAPI + Redis
Memory Service   Core business logic         Python (Projects 1-13)
Cache            Response caching            Redis
Metrics          Observability               Prometheus
Logs             Event logging               Loki / ELK
Traces           Request tracing             Jaeger / OpenTelemetry

4.3 Data Models

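from datetime import datetime
from typing import Literal

from pydantic import BaseModel


class Tenant(BaseModel):
    """A customer account; all data is scoped to a tenant."""
    id: str
    name: str
    api_key_hash: str  # store a hash, never the raw API key
    tier: Literal["free", "pro", "enterprise"]
    created_at: datetime


class TenantLimits(BaseModel):
    """Quota limits derived from the tenant's tier."""
    requests_per_minute: int
    max_memories: int
    max_graph_nodes: int


class RateLimitInfo(BaseModel):
    """Rate limit state reported back to the client."""
    limit: int
    remaining: int
    reset_at: datetime


class HealthStatus(BaseModel):
    """Payload returned by GET /health."""
    status: Literal["healthy", "degraded", "unhealthy"]
    components: dict[str, str]  # component name -> status, e.g. {"redis": "healthy"}
    version: str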
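
Since `Tenant` stores `api_key_hash` rather than the key itself, the service needs a way to issue, hash, and verify keys. The following is a minimal standard-library sketch under the assumption that keys are long random strings, for which an unsalted SHA-256 digest is a common choice. The helper names are hypothetical, and the `sk-` prefix follows the example key shown earlier.

```python
import hashlib
import hmac
import secrets
from typing import Optional


def generate_api_key(prefix: str = "sk") -> str:
    """Create a random API key; show it to the tenant once, then discard it."""
    return f"{prefix}-{secrets.token_urlsafe(32)}"


def hash_api_key(api_key: str) -> str:
    """Hash suitable for storage in Tenant.api_key_hash."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def verify_api_key(presented_key: str, stored_hash: str) -> bool:
    """Compare in constant time to avoid leaking information through timing."""
    return hmac.compare_digest(hash_api_key(presented_key), stored_hash)


def extract_bearer_token(authorization_header: str) -> Optional[str]:
    """Parse 'Authorization: Bearer <token>'; return None if malformed."""
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return None
    return token.strip()
```

An auth middleware could call `extract_bearer_token` on the incoming header, hash the result, and look up the tenant by that hash.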

5. Implementation Guide

5.1 Development Environment Setup

mkdir production-memory && cd production-memory
python -m venv .venv && source .venv/bin/activate
pip install fastapi uvicorn redis prometheus-client structlog opentelemetry-api opentelemetry-sdk

5.2 Project Structure

production-memory/
├── src/
│   ├── main.py              # FastAPI app
│   ├── config.py            # Configuration
│   ├── middleware/
│   │   ├── auth.py          # Authentication
│   │   ├── rate_limit.py    # Rate limiting
│   │   └── observability.py # Logging, metrics, traces
│   ├── routers/
│   │   ├── v1/              # API v1
│   │   └── admin/           # Admin endpoints
│   ├── services/
│   │   ├── tenant.py        # Tenant management
│   │   └── memory.py        # Memory operations
│   └── models.py            # Data models
├── deploy/
│   ├── Dockerfile
│   ├── docker-compose.yml
│   └── kubernetes/
│       ├── deployment.yaml
│       └── service.yaml
├── tests/
│   ├── test_auth.py
│   ├── test_rate_limit.py
│   └── load_test.py
└── README.md

5.3 Implementation Phases

Phase 1: Multi-Tenancy (10-12h)

Goals:

  • Tenant management working
  • Data isolation enforced

Tasks:

  1. Build tenant registration endpoint
  2. Implement API key authentication
  3. Add tenant context to all requests
  4. Create tenant-isolated data stores

Checkpoint: Each tenant sees only their data.
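
As a starting point for tasks 3 and 4, the sketch below derives per-store namespaces from a tenant ID and carries the current tenant in a `contextvars.ContextVar`, so code deep in the call stack can resolve its namespaces without passing the ID through every function. The naming scheme mirrors the architecture diagram (`graph_acme`, `chroma_acme`, schema `acme`); the ID pattern and helper names are assumptions.

```python
import re
from contextvars import ContextVar
from dataclasses import dataclass

# Assumed ID format: lowercase identifier, safe to embed in database names.
_SAFE_TENANT_ID = re.compile(r"[a-z][a-z0-9_]{0,30}")

_current_tenant: ContextVar[str] = ContextVar("current_tenant")


@dataclass(frozen=True)
class TenantNamespaces:
    graph_db: str           # Neo4j database name
    vector_collection: str  # ChromaDB collection name
    sql_schema: str         # PostgreSQL schema name


def namespaces_for(tenant_id: str) -> TenantNamespaces:
    """Map a tenant ID to its isolated storage names, rejecting unsafe IDs."""
    if not _SAFE_TENANT_ID.fullmatch(tenant_id):
        raise ValueError(f"invalid tenant id: {tenant_id!r}")
    return TenantNamespaces(
        graph_db=f"graph_{tenant_id}",
        vector_collection=f"chroma_{tenant_id}",
        sql_schema=tenant_id,
    )


def set_current_tenant(tenant_id: str) -> None:
    """Called by the auth layer after the API key has been verified."""
    namespaces_for(tenant_id)  # validate before storing
    _current_tenant.set(tenant_id)


def current_namespaces() -> TenantNamespaces:
    """Raises LookupError if no tenant was set for this request."""
    return namespaces_for(_current_tenant.get())
```

Failing loudly when no tenant is set is deliberate: a missing tenant context should never fall back to a shared default namespace.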

Phase 2: Rate Limiting (8-10h)

Goals:

  • Token bucket rate limiting
  • Per-tenant limits by tier

Tasks:

  1. Implement token bucket with Redis
  2. Add rate limit headers to responses
  3. Create tier-based limit configuration
  4. Handle rate limit exceeded gracefully

Checkpoint: Rate limits enforced correctly.
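
For tasks 2 and 4, a rejected request should tell the client when to retry. The sketch below builds response headers and a 429 body shaped like the example in the specification. `Retry-After` is a standard HTTP header; the `X-RateLimit-*` names are a widely used convention rather than a standard, and the function names are assumptions.

```python
import math
from datetime import datetime, timezone
from typing import Dict, Tuple


def rate_limit_headers(limit: int, remaining: int, reset_at: datetime) -> Dict[str, str]:
    """Headers attached to every response so clients can pace themselves."""
    return {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_at.timestamp())),  # Unix seconds
    }


def too_many_requests(limit: int, reset_at: datetime,
                      now: datetime) -> Tuple[Dict[str, str], Dict[str, object]]:
    """Build (headers, JSON body) for a 429 response. Datetimes must be timezone-aware."""
    retry_after = max(0, math.ceil((reset_at - now).total_seconds()))
    headers = rate_limit_headers(limit, 0, reset_at)
    headers["Retry-After"] = str(retry_after)
    body = {
        "error": "rate_limit_exceeded",
        "retry_after": retry_after,
        "limit": limit,
        "remaining": 0,
        "reset_at": reset_at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    return headers, body
```

Rounding `retry_after` up rather than down keeps a well-behaved client from retrying a fraction of a second too early and being rejected again.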

Phase 3: Observability (10-12h)

Goals:

  • Prometheus metrics
  • Structured logging
  • Distributed tracing

Tasks:

  1. Add Prometheus metrics middleware
  2. Implement structured JSON logging
  3. Add OpenTelemetry tracing
  4. Create health check endpoint

Checkpoint: Full observability stack working.
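
The structured logging task can be prototyped with the standard library before wiring in structlog. The sketch below emits one JSON object per log line with the same fields as the log example in the observability diagram. `JsonFormatter`, `ListHandler`, and `build_logger` are hypothetical names; `ListHandler` exists only to make the output easy to inspect in tests.

```python
import json
import logging
from datetime import datetime, timezone
from typing import List, Optional


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object."""

    # Request-scoped fields passed through logging's `extra=` argument.
    CONTEXT_FIELDS = ("tenant_id", "request_id", "method", "latency_ms")

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


class ListHandler(logging.Handler):
    """Collects formatted lines in memory (test helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.lines: List[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(self.format(record))


def build_logger(name: str, handler: Optional[logging.Handler] = None) -> logging.Logger:
    """Return a logger that writes JSON lines to `handler` (stderr by default)."""
    handler = handler or logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A request handler would then log with `logger.info("search completed", extra={"tenant_id": ..., "request_id": ...})`, which keeps every line filterable by tenant and request in Loki or ELK.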

Phase 4: Deployment (10-12h)

Goals:

  • Docker containerization
  • Kubernetes manifests
  • Production configuration

Tasks:

  1. Create Dockerfile
  2. Build docker-compose for local dev
  3. Create K8s deployment manifests
  4. Document production checklist

Checkpoint: Service deployable to production.


6. Testing Strategy

6.1 Test Categories

Category      Purpose            Examples
Unit          Test components    Rate limiter logic
Integration   Test full stack    Auth → rate limit → memory
Load          Test performance   1,000 concurrent requests
Security      Test isolation     Cross-tenant access attempts

6.2 Critical Test Cases

  1. Auth required: Unauthenticated requests rejected
  2. Tenant isolation: Tenant A can’t see Tenant B’s data
  3. Rate limiting: Limits enforced per tenant
  4. Graceful degradation: Service handles failures
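
The tenant isolation case is the one most worth automating early. Below is a sketch of such a test against a hypothetical in-memory store; in the real project the same assertions would run against the service's API using two tenants' keys. `InMemoryMemoryStore` and its methods are stand-ins, not the project's actual interfaces.

```python
from collections import defaultdict
from typing import Dict, List


class InMemoryMemoryStore:
    """Hypothetical stand-in for the memory service, keyed by tenant."""

    def __init__(self) -> None:
        self._by_tenant: Dict[str, List[str]] = defaultdict(list)

    def add(self, tenant_id: str, content: str) -> None:
        self._by_tenant[tenant_id].append(content)

    def search(self, tenant_id: str, query: str) -> List[str]:
        # Only the calling tenant's partition is ever consulted.
        memories = self._by_tenant.get(tenant_id, [])
        return [m for m in memories if query.lower() in m.lower()]


def test_tenant_isolation() -> None:
    store = InMemoryMemoryStore()
    store.add("tenant_a", "User prefers Python")
    store.add("tenant_b", "User prefers Rust")

    # Each tenant sees its own data.
    assert store.search("tenant_a", "prefers") == ["User prefers Python"]
    assert store.search("tenant_b", "prefers") == ["User prefers Rust"]
    # A query matching another tenant's data returns nothing.
    assert store.search("tenant_b", "Python") == []
    # An unknown tenant sees nothing at all.
    assert store.search("tenant_c", "prefers") == []
```

The function follows pytest's `test_` naming convention, so it would be collected automatically if placed in the `tests/` directory.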

7. Common Pitfalls & Debugging

Pitfall            Symptom                     Solution
Shared state       Cross-tenant data leak      Strict namespace isolation
Redis bottleneck   Rate limiting slow          Use Redis cluster
Missing context    Can’t debug requests        Add request_id everywhere
Memory leaks       Service crashes over time   Profile with tracemalloc or memray (py-spy samples CPU, not memory)

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add usage dashboard per tenant
  • Implement webhook notifications

8.2 Intermediate Extensions

  • Add billing integration (Stripe)
  • Implement tenant SSO

8.3 Advanced Extensions

  • Add multi-region deployment
  • Implement tenant resource quotas

9. Real-World Connections

9.1 Industry Applications

  • OpenAI API: Multi-tenant LLM service
  • Pinecone: Multi-tenant vector service
  • Zep Cloud: Production memory service

9.2 Interview Relevance

  • Explain multi-tenancy strategies
  • Discuss rate limiting algorithms
  • Describe observability best practices

10. Resources

10.1 Essential Reading

  • “Designing Data-Intensive Applications” by Martin Kleppmann: Chapter 1, “Reliable, Scalable, and Maintainable Applications”
  • “Site Reliability Engineering” by Google: Production practices
  • The Twelve-Factor App: Cloud-native principles
  • Previous: Project 13 (Multi-Agent Shared Memory)
  • Next: Project 15 (Memory Benchmark Suite)

11. Self-Assessment Checklist

  • I can design multi-tenant data isolation
  • I understand token bucket rate limiting
  • I can implement production observability
  • I know deployment best practices

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Multi-tenant isolation working
  • Rate limiting implemented
  • Basic health checks

Full Completion:

  • Prometheus metrics
  • Structured logging
  • Docker deployment

Excellence:

  • Kubernetes manifests
  • Load testing results
  • Production runbook