Project 14: Production Memory Service
Build a production-ready memory service with multi-tenancy, rate limiting, observability, and deployment configuration for real-world AI applications.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 5: Master |
| Time Estimate | 3-4 weeks (40-50 hours) |
| Language | Python |
| Prerequisites | Projects 1-13, DevOps, Docker, API design |
| Key Topics | Multi-tenancy, rate limiting, observability, caching, deployment, API versioning, security |
1. Learning Objectives
By completing this project, you will:
- Design multi-tenant memory architecture with data isolation.
- Implement rate limiting and quota management.
- Build comprehensive observability (metrics, logs, traces).
- Create production deployment with Docker/K8s.
- Handle security, authentication, and API versioning.
2. Theoretical Foundation
2.1 Core Concepts
- Multi-Tenancy: Serving multiple customers from a single deployment with data isolation.
- Rate Limiting: Protecting the service from abuse and ensuring fair resource allocation.
- Observability: Metrics (numbers), logs (events), traces (request flows) for understanding system behavior.
- Caching: Reducing latency and load by storing frequently accessed data.
- API Versioning: Evolving the API without breaking existing clients.
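The caching concept above can be sketched with the cache-aside pattern: check the cache first, and only on a miss call the expensive backend. This is a minimal in-process sketch; in this project's architecture the cache would be Redis and the backend a ChromaDB query. The `expensive_vector_search` function is a hypothetical stand-in.

```python
import time

# Hypothetical in-process cache standing in for Redis: key -> (value, expiry).
_cache: dict[str, tuple[object, float]] = {}


def expensive_vector_search(query: str) -> list[str]:
    # Stand-in for the real ChromaDB query.
    return [f"result for {query}"]


def cached_search(query: str, ttl_seconds: float = 60.0):
    """Cache-aside: serve from cache on a fresh hit, else compute and store."""
    entry = _cache.get(query)
    if entry is not None and entry[1] > time.monotonic():
        return entry[0]  # hit: skip the expensive backend call
    result = expensive_vector_search(query)
    _cache[query] = (result, time.monotonic() + ttl_seconds)
    return result


hit = cached_search("python")    # miss: computed and stored
again = cached_search("python")  # hit: served from cache
```

The TTL bounds staleness: after `ttl_seconds`, the entry expires and the next call recomputes.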
2.2 Why This Matters
A memory system that works in development often fails in production:
- One customer’s traffic shouldn’t affect others
- Runaway requests shouldn’t crash the system
- When something breaks, you need to know why
- Deployment should be repeatable and safe
2.3 Common Misconceptions
- “Multi-tenancy is just adding user_id.” True isolation requires careful architecture.
- “Rate limiting is optional.” Without it, one bad actor can take down your service.
- “Logs are enough.” You need metrics for trends and traces for debugging.
2.4 ASCII Diagram: Production Architecture
PRODUCTION MEMORY SERVICE ARCHITECTURE
══════════════════════════════════════════════════════════════
CLIENTS
│
▼
┌─────────────────────────────────────────────────────────────┐
│ LOAD BALANCER │
│ (nginx / ALB) │
└───────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ API GATEWAY │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Auth │ │ Rate │ │ Request │ │
│ │ Middleware │─▶│ Limiter │─▶│ Router │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │ │ │ │
│ ▼ ▼ ▼ │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ OBSERVABILITY LAYER │ │
│ │ • Request logging (structured JSON) │ │
│ │ • Metrics emission (latency, errors, throughput) │ │
│ │ • Trace propagation (request IDs) │ │
│ └─────────────────────────────────────────────────────┘ │
│ │
└───────────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ MEMORY SERVICE │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ API VERSIONS │ │
│ │ /v1/memories → V1Handler (stable) │ │
│ │ /v2/memories → V2Handler (beta) │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ TENANT ISOLATION │ │
│ │ │ │
│ │ Request: tenant_id = "acme_corp" │ │
│ │ → Route to tenant-specific: │ │
│ │ • Database namespace │ │
│ │ • Cache partition │ │
│ │ • Rate limit bucket │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────┴─────────────┐ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ Cache │ │ Memory │ │
│ │ Layer │ │ Core │ │
│ │ (Redis) │◀────────────▶│ (Projects │ │
│ └─────────────┘ │ 1-13) │ │
│ └─────────────┘ │
│ │ │
└─────────────────────────────────────────┼────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ DATA STORES │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Neo4j │ │ ChromaDB │ │ PostgreSQL │ │
│ │ (Graph) │ │ (Vectors) │ │ (Metadata) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Data Isolation: │
│ • Tenant "acme": graph_acme, chroma_acme, schema acme │
│ • Tenant "beta": graph_beta, chroma_beta, schema beta │
│ │
└─────────────────────────────────────────────────────────────┘
OBSERVABILITY STACK
═══════════════════
┌─────────────────────────────────────────────────────────────┐
│ │
│ METRICS (Prometheus + Grafana) │
│ ───────────────────────────── │
│ • memory_requests_total{tenant, method, status} │
│ • memory_request_duration_seconds{tenant, method} │
│ • memory_entities_count{tenant} │
│ • memory_cache_hit_ratio{tenant} │
│ │
│ LOGS (ELK or Loki) │
│ ───────────────── │
│ { │
│ "timestamp": "2024-12-15T10:30:00Z", │
│ "level": "info", │
│ "tenant_id": "acme", │
│ "request_id": "req_abc123", │
│ "method": "search", │
│ "latency_ms": 45, │
│ "entities_returned": 5 │
│ } │
│ │
│ TRACES (Jaeger) │
│ ────────────── │
│ request_id: req_abc123 │
│ ├── api_gateway (5ms) │
│ ├── rate_limiter (1ms) │
│ ├── memory_service (35ms) │
│ │ ├── cache_lookup (2ms) → MISS │
│ │ ├── vector_search (20ms) │
│ │ └── graph_traverse (10ms) │
│ └── response (2ms) │
│ │
└─────────────────────────────────────────────────────────────┘
RATE LIMITING STRATEGY
═════════════════════
┌─────────────────────────────────────────────────────────────┐
│ Tier: Free │
│ ───────── │
│ • 100 requests/minute per tenant │
│ • 1,000 memories maximum │
│ • No priority queue │
├─────────────────────────────────────────────────────────────┤
│ Tier: Pro │
│ ───────── │
│ • 1,000 requests/minute per tenant │
│ • 100,000 memories maximum │
│ • Priority queue access │
├─────────────────────────────────────────────────────────────┤
│ Tier: Enterprise │
│ ────────────── │
│ • Custom limits │
│ • Dedicated resources │
│ • SLA guarantees │
└─────────────────────────────────────────────────────────────┘
Algorithm: Token Bucket
- Each tenant has a bucket with capacity C
- Tokens replenish at rate R per second
- Request consumes 1 token
- If bucket empty → 429 Too Many Requests
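The four rules above translate almost directly into code. This is a minimal single-process sketch; in this project the bucket state would live in Redis so that limits are shared across service instances.

```python
import time


class TokenBucket:
    """Token bucket: capacity C, refilled at R tokens per second."""

    def __init__(self, capacity: int, refill_rate: float):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; False means respond with 429."""
        now = time.monotonic()
        # Replenish tokens accrued since the last check, capped at capacity.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Toy limits for illustration: burst of 5, refilling 1 token/second.
bucket = TokenBucket(capacity=5, refill_rate=1.0)
results = [bucket.allow() for _ in range(7)]
# The first 5 requests pass; the rest are rejected until tokens refill.
```

Because the bucket starts full, tenants can burst up to the capacity, then settle to the sustained rate R — which is why token buckets are friendlier to real traffic than fixed windows.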
3. Project Specification
3.1 What You Will Build
A production-ready memory service with:
- Multi-tenant data isolation
- Rate limiting by tier
- Comprehensive observability
- Docker/K8s deployment
3.2 Functional Requirements
- Tenant management: POST /admin/tenants → Create tenant
- API key auth: Authorization: Bearer sk-xxx
- Rate limiting: Return 429 when exceeded
- Memory CRUD: Standard memory API with tenant isolation
- Metrics endpoint: GET /metrics (Prometheus format)
- Health check: GET /health → Service status
3.3 Non-Functional Requirements
- Latency: P99 < 200ms for search
- Availability: 99.9% uptime target
- Throughput: 10,000 requests/second
- Security: API key auth, tenant isolation
3.4 Example Usage / Output
# Create tenant (admin only)
curl -X POST https://api.memory.example/admin/tenants \
-H "Authorization: Bearer admin_key" \
-d '{"name": "acme_corp", "tier": "pro"}'
# Response
{
"tenant_id": "tenant_abc123",
"api_key": "sk-acme-xxxxx",
"tier": "pro",
"limits": {
"requests_per_minute": 1000,
"max_memories": 100000
}
}
# Use memory API with tenant key
curl -X POST https://api.memory.example/v1/memories \
-H "Authorization: Bearer sk-acme-xxxxx" \
-d '{"content": "User prefers Python"}'
# Response
{
"id": "mem_xyz789",
"content": "User prefers Python",
"created_at": "2024-12-15T10:30:00Z"
}
# Rate limit exceeded
curl -X POST https://api.memory.example/v1/memories \
-H "Authorization: Bearer sk-acme-xxxxx" \
-d '{"content": "Another memory"}'
# Response (429)
{
"error": "rate_limit_exceeded",
"retry_after": 30,
"limit": 1000,
"remaining": 0,
"reset_at": "2024-12-15T10:31:00Z"
}
# Health check
curl https://api.memory.example/health
# Response
{
"status": "healthy",
"components": {
"neo4j": "healthy",
"chromadb": "healthy",
"redis": "healthy"
},
"version": "1.2.0"
}
# Metrics (Prometheus format)
curl https://api.memory.example/metrics
# memory_requests_total{tenant="acme",method="search",status="200"} 1234
# memory_request_duration_seconds_bucket{tenant="acme",le="0.1"} 1100
# memory_entities_count{tenant="acme"} 50000
4. Solution Architecture
4.1 High-Level Design
┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ Client │────▶│ Load Balancer │────▶│ API Gateway │
└───────────────┘ └───────────────┘ └───────┬───────┘
│
┌─────────┴─────────┐
│ │
▼ ▼
┌───────────┐ ┌───────────┐
│ Service │ │ Service │
│ Instance 1│ │ Instance 2│
└─────┬─────┘ └─────┬─────┘
│ │
└─────────┬─────────┘
│
┌─────────────────────────┼─────────────────────────┐
│ │ │
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Neo4j │ │ ChromaDB │ │ Redis │
│ Cluster │ │ │ │ Cluster │
└───────────┘ └───────────┘ └───────────┘
4.2 Key Components
| Component | Responsibility | Technology |
|---|---|---|
| Load Balancer | Distribute traffic | nginx / AWS ALB |
| API Gateway | Auth, rate limit, routing | FastAPI + Redis |
| Memory Service | Core business logic | Python (Projects 1-13) |
| Cache | Response caching | Redis |
| Metrics | Observability | Prometheus |
| Logs | Event logging | Loki / ELK |
| Traces | Request tracing | Jaeger / OpenTelemetry |
4.3 Data Models
from pydantic import BaseModel
from datetime import datetime
from typing import Literal
class Tenant(BaseModel):
id: str
name: str
api_key_hash: str
tier: Literal["free", "pro", "enterprise"]
created_at: datetime
class TenantLimits(BaseModel):
requests_per_minute: int
max_memories: int
max_graph_nodes: int
class RateLimitInfo(BaseModel):
limit: int
remaining: int
reset_at: datetime
class HealthStatus(BaseModel):
status: Literal["healthy", "degraded", "unhealthy"]
components: dict[str, str]
version: str
5. Implementation Guide
5.1 Development Environment Setup
mkdir production-memory && cd production-memory
python -m venv .venv && source .venv/bin/activate
pip install fastapi uvicorn redis prometheus-client structlog opentelemetry-api opentelemetry-sdk
5.2 Project Structure
production-memory/
├── src/
│ ├── main.py # FastAPI app
│ ├── config.py # Configuration
│ ├── middleware/
│ │ ├── auth.py # Authentication
│ │ ├── rate_limit.py # Rate limiting
│ │ └── observability.py # Logging, metrics, traces
│ ├── routers/
│ │ ├── v1/ # API v1
│ │ └── admin/ # Admin endpoints
│ ├── services/
│ │ ├── tenant.py # Tenant management
│ │ └── memory.py # Memory operations
│ └── models.py # Data models
├── deploy/
│ ├── Dockerfile
│ ├── docker-compose.yml
│ └── kubernetes/
│ ├── deployment.yaml
│ └── service.yaml
├── tests/
│ ├── test_auth.py
│ ├── test_rate_limit.py
│ └── load_test.py
└── README.md
5.3 Implementation Phases
Phase 1: Multi-Tenancy (10-12h)
Goals:
- Tenant management working
- Data isolation enforced
Tasks:
- Build tenant registration endpoint
- Implement API key authentication
- Add tenant context to all requests
- Create tenant-isolated data stores
Checkpoint: Each tenant sees only their data.
Phase 2: Rate Limiting (8-10h)
Goals:
- Token bucket rate limiting
- Per-tenant limits by tier
Tasks:
- Implement token bucket with Redis
- Add rate limit headers to responses
- Create tier-based limit configuration
- Handle rate limit exceeded gracefully
Checkpoint: Rate limits enforced correctly.
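Two of the tasks above — tier-based limit configuration and rate limit headers — can be sketched together. The `X-RateLimit-*` header names are a common convention rather than a standard; the tier values come from the rate limiting strategy in section 2.4 (enterprise is per-contract, so it is omitted here):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class TierLimits:
    requests_per_minute: int
    max_memories: int


# Tier table from the rate limiting strategy; loaded from config in production.
TIER_LIMITS = {
    "free": TierLimits(requests_per_minute=100, max_memories=1_000),
    "pro": TierLimits(requests_per_minute=1_000, max_memories=100_000),
}


def rate_limit_headers(tier: str, remaining: int, window_start: datetime) -> dict[str, str]:
    """Conventional X-RateLimit-* headers for a one-minute window."""
    limits = TIER_LIMITS[tier]
    reset_at = window_start + timedelta(minutes=1)
    return {
        "X-RateLimit-Limit": str(limits.requests_per_minute),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": reset_at.replace(microsecond=0).isoformat(),
    }


headers = rate_limit_headers("pro", 0, datetime(2024, 12, 15, 10, 30, tzinfo=timezone.utc))
```

In the real service the `remaining` count would come from the Redis-backed token bucket, keyed per tenant, so all instances agree on the same budget.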
Phase 3: Observability (10-12h)
Goals:
- Prometheus metrics
- Structured logging
- Distributed tracing
Tasks:
- Add Prometheus metrics middleware
- Implement structured JSON logging
- Add OpenTelemetry tracing
- Create health check endpoint
Checkpoint: Full observability stack working.
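The structured JSON logging task can be sketched with only the standard library (the project's `structlog` dependency offers the same idea with less boilerplate). The output shape matches the log example in section 2.4:

```python
import json
import logging
import time
import uuid


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # Merge structured fields passed via `extra={"fields": {...}}`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


logger = logging.getLogger("memory_service")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

request_id = f"req_{uuid.uuid4().hex[:8]}"
start = time.perf_counter()
# ... handle the request ...
logger.info(
    "search completed",
    extra={"fields": {
        "tenant_id": "acme",
        "request_id": request_id,
        "method": "search",
        "latency_ms": round((time.perf_counter() - start) * 1000, 1),
    }},
)
```

Attaching `tenant_id` and `request_id` to every line is what makes Loki/ELK queries like "all logs for req_abc123" possible — the same IDs should be propagated into trace spans.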
Phase 4: Deployment (10-12h)
Goals:
- Docker containerization
- Kubernetes manifests
- Production configuration
Tasks:
- Create Dockerfile
- Build docker-compose for local dev
- Create K8s deployment manifests
- Document production checklist
Checkpoint: Service deployable to production.
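A minimal Dockerfile sketch for the first task, assuming the project structure from 5.2 (with `src/main.py` exposing `app`) and a `requirements.txt` you would generate from the dependencies in 5.1:

```dockerfile
# Sketch only — pin versions and add a healthcheck for real production use.
FROM python:3.12-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/ ./src/

# Run as a non-root user; K8s liveness/readiness probes hit GET /health.
RUN useradd --create-home runner
USER runner

EXPOSE 8000
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Copying `requirements.txt` before the source keeps the dependency layer cached across code-only rebuilds.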
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | Test components | Rate limiter logic |
| Integration | Test full stack | Auth → rate limit → memory |
| Load | Test performance | 1000 concurrent requests |
| Security | Test isolation | Cross-tenant access attempts |
6.2 Critical Test Cases
- Auth required: Unauthenticated requests rejected
- Tenant isolation: Tenant A can’t see Tenant B’s data
- Rate limiting: Limits enforced per tenant
- Graceful degradation: Service handles failures
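The tenant isolation test case above can be expressed against a namespaced stub store (a stand-in for the real Neo4j/ChromaDB-backed service) — the key property is that no unscoped query path exists:

```python
class MemoryStore:
    """Stub store where every read and write is scoped by tenant_id."""

    def __init__(self):
        self._data: dict[str, list[str]] = {}  # tenant_id -> memories

    def add(self, tenant_id: str, content: str) -> None:
        self._data.setdefault(tenant_id, []).append(content)

    def list(self, tenant_id: str) -> list[str]:
        # There is deliberately no method that returns data across tenants.
        return list(self._data.get(tenant_id, []))


store = MemoryStore()
store.add("acme", "User prefers Python")
store.add("beta", "User prefers Go")

# Tenant A must never see Tenant B's data.
assert "User prefers Go" not in store.list("acme")
assert store.list("beta") == ["User prefers Go"]
```

The integration version of this test would hit the real API with two tenants' API keys and assert the same invariant end to end, including attempts to pass a forged `tenant_id` in the request body.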
7. Common Pitfalls & Debugging
| Pitfall | Symptom | Solution |
|---|---|---|
| Shared state | Cross-tenant data leak | Strict namespace isolation |
| Redis bottleneck | Rate limiting slow | Use Redis cluster |
| Missing context | Can’t debug requests | Add request_id everywhere |
| Memory leaks | Service crashes over time | Profile with py-spy |
8. Extensions & Challenges
8.1 Beginner Extensions
- Add usage dashboard per tenant
- Implement webhook notifications
8.2 Intermediate Extensions
- Add billing integration (Stripe)
- Implement tenant SSO
8.3 Advanced Extensions
- Add multi-region deployment
- Implement tenant resource quotas
9. Real-World Connections
9.1 Industry Applications
- OpenAI API: Multi-tenant LLM service
- Pinecone: Multi-tenant vector service
- Zep Cloud: Production memory service
9.2 Interview Relevance
- Explain multi-tenancy strategies
- Discuss rate limiting algorithms
- Describe observability best practices
10. Resources
10.1 Essential Reading
- “Designing Data-Intensive Applications” by Kleppmann — Ch. on Reliability
- “Site Reliability Engineering” by Google — Production practices
- 12-Factor App — Cloud-native principles
10.2 Related Projects
- Previous: Project 13 (Multi-Agent Shared Memory)
- Next: Project 15 (Memory Benchmark Suite)
11. Self-Assessment Checklist
- I can design multi-tenant data isolation
- I understand token bucket rate limiting
- I can implement production observability
- I know deployment best practices
12. Submission / Completion Criteria
Minimum Viable Completion:
- Multi-tenant isolation working
- Rate limiting implemented
- Basic health checks
Full Completion:
- Prometheus metrics
- Structured logging
- Docker deployment
Excellence:
- Kubernetes manifests
- Load testing results
- Production runbook