Project 15: Memory Benchmark Suite
Build a comprehensive benchmark suite that evaluates memory systems on accuracy, latency, recall, and temporal reasoning capabilities with reproducible tests and comparison reports.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 2-3 weeks (30-40 hours) |
| Language | Python |
| Prerequisites | Projects 1-14, evaluation methodology, statistics |
| Key Topics | Benchmarking, evaluation metrics, reproducibility, statistical significance, comparative analysis |
1. Learning Objectives
By completing this project, you will:
- Design benchmarks that measure memory system quality.
- Implement metrics for accuracy, recall, latency, and temporal reasoning.
- Build reproducible test harnesses.
- Create meaningful comparison reports.
- Understand statistical significance in evaluations.
2. Theoretical Foundation
2.1 Core Concepts
- Benchmark Design: Creating representative test cases that stress different aspects of memory systems.
- Evaluation Metrics (a minimal computation sketch follows this list):
  - Accuracy: Is the retrieved information correct?
  - Recall@K: What fraction of relevant items are in the top K?
  - Latency: How fast is retrieval? (P50, P95, P99)
  - Temporal Accuracy: Are time-based queries correct?
- Reproducibility: The same benchmark should produce the same results given the same system state.
- Statistical Significance: Are differences between systems meaningful or due to chance?
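To make the metric definitions concrete, here is a minimal sketch of Recall@K and latency percentiles using numpy; the function and variable names are illustrative, not part of the suite's API.

```python
import numpy as np

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

# Example: 2 of 3 relevant facts appear in the top 5 results -> recall@5 = 0.67
print(recall_at_k(["f1", "f9", "f2", "f7", "f5"], {"f1", "f2", "f3"}, k=5))

# Latency percentiles over repeated query timings (milliseconds)
latencies_ms = np.array([42, 45, 51, 48, 120, 44, 47, 300, 49, 46])
p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms")
```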
2.2 Why This Matters
Without benchmarks, you can’t answer:
- “Is Graphiti better than Mem0 for my use case?”
- “Does adding graph retrieval improve accuracy?”
- “What’s the latency cost of temporal queries?”
Rigorous evaluation enables informed decisions.
2.3 Common Misconceptions
- “Just compare accuracy.” Latency matters too; 99% accuracy at 10s latency is unusable.
- “Run once and report.” Multiple runs needed for statistical confidence.
- “All benchmarks are equal.” Benchmark quality determines evaluation quality.
2.4 ASCII Diagram: Benchmark Architecture
MEMORY BENCHMARK SUITE ARCHITECTURE
══════════════════════════════════════════════════════════════
┌─────────────────────────────────────────────────────────────┐
│ BENCHMARK DATASETS │
├─────────────────────────────────────────────────────────────┤
│ │
│ CONVERSATIONAL MEMORY │
│ ──────────────────── │
│ • 1000 synthetic conversations │
│ • Ground truth: extracted facts, relationships │
│ • Temporal annotations: when facts were mentioned │
│ │
│ ENTITY RESOLUTION │
│ ───────────────── │
│ • 500 entity pairs (same/different labels) │
│ • Varying similarity levels │
│ • Cross-conversation deduplication │
│ │
│ TEMPORAL REASONING │
│ ────────────────── │
│ • 200 temporal queries with expected answers │
│ • "What was X before Y happened?" │
│ • Point-in-time reconstruction tests │
│ │
│ RETRIEVAL QUALITY │
│ ───────────────── │
│ • 300 queries with relevance judgments │
│ • Semantic, keyword, and graph traversal queries │
│ • Multiple difficulty levels │
│ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ TEST HARNESS │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ SYSTEM ADAPTER │ │
│ │ │ │
│ │ Uniform interface for different memory systems: │ │
│ │ • adapter.ingest(episodes) │ │
│ │ • adapter.search(query) │ │
│ │ • adapter.get_entity(id) │ │
│ │ • adapter.temporal_query(query, at_time) │ │
│ │ │ │
│ │ Implemented adapters: │ │
│ │ • GraphitiAdapter │ │
│ │ • Mem0Adapter │ │
│ │ • CustomMemoryAdapter (your implementation) │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ TEST RUNNER │ │
│ │ │ │
│ │ For each (dataset, system): │ │
│ │ 1. Reset system state │ │
│ │ 2. Ingest benchmark data │ │
│ │ 3. Run test queries │ │
│ │ 4. Collect results + timing │ │
│ │ 5. Repeat N times for statistics │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ METRICS COLLECTOR │
├─────────────────────────────────────────────────────────────┤
│ │
│ RETRIEVAL METRICS │
│ ───────────────── │
│ • Precision@K: relevant / retrieved │
│ • Recall@K: retrieved_relevant / total_relevant │
│ • MRR: 1/rank of first relevant │
│ • NDCG@K: normalized discounted cumulative gain │
│ │
│ TEMPORAL METRICS │
│ ──────────────── │
│ • Temporal accuracy: % correct time-based queries │
│ • Point-in-time precision: correct state reconstruction │
│ • Edge validity: % edges with correct valid_from/to │
│ │
│ LATENCY METRICS │
│ ─────────────── │
│ • P50, P95, P99 latency │
│ • Ingestion throughput (episodes/second) │
│ • Query throughput (queries/second) │
│ │
│ QUALITY METRICS │
│ ─────────────── │
│ • Entity extraction F1 │
│ • Relationship extraction F1 │
│ • Entity resolution accuracy │
│ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ REPORT GENERATOR │
├─────────────────────────────────────────────────────────────┤
│ │
│ COMPARISON TABLE │
│ ──────────────── │
│ ┌────────────────┬──────────┬──────────┬──────────┐ │
│ │ Metric │ Graphiti │ Mem0 │ Custom │ │
│ ├────────────────┼──────────┼──────────┼──────────┤ │
│ │ Recall@10 │ 0.85 │ 0.78 │ 0.82 │ │
│ │ P99 Latency │ 120ms │ 80ms │ 95ms │ │
│ │ Temporal Acc │ 0.92 │ 0.71 │ 0.88 │ │
│ │ Entity F1 │ 0.89 │ 0.85 │ 0.87 │ │
│ └────────────────┴──────────┴──────────┴──────────┘ │
│ │
│ STATISTICAL ANALYSIS │
│ ──────────────────── │
│ • 95% confidence intervals for all metrics │
│ • Paired t-tests for system comparisons │
│ • Effect size (Cohen's d) for significant differences │
│ │
│ VISUALIZATIONS │
│ ────────────── │
│ • Latency distribution histograms │
│ • Precision-recall curves │
│ • Radar charts for multi-dimensional comparison │
│ │
└─────────────────────────────────────────────────────────────┘
BENCHMARK DATASET EXAMPLE
═════════════════════════
RETRIEVAL BENCHMARK ENTRY
─────────────────────────
{
"id": "query_001",
"category": "semantic",
"difficulty": "medium",
"query": "What programming languages does Alice prefer?",
"relevant_facts": [
"fact_123", // Alice prefers Python
"fact_456", // Alice uses Python for APIs
"fact_789" // Alice mentioned liking Rust
],
"expected_entities": ["Alice", "Python", "Rust"],
"context": "Conversation from December 2024 about preferences"
}
TEMPORAL BENCHMARK ENTRY
────────────────────────
{
"id": "temporal_001",
"category": "point_in_time",
"query": "What was Alice's job title before the promotion?",
"reference_time": "2024-10-01", // Query as of this date
"expected_answer": "Software Engineer",
"ground_truth": {
"fact": "Alice has job title Senior Engineer",
"valid_from": "2024-11-15", // Promotion date
"previous_fact": "Alice has job title Software Engineer",
"previous_valid_from": "2022-03-01"
}
}
ENTITY RESOLUTION BENCHMARK ENTRY
─────────────────────────────────
{
"id": "entity_001",
"entity_a": {"name": "Alice Smith", "type": "Person", "context": "from episode 23"},
"entity_b": {"name": "A. Smith", "type": "Person", "context": "from episode 45"},
"label": "same", // or "different"
"confidence": 1.0, // Human annotation confidence
"reasoning": "Same person, abbreviated name in later conversation"
}
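Entries like these can be stored as plain JSON (without the inline // comments, which standard JSON does not allow) and validated when a dataset is loaded. A minimal loader sketch, assuming a file containing a list of retrieval entries shaped like the one above:

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"id", "category", "difficulty", "query", "relevant_facts"}

def load_retrieval_dataset(path: str) -> list[dict]:
    """Load retrieval benchmark entries and check that required fields are present."""
    entries = json.loads(Path(path).read_text())   # expects a JSON list of entry objects
    for entry in entries:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"Entry {entry.get('id', '?')} is missing fields: {missing}")
    return entries

# Usage (path matches the project layout in Section 5.2):
# dataset = load_retrieval_dataset("src/datasets/retrieval.json")
```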
3. Project Specification
3.1 What You Will Build
A Python benchmark suite that:
- Provides curated test datasets
- Adapts to different memory systems
- Measures comprehensive metrics
- Generates comparison reports
3.2 Functional Requirements
- Load dataset: suite.load_dataset("retrieval") → Dataset
- Register system: suite.register(name, adapter)
- Run benchmark: suite.run(datasets, systems) → Results (a run-loop sketch follows this list)
- Compute metrics: suite.compute_metrics(results) → Metrics
- Generate report: suite.report(metrics) → HTML/Markdown
- Compare systems: suite.compare(system_a, system_b) → Comparison
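A minimal skeleton showing how register and run could fit together, including the reset → ingest → query → time loop from the test-harness diagram. The adapter's reset() method and the dataset's episodes/test_cases attributes are illustrative assumptions, not a fixed API.

```python
import time

class BenchmarkSuite:
    """Partial sketch: only register() and run() are shown here."""

    def __init__(self):
        self.systems = {}   # name -> adapter

    def register(self, name, adapter):
        self.systems[name] = adapter

    def run(self, datasets, systems, runs=5):
        """For each (dataset, system): reset, ingest, query, and time, repeated `runs` times."""
        results = []
        for dataset in datasets:
            for name in systems:
                adapter = self.systems[name]
                for _ in range(runs):
                    adapter.reset()                      # assumed helper to clear system state
                    adapter.ingest(dataset.episodes)     # assumed dataset attribute
                    for case in dataset.test_cases:      # assumed dataset attribute
                        start = time.perf_counter()
                        response = adapter.search(case.query)
                        latency_ms = (time.perf_counter() - start) * 1000
                        results.append({"system": name, "test_id": case.id,
                                        "response": response, "latency_ms": latency_ms})
        return results
```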
3.3 Example Usage / Output
from memory_benchmark import BenchmarkSuite, GraphitiAdapter, Mem0Adapter, CustomMemoryAdapter
# Initialize suite
suite = BenchmarkSuite()
# Register systems to benchmark
suite.register("graphiti", GraphitiAdapter(config))
suite.register("mem0", Mem0Adapter(config))
suite.register("custom", CustomMemoryAdapter(my_memory))
# Load benchmark datasets
retrieval_data = suite.load_dataset("retrieval")
temporal_data = suite.load_dataset("temporal")
# Run benchmarks
results = suite.run(
datasets=[retrieval_data, temporal_data],
systems=["graphiti", "mem0", "custom"],
runs=5 # Multiple runs for statistics
)
# Compute metrics
metrics = suite.compute_metrics(results)
print(metrics.summary())
# ┌────────────────┬──────────┬──────────┬──────────┐
# │ Metric │ graphiti │ mem0 │ custom │
# ├────────────────┼──────────┼──────────┼──────────┤
# │ Recall@10 │ 0.85±0.02│ 0.78±0.03│ 0.82±0.02│
# │ MRR │ 0.72±0.03│ 0.68±0.04│ 0.70±0.03│
# │ Temporal Acc │ 0.92±0.01│ 0.71±0.02│ 0.88±0.02│
# │ P99 Latency │ 120ms │ 80ms │ 95ms │
# │ Entity F1 │ 0.89±0.02│ 0.85±0.02│ 0.87±0.02│
# └────────────────┴──────────┴──────────┴──────────┘
# Statistical comparison
comparison = suite.compare("graphiti", "mem0")
print(comparison)
# Recall@10: graphiti significantly better (p=0.003, d=0.82)
# Temporal Acc: graphiti significantly better (p<0.001, d=1.45)
# P99 Latency: mem0 significantly better (p=0.012, d=0.65)
# Generate full report
suite.report(metrics, output="benchmark_report.html")
print("Report generated: benchmark_report.html")
# Per-category breakdown
print(metrics.by_category("retrieval"))
# ┌────────────────┬──────────┬──────────┬──────────┐
# │ Query Type │ graphiti │ mem0 │ custom │
# ├────────────────┼──────────┼──────────┼──────────┤
# │ Semantic │ 0.88 │ 0.82 │ 0.85 │
# │ Keyword │ 0.75 │ 0.85 │ 0.80 │
# │ Graph-based │ 0.92 │ 0.68 │ 0.81 │
# └────────────────┴──────────┴──────────┴──────────┘
4. Solution Architecture
4.1 High-Level Design
┌───────────────┐ ┌───────────────┐
│ Datasets │────▶│ Test Runner │
└───────────────┘ └───────┬───────┘
│
┌───────────────┼───────────────┐
│ │ │
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Adapter 1 │ │ Adapter 2 │ │ Adapter 3 │
│ (Graphiti)│ │ (Mem0) │ │ (Custom) │
└───────────┘ └───────────┘ └───────────┘
│ │ │
└───────────────┼───────────────┘
│
▼
┌───────────────┐
│ Metrics │
│ Calculator │
└───────┬───────┘
│
▼
┌───────────────┐
│ Report │
│ Generator │
└───────────────┘
4.2 Key Components
| Component | Responsibility | Technology |
|---|---|---|
| BenchmarkSuite | Orchestrate benchmarks | Python class |
| Dataset | Store test cases and ground truth | JSON/Parquet |
| Adapter | Interface to memory system | Abstract class |
| MetricsCalculator | Compute evaluation metrics | scipy, numpy |
| ReportGenerator | Create HTML/Markdown reports | Jinja2, matplotlib |
4.3 Data Models
from pydantic import BaseModel
from typing import Literal
class TestCase(BaseModel):
id: str
category: str
difficulty: Literal["easy", "medium", "hard"]
query: str
ground_truth: dict
class BenchmarkResult(BaseModel):
system: str
test_id: str
response: dict
latency_ms: float
timestamp: str
class MetricResult(BaseModel):
metric_name: str
system: str
value: float
std_dev: float
confidence_interval: tuple[float, float]
class Comparison(BaseModel):
system_a: str
system_b: str
metric: str
p_value: float
effect_size: float
winner: str | None
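As one way to populate the Comparison model, a paired t-test plus Cohen's d can be computed with scipy. This is a sketch under the assumption that both systems were scored per test case, in the same order, and that higher scores are better for the metric.

```python
import numpy as np
from scipy import stats

def compare_systems(metric: str, system_a: str, scores_a: list[float],
                    system_b: str, scores_b: list[float],
                    alpha: float = 0.05) -> Comparison:
    """Paired t-test over per-test-case scores; Cohen's d computed on the paired differences."""
    a, b = np.asarray(scores_a, dtype=float), np.asarray(scores_b, dtype=float)
    result = stats.ttest_rel(a, b)                 # paired (related-samples) t-test
    diff = a - b
    sd = diff.std(ddof=1)
    effect_size = abs(diff.mean() / sd) if sd > 0 else 0.0   # Cohen's d for paired samples
    winner = None
    if result.pvalue < alpha:                      # assumes higher is better for this metric
        winner = system_a if diff.mean() > 0 else system_b
    return Comparison(system_a=system_a, system_b=system_b, metric=metric,
                      p_value=float(result.pvalue), effect_size=float(effect_size),
                      winner=winner)
```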
5. Implementation Guide
5.1 Development Environment Setup
mkdir memory-benchmark && cd memory-benchmark
python -m venv .venv && source .venv/bin/activate
pip install numpy scipy pandas matplotlib jinja2 pydantic
5.2 Project Structure
memory-benchmark/
├── src/
│ ├── suite.py # BenchmarkSuite main class
│ ├── datasets/
│ │ ├── retrieval.json
│ │ ├── temporal.json
│ │ └── entity.json
│ ├── adapters/
│ │ ├── base.py # Abstract adapter
│ │ ├── graphiti.py
│ │ ├── mem0.py
│ │ └── custom.py
│ ├── metrics/
│ │ ├── retrieval.py # Precision, recall, MRR
│ │ ├── temporal.py # Temporal accuracy
│ │ └── statistics.py # Significance tests
│ ├── reports/
│ │ ├── generator.py
│ │ └── templates/
│ └── models.py
├── tests/
│ └── test_metrics.py
└── README.md
5.3 Implementation Phases
Phase 1: Datasets and Adapters (10-12h)
Goals:
- Curate benchmark datasets
- Build system adapter interface
Tasks:
- Create retrieval benchmark dataset (300+ queries)
- Create temporal benchmark dataset (200+ queries)
- Design adapter interface
- Implement at least 2 adapters
Checkpoint: Can run queries against two systems.
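One possible shape for the adapter interface designed in this phase. The method names mirror the system-adapter box in the Section 2.4 diagram; reset() is an addition assumed by the test runner, and the argument types are illustrative.

```python
from abc import ABC, abstractmethod

class MemoryAdapter(ABC):
    """Uniform interface the test runner uses to talk to any memory system."""

    @abstractmethod
    def reset(self) -> None:
        """Clear all stored memory so each run starts from a clean state."""

    @abstractmethod
    def ingest(self, episodes: list[dict]) -> None:
        """Load benchmark episodes into the memory system."""

    @abstractmethod
    def search(self, query: str, limit: int = 10) -> list[dict]:
        """Return ranked results for a query."""

    @abstractmethod
    def get_entity(self, entity_id: str) -> dict | None:
        """Fetch a single entity by id."""

    @abstractmethod
    def temporal_query(self, query: str, at_time: str) -> dict:
        """Answer a query as of a specific point in time."""
```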
Phase 2: Metrics Calculation (10-12h)
Goals:
- Compute comprehensive metrics
- Add statistical analysis
Tasks:
- Implement retrieval metrics (P@K, R@K, MRR, NDCG)
- Implement temporal metrics
- Implement latency metrics
- Add statistical significance tests
Checkpoint: Metrics computed for all systems.
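Recall@K was sketched in Section 2.1; MRR and NDCG@K can be implemented along the same lines. Function names here are illustrative, not the suite's API.

```python
import math

def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1/rank of the first relevant result (0.0 if none); average over queries to get MRR."""
    for rank, item in enumerate(retrieved_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Binary-relevance NDCG@K: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(retrieved_ids[:k], start=1)
              if item in relevant_ids)
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```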
Phase 3: Reporting (8-10h)
Goals:
- Generate comparison reports
- Create visualizations
Tasks:
- Build HTML report template
- Create comparison tables
- Add visualizations (charts, graphs)
- Add export to Markdown/CSV
Checkpoint: Full benchmark report generated.
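A minimal report-generation sketch using a Jinja2 template. The inline template string and the metric data shape are placeholders for the real templates/ directory and metrics objects.

```python
from jinja2 import Template

REPORT_TEMPLATE = Template("""
<h1>Memory Benchmark Report</h1>
<table border="1">
  <tr><th>Metric</th>{% for system in systems %}<th>{{ system }}</th>{% endfor %}</tr>
  {% for metric, values in metrics.items() %}
  <tr><td>{{ metric }}</td>{% for system in systems %}<td>{{ values[system] }}</td>{% endfor %}</tr>
  {% endfor %}
</table>
""")

def write_report(metrics: dict[str, dict[str, str]], systems: list[str], output: str) -> None:
    """Render the comparison table to an HTML file."""
    html = REPORT_TEMPLATE.render(metrics=metrics, systems=systems)
    with open(output, "w") as f:
        f.write(html)

# Usage:
# write_report({"Recall@10": {"graphiti": "0.85", "mem0": "0.78"}},
#              systems=["graphiti", "mem0"], output="benchmark_report.html")
```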
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | Test metrics | Precision calculation |
| Integration | Test full pipeline | Dataset → report |
| Validation | Test benchmark quality | Known-result datasets |
6.2 Critical Test Cases
- Metrics correctness: Known inputs produce expected metrics
- Reproducibility: Same run produces same results
- Statistical tests: P-values computed correctly
- Report generation: All sections populated
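A couple of these cases as pytest sketches, exercising the recall and NDCG helpers sketched earlier; the module path in the import is an assumed layout, not a fixed one.

```python
# tests/test_metrics.py
import pytest

from src.metrics.retrieval import recall_at_k, ndcg_at_k  # assumed module layout

def test_recall_at_k_known_values():
    # 2 of 3 relevant items appear in the top 5 -> recall = 2/3
    assert recall_at_k(["a", "x", "b", "y", "z"], {"a", "b", "c"}, k=5) == pytest.approx(2 / 3)

def test_perfect_ranking_has_ndcg_of_one():
    # All relevant items ranked first -> NDCG@K is exactly 1.0
    assert ndcg_at_k(["a", "b", "x"], {"a", "b"}, k=3) == pytest.approx(1.0)

def test_metrics_are_reproducible():
    # Same inputs must always give the same score (no hidden randomness)
    args = (["a", "x", "b"], {"a", "b"}, 3)
    assert recall_at_k(*args) == recall_at_k(*args)
```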
7. Common Pitfalls & Debugging
| Pitfall | Symptom | Solution |
|---|---|---|
| Biased datasets | Misleading results | Balance difficulty levels |
| Too few runs | High variance | Run 10+ times |
| Wrong ground truth | Incorrect metrics | Manual verification |
| Ignoring variance | False comparisons | Always report std dev |
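For the variance-related pitfalls, report a mean with a confidence interval across runs rather than a single number. A minimal sketch using scipy (function name is illustrative):

```python
import numpy as np
from scipy import stats

def mean_with_ci(run_scores: list[float], confidence: float = 0.95) -> tuple[float, float, float]:
    """Mean of per-run scores plus a t-distribution confidence interval (mean, lower, upper)."""
    scores = np.asarray(run_scores, dtype=float)
    mean = scores.mean()
    sem = stats.sem(scores)                                        # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, df=len(scores) - 1)
    return float(mean), float(mean - half_width), float(mean + half_width)

# Five runs of Recall@10 for one system: report 0.85 ± CI, not a single point estimate
print(mean_with_ci([0.84, 0.86, 0.85, 0.83, 0.87]))
```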
8. Extensions & Challenges
8.1 Beginner Extensions
- Add more benchmark datasets
- Implement additional adapters
8.2 Intermediate Extensions
- Add automated dataset generation
- Implement regression testing
8.3 Advanced Extensions
- Add continuous benchmarking (CI/CD)
- Implement multi-language support
9. Real-World Connections
9.1 Industry Applications
- MTEB: Massive Text Embedding Benchmark
- BEIR: Benchmarking IR datasets
- LangChain Evaluations: LLM application testing
9.2 Interview Relevance
- Explain evaluation metrics and their tradeoffs
- Discuss statistical significance in ML
- Describe benchmark design principles
10. Resources
10.1 Essential Reading
- “AI Engineering” by Chip Huyen — Ch. on Evaluation
- MTEB Paper — Embedding benchmark methodology
- “Statistical Methods” papers — Significance testing
10.2 Related Projects
- Previous: Project 14 (Production Memory Service)
- This is the capstone project of the series
11. Self-Assessment Checklist
- I can design fair benchmark datasets
- I understand evaluation metrics (P@K, R@K, MRR)
- I know how to compute statistical significance
- I can create meaningful comparison reports
12. Submission / Completion Criteria
Minimum Viable Completion:
- 2+ benchmark datasets
- 2+ system adapters
- Basic metrics (P@K, R@K)
Full Completion:
- All planned datasets
- Comprehensive metrics
- HTML report generation
Excellence:
- Statistical significance testing
- Visualization suite
- Continuous benchmarking setup