Project 15: Memory Benchmark Suite

Build a comprehensive benchmark suite that evaluates memory systems on accuracy, recall, latency, and temporal reasoning, with reproducible tests and comparison reports.

Quick Reference

Attribute      Value
─────────      ─────
Difficulty     Level 4: Expert
Time Estimate  2-3 weeks (30-40 hours)
Language       Python
Prerequisites  Projects 1-14, evaluation methodology, statistics
Key Topics     Benchmarking, evaluation metrics, reproducibility, statistical significance, comparative analysis

1. Learning Objectives

By completing this project, you will:

  1. Design benchmarks that measure memory system quality.
  2. Implement metrics for accuracy, recall, latency, and temporal reasoning.
  3. Build reproducible test harnesses.
  4. Create meaningful comparison reports.
  5. Understand statistical significance in evaluations.

2. Theoretical Foundation

2.1 Core Concepts

  • Benchmark Design: Creating representative test cases that stress different aspects of memory systems.

  • Evaluation Metrics (a code sketch follows this list):
    • Accuracy: Is the retrieved information correct?
    • Recall@K: What fraction of the relevant items appear in the top K results?
    • Latency: How fast is retrieval? (P50, P95, P99)
    • Temporal Accuracy: Are time-based queries answered correctly?
  • Reproducibility: Same benchmark should produce same results given same system state.

  • Statistical Significance: Are differences between systems meaningful or due to chance?

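To make these metrics concrete, here is a minimal sketch of Recall@K and the latency percentiles from the list above, plus MRR, which appears in the metrics collector later. The function names are illustrative rather than part of any particular library; numpy is used only for the percentile calculation.

import numpy as np

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top K results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant result (0.0 if none is found)."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """P50/P95/P99 over per-query latencies measured in milliseconds."""
    return {name: float(np.percentile(latencies_ms, q))
            for name, q in [("p50", 50), ("p95", 95), ("p99", 99)]}
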
2.2 Why This Matters

Without benchmarks, you can’t answer:

  • “Is Graphiti better than Mem0 for my use case?”
  • “Does adding graph retrieval improve accuracy?”
  • “What’s the latency cost of temporal queries?”

Rigorous evaluation enables informed decisions.

2.3 Common Misconceptions

  • “Just compare accuracy.” Latency matters too; 99% accuracy at 10s latency is unusable.
  • “Run once and report.” Multiple runs needed for statistical confidence.
  • “All benchmarks are equal.” Benchmark quality determines evaluation quality.

2.4 ASCII Diagram: Benchmark Architecture

MEMORY BENCHMARK SUITE ARCHITECTURE
══════════════════════════════════════════════════════════════

┌─────────────────────────────────────────────────────────────┐
│                    BENCHMARK DATASETS                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  CONVERSATIONAL MEMORY                                       │
│  ────────────────────                                        │
│  • 1000 synthetic conversations                             │
│  • Ground truth: extracted facts, relationships             │
│  • Temporal annotations: when facts were mentioned          │
│                                                              │
│  ENTITY RESOLUTION                                           │
│  ─────────────────                                          │
│  • 500 entity pairs (same/different labels)                 │
│  • Varying similarity levels                                │
│  • Cross-conversation deduplication                         │
│                                                              │
│  TEMPORAL REASONING                                          │
│  ──────────────────                                         │
│  • 200 temporal queries with expected answers               │
│  • "What was X before Y happened?"                          │
│  • Point-in-time reconstruction tests                       │
│                                                              │
│  RETRIEVAL QUALITY                                           │
│  ─────────────────                                          │
│  • 300 queries with relevance judgments                     │
│  • Semantic, keyword, and graph traversal queries           │
│  • Multiple difficulty levels                               │
│                                                              │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                    TEST HARNESS                              │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                 SYSTEM ADAPTER                         │  │
│  │                                                        │  │
│  │  Uniform interface for different memory systems:       │  │
│  │  • adapter.ingest(episodes)                           │  │
│  │  • adapter.search(query)                              │  │
│  │  • adapter.get_entity(id)                             │  │
│  │  • adapter.temporal_query(query, at_time)             │  │
│  │                                                        │  │
│  │  Implemented adapters:                                 │  │
│  │  • GraphitiAdapter                                    │  │
│  │  • Mem0Adapter                                        │  │
│  │  • CustomMemoryAdapter (your implementation)          │  │
│  └───────────────────────────────────────────────────────┘  │
│                            │                                 │
│                            ▼                                 │
│  ┌───────────────────────────────────────────────────────┐  │
│  │                  TEST RUNNER                           │  │
│  │                                                        │  │
│  │  For each (dataset, system):                          │  │
│  │    1. Reset system state                              │  │
│  │    2. Ingest benchmark data                           │  │
│  │    3. Run test queries                                │  │
│  │    4. Collect results + timing                        │  │
│  │    5. Repeat N times for statistics                   │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                              │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                   METRICS COLLECTOR                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  RETRIEVAL METRICS                                           │
│  ─────────────────                                          │
│  • Precision@K: relevant / retrieved                        │
│  • Recall@K: retrieved_relevant / total_relevant            │
│  • MRR: 1/rank of first relevant                            │
│  • NDCG@K: normalized discounted cumulative gain            │
│                                                              │
│  TEMPORAL METRICS                                            │
│  ────────────────                                           │
│  • Temporal accuracy: % correct time-based queries          │
│  • Point-in-time precision: correct state reconstruction    │
│  • Edge validity: % edges with correct valid_from/to        │
│                                                              │
│  LATENCY METRICS                                             │
│  ───────────────                                            │
│  • P50, P95, P99 latency                                    │
│  • Ingestion throughput (episodes/second)                   │
│  • Query throughput (queries/second)                        │
│                                                              │
│  QUALITY METRICS                                             │
│  ───────────────                                            │
│  • Entity extraction F1                                     │
│  • Relationship extraction F1                               │
│  • Entity resolution accuracy                               │
│                                                              │
└─────────────────────────────────────────────────────────────┘
                            │
                            ▼
┌─────────────────────────────────────────────────────────────┐
│                    REPORT GENERATOR                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  COMPARISON TABLE                                            │
│  ────────────────                                           │
│  ┌────────────────┬──────────┬──────────┬──────────┐       │
│  │ Metric         │ Graphiti │  Mem0    │ Custom   │       │
│  ├────────────────┼──────────┼──────────┼──────────┤       │
│  │ Recall@10      │  0.85    │  0.78    │  0.82    │       │
│  │ P99 Latency    │  120ms   │  80ms    │  95ms    │       │
│  │ Temporal Acc   │  0.92    │  0.71    │  0.88    │       │
│  │ Entity F1      │  0.89    │  0.85    │  0.87    │       │
│  └────────────────┴──────────┴──────────┴──────────┘       │
│                                                              │
│  STATISTICAL ANALYSIS                                        │
│  ────────────────────                                       │
│  • 95% confidence intervals for all metrics                 │
│  • Paired t-tests for system comparisons                    │
│  • Effect size (Cohen's d) for significant differences      │
│                                                              │
│  VISUALIZATIONS                                              │
│  ──────────────                                             │
│  • Latency distribution histograms                          │
│  • Precision-recall curves                                  │
│  • Radar charts for multi-dimensional comparison            │
│                                                              │
└─────────────────────────────────────────────────────────────┘


BENCHMARK DATASET EXAMPLE
═════════════════════════

RETRIEVAL BENCHMARK ENTRY
─────────────────────────
{
  "id": "query_001",
  "category": "semantic",
  "difficulty": "medium",
  "query": "What programming languages does Alice prefer?",
  "relevant_facts": [
    "fact_123",  // Alice prefers Python
    "fact_456",  // Alice uses Python for APIs
    "fact_789"   // Alice mentioned liking Rust
  ],
  "expected_entities": ["Alice", "Python", "Rust"],
  "context": "Conversation from December 2024 about preferences"
}

TEMPORAL BENCHMARK ENTRY
────────────────────────
{
  "id": "temporal_001",
  "category": "point_in_time",
  "query": "What was Alice's job title before the promotion?",
  "reference_time": "2024-10-01",  // Query as of this date
  "expected_answer": "Software Engineer",
  "ground_truth": {
    "fact": "Alice has job title Senior Engineer",
    "valid_from": "2024-11-15",  // Promotion date
    "previous_fact": "Alice has job title Software Engineer",
    "previous_valid_from": "2022-03-01"
  }
}

ENTITY RESOLUTION BENCHMARK ENTRY
─────────────────────────────────
{
  "id": "entity_001",
  "entity_a": {"name": "Alice Smith", "type": "Person", "context": "from episode 23"},
  "entity_b": {"name": "A. Smith", "type": "Person", "context": "from episode 45"},
  "label": "same",  // or "different"
  "confidence": 1.0,  // Human annotation confidence
  "reasoning": "Same person, abbreviated name in later conversation"
}

3. Project Specification

3.1 What You Will Build

A Python benchmark suite that:

  • Provides curated test datasets
  • Adapts to different memory systems
  • Measures comprehensive metrics
  • Generates comparison reports

3.2 Functional Requirements

  1. Load dataset: suite.load_dataset("retrieval") → Dataset
  2. Register system: suite.register(name, adapter)
  3. Run benchmark: suite.run(dataset, systems) → Results
  4. Compute metrics: suite.compute_metrics(results) → Metrics
  5. Generate report: suite.report(metrics) → HTML/Markdown
  6. Compare systems: suite.compare(system_a, system_b) → Comparison

3.3 Example Usage / Output

from memory_benchmark import BenchmarkSuite, GraphitiAdapter, Mem0Adapter, CustomMemoryAdapter

# Initialize suite
suite = BenchmarkSuite()

# Register systems to benchmark
# (`config` and `my_memory` are placeholders for your backend settings and memory implementation)
suite.register("graphiti", GraphitiAdapter(config))
suite.register("mem0", Mem0Adapter(config))
suite.register("custom", CustomMemoryAdapter(my_memory))

# Load benchmark datasets
retrieval_data = suite.load_dataset("retrieval")
temporal_data = suite.load_dataset("temporal")

# Run benchmarks
results = suite.run(
    datasets=[retrieval_data, temporal_data],
    systems=["graphiti", "mem0", "custom"],
    runs=5  # Multiple runs for statistics
)

# Compute metrics
metrics = suite.compute_metrics(results)

print(metrics.summary())
# ┌────────────────┬──────────┬──────────┬──────────┐
# │ Metric         │ graphiti │   mem0   │  custom  │
# ├────────────────┼──────────┼──────────┼──────────┤
# │ Recall@10      │ 0.85±0.02│ 0.78±0.03│ 0.82±0.02│
# │ MRR            │ 0.72±0.03│ 0.68±0.04│ 0.70±0.03│
# │ Temporal Acc   │ 0.92±0.01│ 0.71±0.02│ 0.88±0.02│
# │ P99 Latency    │  120ms   │   80ms   │   95ms   │
# │ Entity F1      │ 0.89±0.02│ 0.85±0.02│ 0.87±0.02│
# └────────────────┴──────────┴──────────┴──────────┘

# Statistical comparison
comparison = suite.compare("graphiti", "mem0")
print(comparison)
# Recall@10: graphiti significantly better (p=0.003, d=0.82)
# Temporal Acc: graphiti significantly better (p<0.001, d=1.45)
# P99 Latency: mem0 significantly better (p=0.012, d=0.65)

# Generate full report
suite.report(metrics, output="benchmark_report.html")
print("Report generated: benchmark_report.html")

# Per-category breakdown
print(metrics.by_category("retrieval"))
# ┌────────────────┬──────────┬──────────┬──────────┐
# │ Query Type     │ graphiti │   mem0   │  custom  │
# ├────────────────┼──────────┼──────────┼──────────┤
# │ Semantic       │   0.88   │   0.82   │   0.85   │
# │ Keyword        │   0.75   │   0.85   │   0.80   │
# │ Graph-based    │   0.92   │   0.68   │   0.81   │
# └────────────────┴──────────┴──────────┴──────────┘

4. Solution Architecture

4.1 High-Level Design

┌───────────────┐     ┌───────────────┐
│   Datasets    │────▶│  Test Runner  │
└───────────────┘     └───────┬───────┘
                              │
              ┌───────────────┼───────────────┐
              │               │               │
              ▼               ▼               ▼
       ┌───────────┐   ┌───────────┐   ┌───────────┐
       │ Adapter 1 │   │ Adapter 2 │   │ Adapter 3 │
       │ (Graphiti)│   │  (Mem0)   │   │ (Custom)  │
       └───────────┘   └───────────┘   └───────────┘
              │               │               │
              └───────────────┼───────────────┘
                              │
                              ▼
                    ┌───────────────┐
                    │   Metrics     │
                    │   Calculator  │
                    └───────┬───────┘
                            │
                            ▼
                    ┌───────────────┐
                    │    Report     │
                    │   Generator   │
                    └───────────────┘

4.2 Key Components

Component          Responsibility                      Technology
─────────          ──────────────                      ──────────
BenchmarkSuite     Orchestrate benchmarks              Python class
Dataset            Store test cases and ground truth   JSON/Parquet
Adapter            Interface to memory system          Abstract class
MetricsCalculator  Compute evaluation metrics          scipy, numpy
ReportGenerator    Create HTML/Markdown reports        Jinja2, matplotlib

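The Adapter row above is what keeps the suite system-agnostic. A possible shape for it, sketched under the assumption that the method names match the architecture diagram (ingest, search, temporal_query) plus a reset hook for reproducible runs:

from abc import ABC, abstractmethod

class MemorySystemAdapter(ABC):
    """Uniform interface the test runner uses for every memory system."""

    @abstractmethod
    def reset(self) -> None:
        """Clear stored state so each benchmark run starts from scratch."""

    @abstractmethod
    def ingest(self, episodes: list[dict]) -> None:
        """Load benchmark episodes into the memory system."""

    @abstractmethod
    def search(self, query: str, limit: int = 10) -> list[dict]:
        """Return ranked results for a query."""

    @abstractmethod
    def temporal_query(self, query: str, at_time: str) -> dict:
        """Answer a query as of the given reference time."""

GraphitiAdapter, Mem0Adapter, and CustomMemoryAdapter then subclass this and translate each call into the underlying system's API.
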
4.3 Data Models

from pydantic import BaseModel
from typing import Literal

class TestCase(BaseModel):
    id: str
    category: str
    difficulty: Literal["easy", "medium", "hard"]
    query: str
    ground_truth: dict

class BenchmarkResult(BaseModel):
    system: str
    test_id: str
    response: dict
    latency_ms: float
    timestamp: str

class MetricResult(BaseModel):
    metric_name: str
    system: str
    value: float
    std_dev: float
    confidence_interval: tuple[float, float]

class Comparison(BaseModel):
    system_a: str
    system_b: str
    metric: str
    p_value: float
    effect_size: float
    winner: str | None

5. Implementation Guide

5.1 Development Environment Setup

mkdir memory-benchmark && cd memory-benchmark
python -m venv .venv && source .venv/bin/activate
pip install numpy scipy pandas matplotlib jinja2 pydantic

5.2 Project Structure

memory-benchmark/
├── src/
│   ├── suite.py            # BenchmarkSuite main class
│   ├── datasets/
│   │   ├── retrieval.json
│   │   ├── temporal.json
│   │   └── entity.json
│   ├── adapters/
│   │   ├── base.py         # Abstract adapter
│   │   ├── graphiti.py
│   │   ├── mem0.py
│   │   └── custom.py
│   ├── metrics/
│   │   ├── retrieval.py    # Precision, recall, MRR
│   │   ├── temporal.py     # Temporal accuracy
│   │   └── statistics.py   # Significance tests
│   ├── reports/
│   │   ├── generator.py
│   │   └── templates/
│   └── models.py
├── tests/
│   └── test_metrics.py
└── README.md

5.3 Implementation Phases

Phase 1: Datasets and Adapters (10-12h)

Goals:

  • Curate benchmark datasets
  • Build system adapter interface

Tasks:

  1. Create retrieval benchmark dataset (300+ queries)
  2. Create temporal benchmark dataset (200+ queries)
  3. Design adapter interface
  4. Implement at least 2 adapters

Checkpoint: Can run queries against two systems.
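
A minimal sketch of the test runner loop from the architecture diagram, which this checkpoint exercises. BenchmarkResult is the model from section 4.3; the dataset.episodes and dataset.test_cases attributes are assumptions about how the Dataset class might be organized.

import time
from datetime import datetime, timezone

from models import BenchmarkResult  # defined in section 4.3 (src/models.py)

def run_once(adapter, dataset, system_name: str) -> list[BenchmarkResult]:
    adapter.reset()                       # 1. reset system state
    adapter.ingest(dataset.episodes)      # 2. ingest benchmark data
    results = []
    for case in dataset.test_cases:       # 3. run test queries
        start = time.perf_counter()
        response = adapter.search(case.query)
        elapsed_ms = (time.perf_counter() - start) * 1000
        results.append(BenchmarkResult(   # 4. collect results + timing
            system=system_name,
            test_id=case.id,
            response={"items": response},
            latency_ms=elapsed_ms,
            timestamp=datetime.now(timezone.utc).isoformat(),
        ))
    return results                        # 5. the suite repeats this N times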

Phase 2: Metrics Calculation (10-12h)

Goals:

  • Compute comprehensive metrics
  • Add statistical analysis

Tasks:

  1. Implement retrieval metrics (P@K, R@K, MRR, NDCG)
  2. Implement temporal metrics
  3. Implement latency metrics
  4. Add statistical significance tests, as sketched below

Checkpoint: Metrics computed for all systems.
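
For Task 4, a sketch of a paired comparison between two systems across matched runs, returning the Comparison model from section 4.3. It uses scipy's paired t-test and Cohen's d computed on the per-run differences; the function name and alpha threshold are illustrative choices.

import numpy as np
from scipy import stats

from models import Comparison  # defined in section 4.3 (src/models.py)

def compare_metric(metric: str, system_a: str, system_b: str,
                   scores_a: list[float], scores_b: list[float],
                   alpha: float = 0.05) -> Comparison:
    """Paired t-test plus effect size for one metric across matched runs."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    p_value = float(stats.ttest_rel(a, b).pvalue)         # paired t-test
    diff = a - b
    effect_size = float(diff.mean() / diff.std(ddof=1))   # Cohen's d on paired differences
    winner = None
    if p_value < alpha:
        # assumes higher is better; invert the comparison for latency-style metrics
        winner = system_a if diff.mean() > 0 else system_b
    return Comparison(system_a=system_a, system_b=system_b, metric=metric,
                      p_value=p_value, effect_size=effect_size, winner=winner)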

Phase 3: Reporting (8-10h)

Goals:

  • Generate comparison reports
  • Create visualizations

Tasks:

  1. Build HTML report template
  2. Create comparison tables
  3. Add visualizations (charts, graphs)
  4. Add export to Markdown/CSV, as sketched below

Checkpoint: Full benchmark report generated.
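
For Task 4, a minimal Markdown export sketch using Jinja2, as suggested in the component table (an HTML template for Task 1 follows the same pattern). The template string and the rows/systems context shape are assumptions.

from jinja2 import Template

REPORT_TEMPLATE = Template("""# Memory Benchmark Report

| Metric | {% for s in systems %}{{ s }} | {% endfor %}
|--------|{% for s in systems %}---|{% endfor %}
{% for metric, values in rows.items() -%}
| {{ metric }} | {% for s in systems %}{{ "%.2f"|format(values[s]) }} | {% endfor %}
{% endfor %}""")

def write_markdown_report(rows: dict, systems: list[str], path: str) -> None:
    """rows maps metric name -> {system name -> mean value}."""
    with open(path, "w") as f:
        f.write(REPORT_TEMPLATE.render(rows=rows, systems=systems))

# Example:
# write_markdown_report({"Recall@10": {"graphiti": 0.85, "mem0": 0.78}},
#                       systems=["graphiti", "mem0"], path="report.md")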


6. Testing Strategy

6.1 Test Categories

Category     Purpose                  Examples
────────     ───────                  ────────
Unit         Test metrics             Precision calculation
Integration  Test full pipeline       Dataset → report
Validation   Test benchmark quality   Known-result datasets

6.2 Critical Test Cases

  1. Metrics correctness: Known inputs produce expected metrics (see the example tests below)
  2. Reproducibility: Same run produces same results
  3. Statistical tests: P-values computed correctly
  4. Report generation: All sections populated

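For the "metrics correctness" case, a couple of example tests for tests/test_metrics.py, assuming the recall_at_k and mrr helpers sketched in section 2.1 live in src/metrics/retrieval.py.

import pytest

from metrics.retrieval import recall_at_k, mrr  # assumed import path (src/metrics/retrieval.py)

def test_recall_at_k_known_input():
    retrieved = ["fact_123", "fact_999", "fact_456"]
    relevant = {"fact_123", "fact_456", "fact_789"}
    # 2 of the 3 relevant facts appear in the top 3 results
    assert recall_at_k(retrieved, relevant, k=3) == pytest.approx(2 / 3)

def test_mrr_first_relevant_at_rank_two():
    retrieved = ["fact_999", "fact_123"]
    relevant = {"fact_123"}
    assert mrr(retrieved, relevant) == pytest.approx(0.5)
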
7. Common Pitfalls & Debugging

Pitfall             Symptom             Solution
───────             ───────             ────────
Biased datasets     Misleading results  Balance difficulty levels
Too few runs        High variance       Run 10+ times
Wrong ground truth  Incorrect metrics   Manual verification
Ignoring variance   False comparisons   Always report std dev

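For the "Ignoring variance" row above, a small helper that aggregates one metric over repeated runs into the MetricResult model from section 4.3, using a t-distribution confidence interval from scipy; the function name is illustrative.

import numpy as np
from scipy import stats

from models import MetricResult  # defined in section 4.3 (src/models.py)

def aggregate_runs(metric_name: str, system: str, values: list[float],
                   confidence: float = 0.95) -> MetricResult:
    """Mean, standard deviation, and confidence interval across repeated runs."""
    arr = np.asarray(values, dtype=float)
    mean, sem = float(arr.mean()), stats.sem(arr)
    low, high = stats.t.interval(confidence, df=len(arr) - 1, loc=mean, scale=sem)
    return MetricResult(metric_name=metric_name, system=system,
                        value=mean, std_dev=float(arr.std(ddof=1)),
                        confidence_interval=(float(low), float(high)))
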
8. Extensions & Challenges

8.1 Beginner Extensions

  • Add more benchmark datasets
  • Implement additional adapters

8.2 Intermediate Extensions

  • Add automated dataset generation
  • Implement regression testing

8.3 Advanced Extensions

  • Add continuous benchmarking (CI/CD)
  • Implement multi-language support

9. Real-World Connections

9.1 Industry Applications

  • MTEB: Massive Text Embedding Benchmark
  • BEIR: Benchmarking IR datasets
  • LangChain Evaluations: LLM application testing

9.2 Interview Relevance

  • Explain evaluation metrics and their tradeoffs
  • Discuss statistical significance in ML
  • Describe benchmark design principles

10. Resources

10.1 Essential Reading

  • “AI Engineering” by Chip Huyen — Ch. on Evaluation
  • MTEB Paper — Embedding benchmark methodology
  • “Statistical Methods” papers — Significance testing
  • Previous: Project 14 (Production Memory Service)
  • This is the capstone project of the series

11. Self-Assessment Checklist

  • I can design fair benchmark datasets
  • I understand evaluation metrics (P@K, R@K, MRR)
  • I know how to compute statistical significance
  • I can create meaningful comparison reports

12. Submission / Completion Criteria

Minimum Viable Completion:

  • 2+ benchmark datasets
  • 2+ system adapters
  • Basic metrics (P@K, R@K)

Full Completion:

  • All planned datasets
  • Comprehensive metrics
  • HTML report generation

Excellence:

  • Statistical significance testing
  • Visualization suite
  • Continuous benchmarking setup