Project 15: Memory Benchmark Suite
Build a comprehensive benchmark suite that evaluates memory systems on accuracy, latency, recall, and temporal reasoning capabilities with reproducible tests and comparison reports.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 4: Expert |
| Time Estimate | 2-3 weeks (30-40 hours) |
| Language | Python |
| Prerequisites | Projects 1-14, evaluation methodology, statistics |
| Key Topics | Benchmarking, evaluation metrics, reproducibility, statistical significance, comparative analysis |
1. Learning Objectives
By completing this project, you will:
- Design benchmarks that measure memory system quality.
- Implement metrics for accuracy, recall, latency, and temporal reasoning.
- Build reproducible test harnesses.
- Create meaningful comparison reports.
- Understand statistical significance in evaluations.
2. Theoretical Foundation
2.1 Core Concepts
- Benchmark Design: Creating representative test cases that stress different aspects of memory systems.
- Evaluation Metrics (a minimal computation sketch follows this list):
  - Accuracy: Is the retrieved information correct?
  - Recall@K: What fraction of relevant items are in the top K?
  - Latency: How fast is retrieval? (P50, P95, P99)
  - Temporal Accuracy: Are time-based queries correct?
- Reproducibility: The same benchmark should produce the same results given the same system state.
- Statistical Significance: Are differences between systems meaningful or due to chance?
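To make the metric definitions concrete, here is a minimal sketch of Recall@K and latency percentiles using numpy; the function and variable names are illustrative, not part of the suite's API.

```python
import numpy as np

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k retrieved results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)

# Example: 2 of 3 relevant facts appear in the top 5 results -> recall@5 = 0.67
print(recall_at_k(["f1", "f9", "f2", "f7", "f5"], {"f1", "f2", "f3"}, k=5))

# Latency percentiles over repeated query timings (milliseconds)
latencies_ms = np.array([42, 45, 51, 48, 120, 44, 47, 300, 49, 46])
p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"P50={p50:.0f}ms  P95={p95:.0f}ms  P99={p99:.0f}ms")
```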
2.2 Why This Matters
Without benchmarks, you can’t answer:
- “Is Graphiti better than Mem0 for my use case?”
- “Does adding graph retrieval improve accuracy?”
- “What’s the latency cost of temporal queries?”
Rigorous evaluation enables informed decisions.
2.3 Common Misconceptions
- “Just compare accuracy.” Latency matters too; 99% accuracy at 10s latency is unusable.
- “Run once and report.” Multiple runs needed for statistical confidence.
- “All benchmarks are equal.” Benchmark quality determines evaluation quality.
2.4 ASCII Diagram: Benchmark Architecture
MEMORY BENCHMARK SUITE ARCHITECTURE
══════════════════════════════════════════════════════════════
┌─────────────────────────────────────────────────────────────┐
│ BENCHMARK DATASETS │
├─────────────────────────────────────────────────────────────┤
│ │
│ CONVERSATIONAL MEMORY │
│ ──────────────────── │
│ • 1000 synthetic conversations │
│ • Ground truth: extracted facts, relationships │
│ • Temporal annotations: when facts were mentioned │
│ │
│ ENTITY RESOLUTION │
│ ───────────────── │
│ • 500 entity pairs (same/different labels) │
│ • Varying similarity levels │
│ • Cross-conversation deduplication │
│ │
│ TEMPORAL REASONING │
│ ────────────────── │
│ • 200 temporal queries with expected answers │
│ • "What was X before Y happened?" │
│ • Point-in-time reconstruction tests │
│ │
│ RETRIEVAL QUALITY │
│ ───────────────── │
│ • 300 queries with relevance judgments │
│ • Semantic, keyword, and graph traversal queries │
│ • Multiple difficulty levels │
│ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ TEST HARNESS │
├─────────────────────────────────────────────────────────────┤
│ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ SYSTEM ADAPTER │ │
│ │ │ │
│ │ Uniform interface for different memory systems: │ │
│ │ • adapter.ingest(episodes) │ │
│ │ • adapter.search(query) │ │
│ │ • adapter.get_entity(id) │ │
│ │ • adapter.temporal_query(query, at_time) │ │
│ │ │ │
│ │ Implemented adapters: │ │
│ │ • GraphitiAdapter │ │
│ │ • Mem0Adapter │ │
│ │ • CustomMemoryAdapter (your implementation) │ │
│ └───────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌───────────────────────────────────────────────────────┐ │
│ │ TEST RUNNER │ │
│ │ │ │
│ │ For each (dataset, system): │ │
│ │ 1. Reset system state │ │
│ │ 2. Ingest benchmark data │ │
│ │ 3. Run test queries │ │
│ │ 4. Collect results + timing │ │
│ │ 5. Repeat N times for statistics │ │
│ └───────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ METRICS COLLECTOR │
├─────────────────────────────────────────────────────────────┤
│ │
│ RETRIEVAL METRICS │
│ ───────────────── │
│ • Precision@K: relevant / retrieved │
│ • Recall@K: retrieved_relevant / total_relevant │
│ • MRR: 1/rank of first relevant │
│ • NDCG@K: normalized discounted cumulative gain │
│ │
│ TEMPORAL METRICS │
│ ──────────────── │
│ • Temporal accuracy: % correct time-based queries │
│ • Point-in-time precision: correct state reconstruction │
│ • Edge validity: % edges with correct valid_from/to │
│ │
│ LATENCY METRICS │
│ ─────────────── │
│ • P50, P95, P99 latency │
│ • Ingestion throughput (episodes/second) │
│ • Query throughput (queries/second) │
│ │
│ QUALITY METRICS │
│ ─────────────── │
│ • Entity extraction F1 │
│ • Relationship extraction F1 │
│ • Entity resolution accuracy │
│ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ REPORT GENERATOR │
├─────────────────────────────────────────────────────────────┤
│ │
│ COMPARISON TABLE │
│ ──────────────── │
│ ┌────────────────┬──────────┬──────────┬──────────┐ │
│ │ Metric │ Graphiti │ Mem0 │ Custom │ │
│ ├────────────────┼──────────┼──────────┼──────────┤ │
│ │ Recall@10 │ 0.85 │ 0.78 │ 0.82 │ │
│ │ P99 Latency │ 120ms │ 80ms │ 95ms │ │
│ │ Temporal Acc │ 0.92 │ 0.71 │ 0.88 │ │
│ │ Entity F1 │ 0.89 │ 0.85 │ 0.87 │ │
│ └────────────────┴──────────┴──────────┴──────────┘ │
│ │
│ STATISTICAL ANALYSIS │
│ ──────────────────── │
│ • 95% confidence intervals for all metrics │
│ • Paired t-tests for system comparisons │
│ • Effect size (Cohen's d) for significant differences │
│ │
│ VISUALIZATIONS │
│ ────────────── │
│ • Latency distribution histograms │
│ • Precision-recall curves │
│ • Radar charts for multi-dimensional comparison │
│ │
└─────────────────────────────────────────────────────────────┘
BENCHMARK DATASET EXAMPLE
═════════════════════════
RETRIEVAL BENCHMARK ENTRY
─────────────────────────
{
"id": "query_001",
"category": "semantic",
"difficulty": "medium",
"query": "What programming languages does Alice prefer?",
"relevant_facts": [
"fact_123", // Alice prefers Python
"fact_456", // Alice uses Python for APIs
"fact_789" // Alice mentioned liking Rust
],
"expected_entities": ["Alice", "Python", "Rust"],
"context": "Conversation from December 2024 about preferences"
}
TEMPORAL BENCHMARK ENTRY
────────────────────────
{
"id": "temporal_001",
"category": "point_in_time",
"query": "What was Alice's job title before the promotion?",
"reference_time": "2024-10-01", // Query as of this date
"expected_answer": "Software Engineer",
"ground_truth": {
"fact": "Alice has job title Senior Engineer",
"valid_from": "2024-11-15", // Promotion date
"previous_fact": "Alice has job title Software Engineer",
"previous_valid_from": "2022-03-01"
}
}
ENTITY RESOLUTION BENCHMARK ENTRY
─────────────────────────────────
{
"id": "entity_001",
"entity_a": {"name": "Alice Smith", "type": "Person", "context": "from episode 23"},
"entity_b": {"name": "A. Smith", "type": "Person", "context": "from episode 45"},
"label": "same", // or "different"
"confidence": 1.0, // Human annotation confidence
"reasoning": "Same person, abbreviated name in later conversation"
}
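Entries like these can be stored as plain JSON (without the inline // comments, which standard JSON does not allow) and validated when a dataset is loaded. A minimal loader sketch, assuming a file containing a list of retrieval entries shaped like the one above:

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"id", "category", "difficulty", "query", "relevant_facts"}

def load_retrieval_dataset(path: str) -> list[dict]:
    """Load retrieval benchmark entries and check that required fields are present."""
    entries = json.loads(Path(path).read_text())   # expects a JSON list of entry objects
    for entry in entries:
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            raise ValueError(f"Entry {entry.get('id', '?')} is missing fields: {missing}")
    return entries

# Usage (path matches the project layout in Section 5.2):
# dataset = load_retrieval_dataset("src/datasets/retrieval.json")
```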
3. Project Specification
3.1 What You Will Build
A Python benchmark suite that:
- Provides curated test datasets
- Adapts to different memory systems
- Measures comprehensive metrics
- Generates comparison reports
3.2 Functional Requirements
- Load dataset: suite.load_dataset("retrieval") → Dataset
- Register system: suite.register(name, adapter)
- Run benchmark: suite.run(datasets, systems) → Results (a run-loop sketch follows this list)
- Compute metrics: suite.compute_metrics(results) → Metrics
- Generate report: suite.report(metrics) → HTML/Markdown
- Compare systems: suite.compare(system_a, system_b) → Comparison
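A minimal skeleton showing how register and run could fit together, including the reset → ingest → query → time loop from the test-harness diagram. The adapter's reset() method and the dataset's episodes/test_cases attributes are illustrative assumptions, not a fixed API.

```python
import time

class BenchmarkSuite:
    """Partial sketch: only register() and run() are shown here."""

    def __init__(self):
        self.systems = {}   # name -> adapter

    def register(self, name, adapter):
        self.systems[name] = adapter

    def run(self, datasets, systems, runs=5):
        """For each (dataset, system): reset, ingest, query, and time, repeated `runs` times."""
        results = []
        for dataset in datasets:
            for name in systems:
                adapter = self.systems[name]
                for _ in range(runs):
                    adapter.reset()                      # assumed helper to clear system state
                    adapter.ingest(dataset.episodes)     # assumed dataset attribute
                    for case in dataset.test_cases:      # assumed dataset attribute
                        start = time.perf_counter()
                        response = adapter.search(case.query)
                        latency_ms = (time.perf_counter() - start) * 1000
                        results.append({"system": name, "test_id": case.id,
                                        "response": response, "latency_ms": latency_ms})
        return results
```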
3.3 Example Usage / Output
from memory_benchmark import BenchmarkSuite, GraphitiAdapter, Mem0Adapter, CustomMemoryAdapter
# Initialize suite
suite = BenchmarkSuite()
# Register systems to benchmark
suite.register("graphiti", GraphitiAdapter(config))
suite.register("mem0", Mem0Adapter(config))
suite.register("custom", CustomMemoryAdapter(my_memory))
# Load benchmark datasets
retrieval_data = suite.load_dataset("retrieval")
temporal_data = suite.load_dataset("temporal")
# Run benchmarks
results = suite.run(
datasets=[retrieval_data, temporal_data],
systems=["graphiti", "mem0", "custom"],
runs=5 # Multiple runs for statistics
)
# Compute metrics
metrics = suite.compute_metrics(results)
print(metrics.summary())
# ┌────────────────┬──────────┬──────────┬──────────┐
# │ Metric │ graphiti │ mem0 │ custom │
# ├────────────────┼──────────┼──────────┼──────────┤
# │ Recall@10 │ 0.85±0.02│ 0.78±0.03│ 0.82±0.02│
# │ MRR │ 0.72±0.03│ 0.68±0.04│ 0.70±0.03│
# │ Temporal Acc │ 0.92±0.01│ 0.71±0.02│ 0.88±0.02│
# │ P99 Latency │ 120ms │ 80ms │ 95ms │
# │ Entity F1 │ 0.89±0.02│ 0.85±0.02│ 0.87±0.02│
# └────────────────┴──────────┴──────────┴──────────┘
# Statistical comparison
comparison = suite.compare("graphiti", "mem0")
print(comparison)
# Recall@10: graphiti significantly better (p=0.003, d=0.82)
# Temporal Acc: graphiti significantly better (p<0.001, d=1.45)
# P99 Latency: mem0 significantly better (p=0.012, d=0.65)
# Generate full report
suite.report(metrics, output="benchmark_report.html")
print("Report generated: benchmark_report.html")
# Per-category breakdown
print(metrics.by_category("retrieval"))
# ┌────────────────┬──────────┬──────────┬──────────┐
# │ Query Type │ graphiti │ mem0 │ custom │
# ├────────────────┼──────────┼──────────┼──────────┤
# │ Semantic │ 0.88 │ 0.82 │ 0.85 │
# │ Keyword │ 0.75 │ 0.85 │ 0.80 │
# │ Graph-based │ 0.92 │ 0.68 │ 0.81 │
# └────────────────┴──────────┴──────────┴──────────┘
4. Solution Architecture
4.1 High-Level Design
┌───────────────┐ ┌───────────────┐
│ Datasets │────▶│ Test Runner │
└───────────────┘ └───────┬───────┘
│
┌───────────────┼───────────────┐
│ │ │
▼ ▼ ▼
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Adapter 1 │ │ Adapter 2 │ │ Adapter 3 │
│ (Graphiti)│ │ (Mem0) │ │ (Custom) │
└───────────┘ └───────────┘ └───────────┘
│ │ │
└───────────────┼───────────────┘
│
▼
┌───────────────┐
│ Metrics │
│ Calculator │
└───────┬───────┘
│
▼
┌───────────────┐
│ Report │
│ Generator │
└───────────────┘
4.2 Key Components
| Component | Responsibility | Technology |
|---|---|---|
| BenchmarkSuite | Orchestrate benchmarks | Python class |
| Dataset | Store test cases and ground truth | JSON/Parquet |
| Adapter | Interface to memory system | Abstract class |
| MetricsCalculator | Compute evaluation metrics | scipy, numpy |
| ReportGenerator | Create HTML/Markdown reports | Jinja2, matplotlib |
4.3 Data Models
from pydantic import BaseModel
from typing import Literal
class TestCase(BaseModel):
id: str
category: str
difficulty: Literal["easy", "medium", "hard"]
query: str
ground_truth: dict
class BenchmarkResult(BaseModel):
system: str
test_id: str
response: dict
latency_ms: float
timestamp: str
class MetricResult(BaseModel):
metric_name: str
system: str
value: float
std_dev: float
confidence_interval: tuple[float, float]
class Comparison(BaseModel):
system_a: str
system_b: str
metric: str
p_value: float
effect_size: float
winner: str | None
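As one way to populate the Comparison model, a paired t-test plus Cohen's d can be computed with scipy. This is a sketch under the assumption that both systems were scored per test case, in the same order, and that higher scores are better for the metric.

```python
import numpy as np
from scipy import stats

def compare_systems(metric: str, system_a: str, scores_a: list[float],
                    system_b: str, scores_b: list[float],
                    alpha: float = 0.05) -> Comparison:
    """Paired t-test over per-test-case scores; Cohen's d computed on the paired differences."""
    a, b = np.asarray(scores_a, dtype=float), np.asarray(scores_b, dtype=float)
    result = stats.ttest_rel(a, b)                 # paired (related-samples) t-test
    diff = a - b
    sd = diff.std(ddof=1)
    effect_size = abs(diff.mean() / sd) if sd > 0 else 0.0   # Cohen's d for paired samples
    winner = None
    if result.pvalue < alpha:                      # assumes higher is better for this metric
        winner = system_a if diff.mean() > 0 else system_b
    return Comparison(system_a=system_a, system_b=system_b, metric=metric,
                      p_value=float(result.pvalue), effect_size=float(effect_size),
                      winner=winner)
```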
5. Implementation Guide
5.1 Development Environment Setup
mkdir memory-benchmark && cd memory-benchmark
python -m venv .venv && source .venv/bin/activate
pip install numpy scipy pandas matplotlib jinja2 pydantic
5.2 Project Structure
memory-benchmark/
├── src/
│ ├── suite.py # BenchmarkSuite main class
│ ├── datasets/
│ │ ├── retrieval.json
│ │ ├── temporal.json
│ │ └── entity.json
│ ├── adapters/
│ │ ├── base.py # Abstract adapter
│ │ ├── graphiti.py
│ │ ├── mem0.py
│ │ └── custom.py
│ ├── metrics/
│ │ ├── retrieval.py # Precision, recall, MRR
│ │ ├── temporal.py # Temporal accuracy
│ │ └── statistics.py # Significance tests
│ ├── reports/
│ │ ├── generator.py
│ │ └── templates/
│ └── models.py
├── tests/
│ └── test_metrics.py
└── README.md
5.3 Implementation Phases
Phase 1: Datasets and Adapters (10-12h)
Goals:
- Curate benchmark datasets
- Build system adapter interface
Tasks:
- Create retrieval benchmark dataset (300+ queries)
- Create temporal benchmark dataset (200+ queries)
- Design adapter interface
- Implement at least 2 adapters
Checkpoint: Can run queries against two systems.
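One possible shape for the adapter interface designed in this phase. The method names mirror the system-adapter box in the Section 2.4 diagram; reset() is an addition assumed by the test runner, and the argument types are illustrative.

```python
from abc import ABC, abstractmethod

class MemoryAdapter(ABC):
    """Uniform interface the test runner uses to talk to any memory system."""

    @abstractmethod
    def reset(self) -> None:
        """Clear all stored memory so each run starts from a clean state."""

    @abstractmethod
    def ingest(self, episodes: list[dict]) -> None:
        """Load benchmark episodes into the memory system."""

    @abstractmethod
    def search(self, query: str, limit: int = 10) -> list[dict]:
        """Return ranked results for a query."""

    @abstractmethod
    def get_entity(self, entity_id: str) -> dict | None:
        """Fetch a single entity by id."""

    @abstractmethod
    def temporal_query(self, query: str, at_time: str) -> dict:
        """Answer a query as of a specific point in time."""
```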
Phase 2: Metrics Calculation (10-12h)
Goals:
- Compute comprehensive metrics
- Add statistical analysis
Tasks:
- Implement retrieval metrics (P@K, R@K, MRR, NDCG)
- Implement temporal metrics
- Implement latency metrics
- Add statistical significance tests
Checkpoint: Metrics computed for all systems.
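Recall@K was sketched in Section 2.1; MRR and NDCG@K can be implemented along the same lines. Function names here are illustrative, not the suite's API.

```python
import math

def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1/rank of the first relevant result (0.0 if none); average over queries to get MRR."""
    for rank, item in enumerate(retrieved_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Binary-relevance NDCG@K: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, item in enumerate(retrieved_ids[:k], start=1)
              if item in relevant_ids)
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```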
Phase 3: Reporting (8-10h)
Goals:
- Generate comparison reports
- Create visualizations
Tasks:
- Build HTML report template
- Create comparison tables
- Add visualizations (charts, graphs)
- Add export to Markdown/CSV
Checkpoint: Full benchmark report generated.
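A minimal report-generation sketch using a Jinja2 template. The inline template string and the metric data shape are placeholders for the real templates/ directory and metrics objects.

```python
from jinja2 import Template

REPORT_TEMPLATE = Template("""
<h1>Memory Benchmark Report</h1>
<table border="1">
  <tr><th>Metric</th>{% for system in systems %}<th>{{ system }}</th>{% endfor %}</tr>
  {% for metric, values in metrics.items() %}
  <tr><td>{{ metric }}</td>{% for system in systems %}<td>{{ values[system] }}</td>{% endfor %}</tr>
  {% endfor %}
</table>
""")

def write_report(metrics: dict[str, dict[str, str]], systems: list[str], output: str) -> None:
    """Render the comparison table to an HTML file."""
    html = REPORT_TEMPLATE.render(metrics=metrics, systems=systems)
    with open(output, "w") as f:
        f.write(html)

# Usage:
# write_report({"Recall@10": {"graphiti": "0.85", "mem0": "0.78"}},
#              systems=["graphiti", "mem0"], output="benchmark_report.html")
```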
6. Testing Strategy
6.1 Test Categories
| Category | Purpose | Examples |
|---|---|---|
| Unit | Test metrics | Precision calculation |
| Integration | Test full pipeline | Dataset → report |
| Validation | Test benchmark quality | Known-result datasets |
6.2 Critical Test Cases
- Metrics correctness: Known inputs produce expected metrics
- Reproducibility: Same run produces same results
- Statistical tests: P-values computed correctly
- Report generation: All sections populated
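A couple of these cases as pytest sketches, exercising the recall and NDCG helpers sketched earlier; the module path in the import is an assumed layout, not a fixed one.

```python
# tests/test_metrics.py
import pytest

from src.metrics.retrieval import recall_at_k, ndcg_at_k  # assumed module layout

def test_recall_at_k_known_values():
    # 2 of 3 relevant items appear in the top 5 -> recall = 2/3
    assert recall_at_k(["a", "x", "b", "y", "z"], {"a", "b", "c"}, k=5) == pytest.approx(2 / 3)

def test_perfect_ranking_has_ndcg_of_one():
    # All relevant items ranked first -> NDCG@K is exactly 1.0
    assert ndcg_at_k(["a", "b", "x"], {"a", "b"}, k=3) == pytest.approx(1.0)

def test_metrics_are_reproducible():
    # Same inputs must always give the same score (no hidden randomness)
    args = (["a", "x", "b"], {"a", "b"}, 3)
    assert recall_at_k(*args) == recall_at_k(*args)
```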
7. Common Pitfalls & Debugging
| Pitfall | Symptom | Solution |
|---|---|---|
| Biased datasets | Misleading results | Balance difficulty levels |
| Too few runs | High variance | Run 10+ times |
| Wrong ground truth | Incorrect metrics | Manual verification |
| Ignoring variance | False comparisons | Always report std dev |
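For the variance-related pitfalls, report a mean with a confidence interval across runs rather than a single number. A minimal sketch using scipy (function name is illustrative):

```python
import numpy as np
from scipy import stats

def mean_with_ci(run_scores: list[float], confidence: float = 0.95) -> tuple[float, float, float]:
    """Mean of per-run scores plus a t-distribution confidence interval (mean, lower, upper)."""
    scores = np.asarray(run_scores, dtype=float)
    mean = scores.mean()
    sem = stats.sem(scores)                                        # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, df=len(scores) - 1)
    return float(mean), float(mean - half_width), float(mean + half_width)

# Five runs of Recall@10 for one system: report 0.85 ± CI, not a single point estimate
print(mean_with_ci([0.84, 0.86, 0.85, 0.83, 0.87]))
```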
8. Extensions & Challenges
8.1 Beginner Extensions
- Add more benchmark datasets
- Implement additional adapters
8.2 Intermediate Extensions
- Add automated dataset generation
- Implement regression testing
8.3 Advanced Extensions
- Add continuous benchmarking (CI/CD)
- Implement multi-language support
9. Real-World Connections
9.1 Industry Applications
- MTEB: Massive Text Embedding Benchmark
- BEIR: Benchmarking IR datasets
- LangChain Evaluations: LLM application testing
9.2 Interview Relevance
- Explain evaluation metrics and their tradeoffs
- Discuss statistical significance in ML
- Describe benchmark design principles
10. Resources
10.1 Essential Reading
- “AI Engineering” by Chip Huyen — Ch. on Evaluation
- MTEB Paper — Embedding benchmark methodology
- “Statistical Methods” papers — Significance testing
10.2 Related Projects
- Previous: Project 14 (Production Memory Service)
- This is the capstone project of the series
11. Self-Assessment Checklist
- I can design fair benchmark datasets
- I understand evaluation metrics (P@K, R@K, MRR)
- I know how to compute statistical significance
- I can create meaningful comparison reports
12. Submission / Completion Criteria
Minimum Viable Completion:
- 2+ benchmark datasets
- 2+ system adapters
- Basic metrics (P@K, R@K)
Full Completion:
- All planned datasets
- Comprehensive metrics
- HTML report generation
Excellence:
- Statistical significance testing
- Visualization suite
- Continuous benchmarking setup