Project 1: Memory Event Logger + Recall Probes

Build a strict memory event log and a deterministic recall probe runner so you can measure whether memory retrieval actually works.

Quick Reference

Attribute                           Value
Difficulty                          Level 2
Time Estimate                       Weekend
Main Programming Language           Python
Alternative Programming Languages   TypeScript, Go
Coolness Level                      Level 2
Business Potential                  Level 2
Prerequisites                       JSON, SQLite, basic CLI usage
Key Topics                          Memory taxonomy, schema design, evaluation probes

1. Learning Objectives

By completing this project, you will:

  1. Design a strict schema that separates memory types and sensitivity.
  2. Build a deterministic recall probe system with pass/fail scoring.
  3. Track memory lineage and retrieval traces for auditing.
  4. Explain why a memory was retrieved and where it was placed.

2. All Theory Needed (Per-Concept Breakdown)

Memory Event Schema and Recall Evaluation

Fundamentals Memory systems collapse without structure. A memory event schema is the contract between what you store and what you can retrieve later. It defines the non-negotiable fields: type (episodic/semantic/preference/procedural), source (user, tool, system), time, confidence, sensitivity, and optional consent. Without these fields you cannot filter or audit memory, and retrieval becomes a guessing game. A recall probe is a deterministic test that validates whether the system can retrieve a specific memory in a controlled context. It is not a benchmark for model intelligence; it is a test of your memory pipeline. A good probe defines the query, the expected memory ID or phrase, and the placement requirement (e.g., “must appear in the top-3 retrieved memories”). Together, the schema and probes are the minimal foundation for all higher-level memory improvements.
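
As a sketch, this contract might be encoded as a strict Python type; the field names mirror the schema above, while the enum values and the MemoryEvent name are illustrative assumptions:

from dataclasses import dataclass
from typing import Optional

MEMORY_TYPES = {"episodic", "semantic", "preference", "procedural"}
SOURCES = {"user", "tool", "system"}
SENSITIVITY_LEVELS = {"low", "medium", "high"}  # illustrative labels

@dataclass(frozen=True)
class MemoryEvent:
    id: str                         # e.g. "EPI-00017"
    type: str                       # one of MEMORY_TYPES
    text: str
    source: str                     # one of SOURCES
    timestamp: str                  # ISO-8601, e.g. "2026-01-01T10:00:00Z"
    confidence: float               # 0.0-1.0; inferred memories score lower
    sensitivity: str                # one of SENSITIVITY_LEVELS
    consent: Optional[bool] = None  # the one optional field in the contract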

Deep Dive into the concept Schema design is about making memory operational. Every memory entry should answer: what it is, where it came from, when it happened, how reliable it is, and how sensitive it is. In agent memory, types matter because they control retrieval policy. For example, preference memory is sensitive and should only be retrieved with explicit consent, while episodic memory is less sensitive but should decay over time. That means your schema must include both type and sensitivity to allow policy enforcement. Confidence is another key field: many memories are inferred rather than stated directly, and you need a way to encode uncertainty. A common mistake is to store only text and embedding; this makes it impossible to filter or update memory later. A robust schema allows edits, merges, and deletions without breaking the audit trail.
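
Building on the MemoryEvent sketch above, a retrieval policy filter shows why type and sensitivity must live in the schema; the consent gate and the 30-day episodic window are illustrative rules, not prescriptions:

from datetime import datetime, timedelta, timezone

def passes_policy(event: MemoryEvent, now: datetime) -> bool:
    """Decide whether a stored memory may enter the retrieval candidate set."""
    if event.type == "preference":
        # Preference memory is sensitive: retrieve only with explicit consent.
        return event.consent is True
    if event.type == "episodic":
        # Episodic memory decays: drop entries older than an illustrative 30 days.
        ts = datetime.fromisoformat(event.timestamp.replace("Z", "+00:00"))
        return now - ts <= timedelta(days=30)
    return True

# usage: passes_policy(event, datetime.now(timezone.utc))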

Recall evaluation turns memory into a measurable system. A recall probe is a structured test that asks, “When a query is given, does memory X appear in the retrieved set and in the prompt?” This is not about whether the model uses the memory, but whether the pipeline delivered it. To make probes deterministic, you must freeze variables: use fixed queries, fixed retrieval parameters, and fixed temperature. You also need a definition of success (e.g., memory ID is in top-3 retrieval results and placed in the prompt). This allows you to compare memory system versions over time. If a change in schema or retrieval policy reduces probe pass rates, you know the system regressed. Probes also help reveal false positives: if a retrieval system returns irrelevant memories, you can detect it by inserting negative probes.
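
A minimal probe-runner sketch under those frozen variables; `retrieve` is a hypothetical function that must itself be deterministic (fixed top-k, stable tie-breaking):

import time

def run_probe(query: str, expected_id: str, retrieve, top_k: int = 3) -> dict:
    """Run one recall probe with fixed parameters and return a scored result."""
    start = time.perf_counter()
    candidates = retrieve(query, top_k=top_k)  # list of memory IDs, best first
    latency_ms = (time.perf_counter() - start) * 1000
    rank = candidates.index(expected_id) + 1 if expected_id in candidates else None
    return {
        "query": query,
        "expected": expected_id,
        "rank": rank,                # 1-based rank, or None if not retrieved
        "passed": rank is not None,  # success: expected ID within the top-k
        "latency_ms": round(latency_ms, 1),
    }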

Evaluation should be multi-dimensional. Basic recall@k tells you whether the memory is retrieved; latency measures whether it is retrieved fast enough for interactive use. A memory system that retrieves correctly but takes two seconds per query is unusable in real agents. A complete evaluation includes recall, latency, and traceability. Traceability means you can explain why a memory was retrieved (similarity score, recency, type match) and where it was placed in the prompt. This is critical for debugging, because retrieval errors are often silent. The schema and probe system together form the “unit tests” for memory, and you will reuse this foundation in every subsequent project.
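
Traceability can be as simple as one persisted record per retrieved memory; the fields below sketch what "why retrieved, where placed" might capture (all identifiers and values are hypothetical):

trace = {
    "probe_id": "probe-001",
    "memory_id": "EPI-00017",
    "similarity": 0.91,          # why retrieved: score against the query
    "recency_boost": 0.05,       # why retrieved: recency contribution
    "type_match": True,          # why retrieved: passed the type/policy filter
    "rank": 2,                   # position in the retrieved set
    "placement_zone": "anchor",  # where it was placed in the prompt
}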

From a systems perspective, this concept must be treated as a first-class interface between data and behavior. That means you need explicit invariants (what must always be true), observability (how you know it is true), and failure signatures (how it breaks when it is not). In practice, engineers often skip this and rely on ad-hoc fixes, which creates hidden coupling between the memory subsystem and the rest of the agent stack. A better approach is to model the concept as a pipeline stage with clear inputs, outputs, and preconditions: if inputs violate the contract, the stage should fail fast rather than silently corrupt memory. This is especially important because memory errors are long-lived and compound over time. You should also define operational metrics that reveal drift early. Examples include: the percentage of memory entries that lack required metadata, the ratio of retrieved memories that are later unused by the model, or the fraction of queries that trigger a fallback route because the primary memory store is empty. These metrics are not just for dashboards; they are design constraints that force you to keep the system testable and predictable.
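
As a sketch, two of these drift metrics can be computed directly against the SQLite store; the table and column names are assumptions that match the shapes used elsewhere in this project:

import sqlite3

conn = sqlite3.connect("memory.db")  # assumed database path

# Fraction of memory entries missing required metadata.
missing = conn.execute(
    "SELECT AVG(CASE WHEN type IS NULL OR timestamp IS NULL "
    "OR sensitivity IS NULL THEN 1.0 ELSE 0.0 END) FROM memory_events"
).fetchone()[0]

# Fraction of probe runs that retrieved nothing (a fallback-route signal).
empty_rate = conn.execute(
    "SELECT AVG(CASE WHEN rank IS NULL THEN 1.0 ELSE 0.0 END) FROM probe_results"
).fetchone()[0]

print(f"missing-metadata rate: {missing or 0:.2%}, "
      f"empty-retrieval rate: {empty_rate or 0:.2%}")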

Another critical dimension is lifecycle management. The concept may work well at small scale but degrade as the memory grows. This is where policies and thresholds matter: you need rules for promotion, demotion, merging, or deletion that prevent the memory from becoming a landfill. The policy should be deterministic and versioned. When it changes, you should be able to replay historical inputs and measure the delta in outputs. This is the same discipline used in data engineering for schema changes and backfills, and it applies equally to memory systems. Finally, remember that memory is an interface to user trust. If the memory system is noisy, the agent feels unreliable; if it is overly strict, the agent feels forgetful. The best designs expose these trade-offs explicitly, so you can tune them according to product goals rather than guessing in the dark.
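
A deterministic, versioned policy can be an ordinary pure function tagged with a version string, so historical inputs can be replayed against old and new rules. Reusing the MemoryEvent sketch from earlier, with every threshold illustrative:

POLICY_VERSION = "lifecycle-v2"  # bump on every rule change; store with each decision

def lifecycle_action(event: MemoryEvent, age_days: float, hit_count: int) -> str:
    """Pure function: identical inputs always yield the identical action."""
    if event.confidence < 0.3 and hit_count == 0:
        return "delete"   # low confidence and never retrieved: landfill risk
    if event.type == "episodic" and age_days > 90:
        return "demote"   # old episodes move to cold storage
    if hit_count >= 5:
        return "promote"  # frequently used memories stay hot
    return "keep"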

How this fits into the projects This concept is the foundation for the entire sprint. It is applied directly in the schema design of Project 1 and reused in Projects 2, 3, 4, and 7.

Definitions & key terms

  • Memory event: A structured record of a single memory item.
  • Schema: A fixed set of fields required for storage and retrieval.
  • Recall probe: Deterministic test to check retrieval and placement.
  • Lineage: Links from derived memories back to sources.

Mental model diagram (ASCII)

Memory Event
+-------------+-------------------+
| id          | EPI-00017         |
| type        | episodic          |
| text        | "user prefers..." |
| source      | chat              |
| ts          | 2026-01-01T...    |
| confidence  | 0.82              |
| sensitivity | low               |
+-------------+-------------------+

Probe
query -> retrieve -> check placement -> pass/fail

How It Works (Step-by-Step)

  1. Define a schema with required fields and validation rules.
  2. Store each memory event with type and sensitivity.
  3. Build a probe that includes a query and expected memory ID.
  4. Run retrieval with fixed parameters.
  5. Check if the expected memory appears in the top-k and is injected into the prompt.
  6. Record pass/fail and latency.

Minimal Concrete Example

probe:
  query: "How should I respond?"
  expected_memory_id: "PRF-00005"
  required_rank: <=3
  placement_zone: "anchor"

Common Misconceptions

  • “If it is in the vector store, it is retrievable.” (False: indexing and policy filters can prevent retrieval.)
  • “Probes measure model intelligence.” (False: they measure retrieval pipeline correctness.)

Check-Your-Understanding Questions

  1. Why do memory events need a sensitivity field?
  2. What makes a probe deterministic?
  3. Why is retrieval traceability important?

Check-Your-Understanding Answers

  1. It enables safe filtering and consent policies.
  2. Fixed query, fixed retrieval parameters, fixed placement check.
  3. It explains why the system retrieved a memory and helps debug failures.

Real-World Applications

  • Customer support agents auditing whether critical preferences are retrieved.
  • Compliance audits of memory usage in regulated environments.

Where You’ll Apply It

  • In this project: §5.4 Concepts You Must Understand First and §6 Testing Strategy.
  • Also used in: Project 3, Project 7.

References

  • “A-MEM” (agentic memory schemas) - https://arxiv.org/abs/2502.12110
  • “AI Engineering” by Chip Huyen - Ch. 3-4 (evaluation methods)

Key Insights A memory system without a schema and probes is untestable and therefore unreliable.

Summary This concept defines how to structure memory and measure retrieval. It is the minimum foundation for every memory system that follows.

Homework/Exercises to Practice the Concept

  1. Design a schema for four memory types with required fields.
  2. Draft five recall probes with expected memory IDs.

Solutions to the Homework/Exercises

  1. Include fields: type, source, timestamp, confidence, sensitivity, consent.
  2. Probes should specify query, expected memory ID, and required rank.

3. Project Specification

3.1 What You Will Build

A CLI tool that:

  • Accepts memory events in a structured schema
  • Validates fields and rejects invalid entries
  • Stores events in SQLite
  • Runs recall probes with deterministic retrieval
  • Outputs pass/fail scores and latency statistics

3.2 Functional Requirements

  1. Schema Validation: Reject entries missing required fields.
  2. Memory Storage: Store events with IDs, timestamps, and metadata.
  3. Probe Runner: Execute probes and report pass/fail.
  4. Trace Logging: Record which memories were injected and why.
  5. Reporting: Summary stats by type, sensitivity, and pass rate.

3.3 Non-Functional Requirements

  • Performance: P95 probe runtime < 300ms for 1k memories.
  • Reliability: Deterministic outputs for identical probes.
  • Usability: Clear CLI messages and error codes.

3.4 Example Usage / Output

$ memory-log add --type episodic --text "User prefers concise answers" --source chat
[OK] memory_id=EPI-00017 stored

$ memory-log probe --query "How should I answer?" --expect "EPI-00017"
[PROBE] retrieved=EPI-00017 rank=2 placement=anchor latency=120ms
[RESULT] PASS

3.5 Data Formats / Schemas / Protocols

Memory event JSON (schema shape):

{
  "id": "EPI-00017",
  "type": "episodic",
  "text": "User prefers concise answers",
  "source": "chat",
  "timestamp": "2026-01-01T10:00:00Z",
  "confidence": 0.82,
  "sensitivity": "low",
  "consent": true
}
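
One plausible SQLite mapping of this schema (a sketch; the table name and index choices are assumptions aimed at the type- and sensitivity-based filtering this project requires):

import sqlite3

conn = sqlite3.connect("memory.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS memory_events (
    id          TEXT PRIMARY KEY,   -- PRIMARY KEY rejects duplicate IDs (see §3.6)
    type        TEXT NOT NULL CHECK (type IN
                  ('episodic','semantic','preference','procedural')),
    text        TEXT NOT NULL,
    source      TEXT NOT NULL,
    timestamp   TEXT NOT NULL,      -- ISO-8601 string
    confidence  REAL NOT NULL,
    sensitivity TEXT NOT NULL CHECK (sensitivity IN ('low','medium','high')),
    consent     INTEGER             -- nullable boolean (0/1/NULL)
);
CREATE INDEX IF NOT EXISTS idx_events_type ON memory_events(type);
CREATE INDEX IF NOT EXISTS idx_events_sensitivity ON memory_events(sensitivity);
""")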

3.6 Edge Cases

  • Missing type or timestamp
  • Invalid sensitivity label
  • Probe expects a memory that does not exist
  • Duplicate memory IDs

3.7 Real World Outcome

The transcripts below are a golden reference for correct behavior.

3.7.1 How to Run (Copy/Paste)

$ memory-log add --type episodic --text "User prefers concise answers" --source chat
$ memory-log probe --query "How should I answer?" --expect "EPI-00017"
$ memory-log report

3.7.2 Golden Path Demo (Deterministic)

$ memory-log probe --query "How should I answer?" --expect "EPI-00017"
[PROBE] retrieved=EPI-00017 rank=2 placement=anchor latency=120ms
[RESULT] PASS
exit_code=0

3.7.3 Failure Demo (Deterministic)

$ memory-log probe --query "How should I answer?" --expect "EPI-99999"
[PROBE] retrieved=NONE latency=95ms
[RESULT] FAIL (expected memory not retrieved)
exit_code=2
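
The exit codes in these two demos imply a small, stable contract; here is a sketch of how the CLI might enforce it, using the result shape from the earlier probe-runner sketch (code 1 for usage errors is an assumption beyond what the demos show):

import sys

EXIT_PASS = 0        # probe passed, as in §3.7.2
EXIT_USAGE = 1       # assumed: invalid arguments or schema errors
EXIT_PROBE_FAIL = 2  # expected memory not retrieved, as in §3.7.3

def exit_for(result: dict) -> None:
    """Map a probe result to the CLI's exit-code contract."""
    sys.exit(EXIT_PASS if result["passed"] else EXIT_PROBE_FAIL)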

4. Solution Architecture

4.1 High-Level Design

+------------+    +------------------+    +--------------+
| CLI Parser | -> | Schema Validator | -> | SQLite Store |
+------------+    +------------------+    +--------------+
      |                    |                     |
      v                    v                     v
 Probe Runner  ---->  Retrieval Engine  ---->  Reporter

4.2 Key Components

Component         Responsibility           Key Decisions
Schema Validator  Enforce required fields  Strict vs. permissive validation
Store             Persist memory events    SQLite schema and indexing
Probe Runner      Execute recall probes    Deterministic retrieval rules
Reporter          Aggregate metrics        Pass/fail scoring format

4.3 Data Structures (No Full Code)

MemoryEvent:
  id: string
  type: enum
  text: string
  source: string
  timestamp: iso8601
  confidence: float
  sensitivity: enum
  consent: bool

4.4 Algorithm Overview

Key Algorithm: Probe Execution

  1. Load probe query and expected ID.
  2. Retrieve candidates using fixed policy.
  3. Verify expected ID is within top-k.
  4. Record latency and result.

Complexity Analysis:

  • Time: O(n) to score all stored memories against the query, plus O(k) to check the top-k results
  • Space: O(n) for the memory store

5. Implementation Guide

5.1 Development Environment Setup

  • Install SQLite and verify the version
  • Create a virtual environment
  • Prepare a local config file for defaults (db path, top-k)

5.2 Project Structure

project-root/
├── src/
│   ├── cli/
│   ├── schema/
│   ├── store/
│   ├── probes/
│   └── report/
├── tests/
└── README.md

5.3 The Core Question You’re Answering

“What exactly counts as memory, and how do I prove the system retrieved it?”

5.4 Concepts You Must Understand First

  1. Schema validation
    • How do you enforce required fields?
    • How do you prevent invalid types?
  2. Probe determinism
    • How do you fix randomness? (a frozen-config sketch follows this list)
    • How do you define pass/fail rules?
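
One way to fix the randomness, sketched under the assumption that every knob lives in a single frozen config object (all values illustrative):

from dataclasses import dataclass

@dataclass(frozen=True)
class RetrievalConfig:
    """Freeze everything that could vary between probe runs."""
    top_k: int = 3
    temperature: float = 0.0   # no sampling randomness
    seed: int = 42             # fixed seed for any stochastic component
    tie_break: str = "id_asc"  # stable ordering for equal similarity scores

FROZEN = RetrievalConfig()     # one config, versioned alongside the probe suite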

5.5 Questions to Guide Your Design

  1. Schema Enforcement
    • Should invalid memories be rejected or quarantined?
    • How will you evolve the schema?
  2. Probe Design
    • What is the minimum number of probes for coverage?
    • How do you detect false positives?

5.6 Thinking Exercise

Draw a pipeline showing how a memory event becomes a probe result, including all intermediate components.

5.7 The Interview Questions They’ll Ask

  1. “Why is memory schema design critical?”
  2. “How do you test memory retrieval correctness?”
  3. “What fields are non-negotiable in memory records?”
  4. “How do you handle schema evolution?”
  5. “What is recall@k and why does it matter?”

5.8 Hints in Layers

Hint 1: Start with validation Reject invalid memory events early.

Hint 2: Use fixed top-k Keep retrieval size constant for probes.

Hint 3: Add a trace log Store a retrieval trace for each probe.

Hint 4: Build a summary report Aggregate pass rates by memory type.

5.9 Books That Will Help

Topic          Book                                                         Chapter
Evaluation     "AI Engineering" by Chip Huyen                               Ch. 3-4
Data modeling  "Designing Data-Intensive Applications" by Martin Kleppmann  Ch. 2-3

5.10 Implementation Phases

Phase 1: Foundation (4-6 hours)

Goals:

  • Define schema and validation rules
  • Create SQLite tables

Tasks:

  1. Design and document the schema.
  2. Build validation logic and error reporting.

Checkpoint: Adding an invalid memory fails with a clear error.

Phase 2: Core Functionality (6-8 hours)

Goals:

  • Add memory events
  • Run probes and record results

Tasks:

  1. Implement event ingestion.
  2. Implement probe runner with deterministic settings.

Checkpoint: A probe can pass/fail reliably.

Phase 3: Polish & Edge Cases (4-6 hours)

Goals:

  • Reports and edge cases
  • Error codes and CLI UX

Tasks:

  1. Add report summaries.
  2. Add failure modes and error codes.

Checkpoint: Report output matches expected stats.

5.11 Key Implementation Decisions

Decision           Options                Recommendation  Rationale
Schema strictness  Reject / Quarantine    Reject          Keeps memory clean and predictable
Probe evaluation   Text match / ID match  ID match        Deterministic and auditable
Storage            SQLite / JSON file     SQLite          Enables indexing and querying

6. Testing Strategy

6.1 Test Categories

Category           Purpose                Examples
Unit Tests         Validate schema rules  Missing field, invalid type
Integration Tests  Probe workflow         Ingest + probe + report
Edge Case Tests    Invalid inputs         Duplicate ID, null fields

6.2 Critical Test Cases

  1. Missing Type: Memory without type must be rejected.
  2. Probe with Missing Memory: Should fail with exit code 2.
  3. Determinism: Running the same probe twice must yield the same result (a pytest sketch follows below).
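
A pytest sketch of these three cases; the module name, helpers, and SchemaError are assumed APIs, not given by the spec:

import pytest

from memory_log import SchemaError, add_memory, run_probe  # assumed module API

def test_missing_type_rejected():
    with pytest.raises(SchemaError):
        add_memory({"text": "no type field", "source": "chat"})

def test_probe_with_missing_memory_fails():
    result = run_probe(query="How should I answer?", expected_id="EPI-99999")
    assert result["passed"] is False  # the CLI maps this to exit code 2

def test_probe_is_deterministic():
    first = run_probe(query="How should I answer?", expected_id="EPI-00017")
    second = run_probe(query="How should I answer?", expected_id="EPI-00017")
    assert (first["rank"], first["passed"]) == (second["rank"], second["passed"])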

6.3 Test Data

probe:
  query: "How should I answer?"
  expected: "EPI-00017"

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall                   Symptom              Solution
Missing validation        Garbage memory data  Enforce schema at ingestion
Non-deterministic probes  Flaky test results   Fix random seeds and retrieval params
Overly strict filters     Low recall           Relax type or recency filters

7.2 Debugging Strategies

  • Trace logging: Keep a retrieval trace for each probe.
  • Schema diffing: Compare memory records against expected schema.

7.3 Performance Traps

  • Storing all memory in a single table without indexing.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a source field for memory provenance.
  • Add a simple CSV export.

8.2 Intermediate Extensions

  • Add sensitivity-based filters.
  • Add probe tags and grouped reports.

8.3 Advanced Extensions

  • Add a replay mode that simulates memory retrieval over time.
  • Add a UI dashboard for probe analytics.

9. Real-World Connections

9.1 Industry Applications

  • Support bots verifying that preferences are retrieved.
  • Enterprise assistants with audit trails.

9.2 Tools & Frameworks

  • LangChain Memory - https://python.langchain.com/docs/how_to/memory/
  • LlamaIndex Memory - https://docs.llamaindex.ai/en/latest/module_guides/deploying/agents/memory/

9.3 Interview Relevance

  • Memory schema design and evaluation probes are common topics in agent system interviews.

10. Resources

10.1 Essential Reading

  • “AI Engineering” by Chip Huyen - Ch. 3-4 (evaluation and metrics)
  • “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 2-3 (data modeling)

10.2 Video Resources

  • Conference talks on RAG evaluation (search within recent AI engineering talks)

10.3 Tools & Documentation

  • SQLite docs (schema and indexing)
  • LangChain Memory docs

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why schema design is critical for memory systems.
  • I understand how recall probes measure retrieval correctness.
  • I can explain why traceability matters.

11.2 Implementation

  • All functional requirements are met.
  • All probes are deterministic.
  • Edge cases are handled.

11.3 Growth

  • I can describe at least one improvement to the schema.
  • I can explain probe results in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Schema validation and memory storage implemented
  • At least 5 recall probes defined and passing
  • Report output includes pass rate

Full Completion:

  • All minimum criteria plus:
  • Sensitivity filtering implemented
  • Probe trace logging implemented

Excellence (Going Above & Beyond):

  • Replay mode with historical probe comparison
  • Dashboard or visualization of recall trends