Project 1: Memory Event Logger + Recall Probes

Build a strict memory event log and a deterministic recall probe runner so you can measure whether memory retrieval actually works.

Quick Reference

Attribute                           Value
Difficulty                          Level 2
Time Estimate                       Weekend
Main Programming Language           Python
Alternative Programming Languages   TypeScript, Go
Coolness Level                      Level 2
Business Potential                  Level 2
Prerequisites                       JSON, SQLite, basic CLI usage
Key Topics                          Memory taxonomy, schema design, evaluation probes

1. Learning Objectives

By completing this project, you will:

  1. Design a strict schema that separates memory types and sensitivity.
  2. Build a deterministic recall probe system with pass/fail scoring.
  3. Track memory lineage and retrieval traces for auditing.
  4. Explain why a memory was retrieved and where it was placed.

2. All Theory Needed (Per-Concept Breakdown)

Memory Event Schema and Recall Evaluation

Fundamentals Memory systems collapse without structure. A memory event schema is the contract between what you store and what you can retrieve later. It defines the non-negotiable fields: type (episodic/semantic/preference/procedural), source (user, tool, system), time, confidence, sensitivity, and optional consent. Without these fields you cannot filter or audit memory, and retrieval becomes a guessing game. A recall probe is a deterministic test that validates whether the system can retrieve a specific memory in a controlled context. It is not a benchmark for model intelligence; it is a test of your memory pipeline. A good probe defines the query, the expected memory ID or phrase, and the placement requirement (e.g., “must appear in the top-3 retrieved memories”). Together, the schema and probes are the minimal foundation for all higher-level memory improvements.
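
As a sketch, this contract might be encoded as a strict Python type; the field names mirror the schema above, while the enum values and the MemoryEvent name are illustrative assumptions:

from dataclasses import dataclass
from typing import Optional

MEMORY_TYPES = {"episodic", "semantic", "preference", "procedural"}
SOURCES = {"user", "tool", "system"}
SENSITIVITY_LEVELS = {"low", "medium", "high"}  # illustrative labels

@dataclass(frozen=True)
class MemoryEvent:
    id: str                         # e.g. "EPI-00017"
    type: str                       # one of MEMORY_TYPES
    text: str
    source: str                     # one of SOURCES
    timestamp: str                  # ISO-8601, e.g. "2026-01-01T10:00:00Z"
    confidence: float               # 0.0-1.0; inferred memories score lower
    sensitivity: str                # one of SENSITIVITY_LEVELS
    consent: Optional[bool] = None  # the one optional field in the contract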

Deep Dive into the concept Schema design is about making memory operational. Every memory entry should answer: what it is, where it came from, when it happened, how reliable it is, and how sensitive it is. In agent memory, types matter because they control retrieval policy. For example, preference memory is sensitive and should only be retrieved with explicit consent, while episodic memory is less sensitive but should decay over time. That means your schema must include both type and sensitivity to allow policy enforcement. Confidence is another key field: many memories are inferred rather than stated directly, and you need a way to encode uncertainty. A common mistake is to store only text and embedding; this makes it impossible to filter or update memory later. A robust schema allows edits, merges, and deletions without breaking the audit trail.
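
Building on the MemoryEvent sketch above, a retrieval policy filter shows why type and sensitivity must live in the schema; the consent gate and the 30-day episodic window are illustrative rules, not prescriptions:

from datetime import datetime, timedelta, timezone

def passes_policy(event: MemoryEvent, now: datetime) -> bool:
    """Decide whether a stored memory may enter the retrieval candidate set."""
    if event.type == "preference":
        # Preference memory is sensitive: retrieve only with explicit consent.
        return event.consent is True
    if event.type == "episodic":
        # Episodic memory decays: drop entries older than an illustrative 30 days.
        ts = datetime.fromisoformat(event.timestamp.replace("Z", "+00:00"))
        return now - ts <= timedelta(days=30)
    return True

# usage: passes_policy(event, datetime.now(timezone.utc))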

Recall evaluation turns memory into a measurable system. A recall probe is a structured test that asks, “When a query is given, does memory X appear in the retrieved set and in the prompt?” This is not about whether the model uses the memory, but whether the pipeline delivered it. To make probes deterministic, you must freeze variables: use fixed queries, fixed retrieval parameters, and fixed temperature. You also need a definition of success (e.g., memory ID is in top-3 retrieval results and placed in the prompt). This allows you to compare memory system versions over time. If a change in schema or retrieval policy reduces probe pass rates, you know the system regressed. Probes also help reveal false positives: if a retrieval system returns irrelevant memories, you can detect it by inserting negative probes.
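
A minimal probe-runner sketch under those frozen variables; `retrieve` is a hypothetical function that must itself be deterministic (fixed top-k, stable tie-breaking):

import time

def run_probe(query: str, expected_id: str, retrieve, top_k: int = 3) -> dict:
    """Run one recall probe with fixed parameters and return a scored result."""
    start = time.perf_counter()
    candidates = retrieve(query, top_k=top_k)  # list of memory IDs, best first
    latency_ms = (time.perf_counter() - start) * 1000
    rank = candidates.index(expected_id) + 1 if expected_id in candidates else None
    return {
        "query": query,
        "expected": expected_id,
        "rank": rank,                # 1-based rank, or None if not retrieved
        "passed": rank is not None,  # success: expected ID within the top-k
        "latency_ms": round(latency_ms, 1),
    }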

Evaluation should be multi-dimensional. Basic recall@k tells you whether the memory is retrieved; latency measures whether it is retrieved fast enough for interactive use. A memory system that retrieves correctly but takes two seconds per query is unusable in real agents. A complete evaluation includes recall, latency, and traceability. Traceability means you can explain why a memory was retrieved (similarity score, recency, type match) and where it was placed in the prompt. This is critical for debugging, because retrieval errors are often silent. The schema and probe system together form the “unit tests” for memory, and you will reuse this foundation in every subsequent project.
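
Traceability can be as simple as one persisted record per retrieved memory; the fields below sketch what "why retrieved, where placed" might capture (all identifiers and values are hypothetical):

trace = {
    "probe_id": "probe-001",
    "memory_id": "EPI-00017",
    "similarity": 0.91,          # why retrieved: score against the query
    "recency_boost": 0.05,       # why retrieved: recency contribution
    "type_match": True,          # why retrieved: passed the type/policy filter
    "rank": 2,                   # position in the retrieved set
    "placement_zone": "anchor",  # where it was placed in the prompt
}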

From a systems perspective, this concept must be treated as a first-class interface between data and behavior. That means you need explicit invariants (what must always be true), observability (how you know it is true), and failure signatures (how it breaks when it is not). In practice, engineers often skip this and rely on ad-hoc fixes, which creates hidden coupling between the memory subsystem and the rest of the agent stack. A better approach is to model the concept as a pipeline stage with clear inputs, outputs, and preconditions: if inputs violate the contract, the stage should fail fast rather than silently corrupt memory. This is especially important because memory errors are long-lived and compound over time. You should also define operational metrics that reveal drift early. Examples include: the percentage of memory entries that lack required metadata, the ratio of retrieved memories that are later unused by the model, or the fraction of queries that trigger a fallback route because the primary memory store is empty. These metrics are not just for dashboards; they are design constraints that force you to keep the system testable and predictable.
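
As a sketch, two of these drift metrics can be computed directly against the SQLite store; the table and column names are assumptions that match the shapes used elsewhere in this project:

import sqlite3

conn = sqlite3.connect("memory.db")  # assumed database path

# Fraction of memory entries missing required metadata.
missing = conn.execute(
    "SELECT AVG(CASE WHEN type IS NULL OR timestamp IS NULL "
    "OR sensitivity IS NULL THEN 1.0 ELSE 0.0 END) FROM memory_events"
).fetchone()[0]

# Fraction of probe runs that retrieved nothing (a fallback-route signal).
empty_rate = conn.execute(
    "SELECT AVG(CASE WHEN rank IS NULL THEN 1.0 ELSE 0.0 END) FROM probe_results"
).fetchone()[0]

print(f"missing-metadata rate: {missing or 0:.2%}, "
      f"empty-retrieval rate: {empty_rate or 0:.2%}")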

Another critical dimension is lifecycle management. The concept may work well at small scale but degrade as the memory grows. This is where policies and thresholds matter: you need rules for promotion, demotion, merging, or deletion that prevent the memory from becoming a landfill. The policy should be deterministic and versioned. When it changes, you should be able to replay historical inputs and measure the delta in outputs. This is the same discipline used in data engineering for schema changes and backfills, and it applies equally to memory systems. Finally, remember that memory is an interface to user trust. If the memory system is noisy, the agent feels unreliable; if it is overly strict, the agent feels forgetful. The best designs expose these trade-offs explicitly, so you can tune them according to product goals rather than guessing in the dark.
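
A deterministic, versioned policy can be an ordinary pure function tagged with a version string, so historical inputs can be replayed against old and new rules. Reusing the MemoryEvent sketch from earlier, with every threshold illustrative:

POLICY_VERSION = "lifecycle-v2"  # bump on every rule change; store with each decision

def lifecycle_action(event: MemoryEvent, age_days: float, hit_count: int) -> str:
    """Pure function: identical inputs always yield the identical action."""
    if event.confidence < 0.3 and hit_count == 0:
        return "delete"   # low confidence and never retrieved: landfill risk
    if event.type == "episodic" and age_days > 90:
        return "demote"   # old episodes move to cold storage
    if hit_count >= 5:
        return "promote"  # frequently used memories stay hot
    return "keep"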

How this fits into the projects This concept is the foundation for the entire sprint. It is applied directly in the schema design of Project 1 and reused in Projects 2, 3, 4, and 7.

Definitions & key terms

  • Memory event: A structured record of a single memory item.
  • Schema: A fixed set of fields required for storage and retrieval.
  • Recall probe: Deterministic test to check retrieval and placement.
  • Lineage: Links from derived memories back to sources.

Mental model diagram (ASCII)

Memory Event
+-------------+-------------------+
| id          | EPI-00017         |
| type        | episodic          |
| text        | "user prefers..." |
| source      | chat              |
| ts          | 2026-01-01T...    |
| confidence  | 0.82              |
| sensitivity | low               |
+-------------+-------------------+

Probe
query -> retrieve -> check placement -> pass/fail

How It Works (Step-by-Step)

  1. Define a schema with required fields and validation rules.
  2. Store each memory event with type and sensitivity.
  3. Build a probe that includes a query and expected memory ID.
  4. Run retrieval with fixed parameters.
  5. Check if the expected memory appears in the top-k and is injected into the prompt.
  6. Record pass/fail and latency.

Minimal Concrete Example

probe:
  query: "How should I respond?"
  expected_memory_id: "PRF-00005"
  required_rank: <=3
  placement_zone: "anchor"

Common Misconceptions

  • “If it is in the vector store, it is retrievable.” (False: indexing and policy filters can prevent retrieval.)
  • “Probes measure model intelligence.” (False: they measure retrieval pipeline correctness.)

Check-Your-Understanding Questions

  1. Why do memory events need a sensitivity field?
  2. What makes a probe deterministic?
  3. Why is retrieval traceability important?

Check-Your-Understanding Answers

  1. It enables safe filtering and consent policies.
  2. Fixed query, fixed retrieval parameters, fixed placement check.
  3. It explains why the system retrieved a memory and helps debug failures.

Real-World Applications

  • Customer support agents auditing whether critical preferences are retrieved.
  • Compliance audits of memory usage in regulated environments.

Where You’ll Apply It

  • In this project: §5.4 Concepts You Must Understand First and §6 Testing Strategy.
  • Also used in: Project 3, Project 7.

References

  • “A-MEM” (agentic memory schemas) - https://arxiv.org/abs/2502.12110
  • “AI Engineering” by Chip Huyen - Ch. 3-4 (evaluation methods)

Key Insights A memory system without a schema and probes is untestable and therefore unreliable.

Summary This concept defines how to structure memory and measure retrieval. It is the minimum foundation for every memory system that follows.

Homework/Exercises to Practice the Concept

  1. Design a schema for four memory types with required fields.
  2. Draft five recall probes with expected memory IDs.

Solutions to the Homework/Exercises

  1. Include fields: type, source, timestamp, confidence, sensitivity, consent.
  2. Probes should specify query, expected memory ID, and required rank.

3. Project Specification

3.1 What You Will Build

A CLI tool that:

  • Accepts memory events in a structured schema
  • Validates fields and rejects invalid entries
  • Stores events in SQLite
  • Runs recall probes with deterministic retrieval
  • Outputs pass/fail scores and latency statistics

3.2 Functional Requirements

  1. Schema Validation: Reject entries missing required fields.
  2. Memory Storage: Store events with IDs, timestamps, and metadata.
  3. Probe Runner: Execute probes and report pass/fail.
  4. Trace Logging: Record which memories were injected and why.
  5. Reporting: Summary stats by type, sensitivity, and pass rate.

3.3 Non-Functional Requirements

  • Performance: P95 probe runtime < 300ms for 1k memories.
  • Reliability: Deterministic outputs for identical probes.
  • Usability: Clear CLI messages and error codes.

3.4 Example Usage / Output

$ memory-log add --type episodic --text "User prefers concise answers" --source chat
[OK] memory_id=EPI-00017 stored

$ memory-log probe --query "How should I answer?" --expect "EPI-00017"
[PROBE] retrieved=EPI-00017 rank=2 placement=anchor latency=120ms
[RESULT] PASS

3.5 Data Formats / Schemas / Protocols

Memory event JSON (schema shape):

{
  "id": "EPI-00017",
  "type": "episodic",
  "text": "User prefers concise answers",
  "source": "chat",
  "timestamp": "2026-01-01T10:00:00Z",
  "confidence": 0.82,
  "sensitivity": "low",
  "consent": true
}
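
One plausible SQLite mapping of this schema (a sketch; the table name and index choices are assumptions aimed at the type- and sensitivity-based filtering this project requires):

import sqlite3

conn = sqlite3.connect("memory.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS memory_events (
    id          TEXT PRIMARY KEY,   -- PRIMARY KEY rejects duplicate IDs (see §3.6)
    type        TEXT NOT NULL CHECK (type IN
                  ('episodic','semantic','preference','procedural')),
    text        TEXT NOT NULL,
    source      TEXT NOT NULL,
    timestamp   TEXT NOT NULL,      -- ISO-8601 string
    confidence  REAL NOT NULL,
    sensitivity TEXT NOT NULL CHECK (sensitivity IN ('low','medium','high')),
    consent     INTEGER             -- nullable boolean (0/1/NULL)
);
CREATE INDEX IF NOT EXISTS idx_events_type ON memory_events(type);
CREATE INDEX IF NOT EXISTS idx_events_sensitivity ON memory_events(sensitivity);
""")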

3.6 Edge Cases

  • Missing type or timestamp
  • Invalid sensitivity label
  • Probe expects a memory that does not exist
  • Duplicate memory IDs

3.7 Real World Outcome

The transcripts below are a golden reference for correct behavior.

3.7.1 How to Run (Copy/Paste)

$ memory-log add --type episodic --text "User prefers concise answers" --source chat
$ memory-log probe --query "How should I answer?" --expect "EPI-00017"
$ memory-log report

3.7.2 Golden Path Demo (Deterministic)

$ memory-log probe --query "How should I answer?" --expect "EPI-00017"
[PROBE] retrieved=EPI-00017 rank=2 placement=anchor latency=120ms
[RESULT] PASS
exit_code=0

3.7.3 Failure Demo (Deterministic)

$ memory-log probe --query "How should I answer?" --expect "EPI-99999"
[PROBE] retrieved=NONE latency=95ms
[RESULT] FAIL (expected memory not retrieved)
exit_code=2
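
The exit codes in these two demos imply a small, stable contract; here is a sketch of how the CLI might enforce it, using the result shape from the earlier probe-runner sketch (code 1 for usage errors is an assumption beyond what the demos show):

import sys

EXIT_PASS = 0        # probe passed, as in §3.7.2
EXIT_USAGE = 1       # assumed: invalid arguments or schema errors
EXIT_PROBE_FAIL = 2  # expected memory not retrieved, as in §3.7.3

def exit_for(result: dict) -> None:
    """Map a probe result to the CLI's exit-code contract."""
    sys.exit(EXIT_PASS if result["passed"] else EXIT_PROBE_FAIL)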

4. Solution Architecture

4.1 High-Level Design

+------------+    +------------------+    +--------------+
| CLI Parser | -> | Schema Validator | -> | SQLite Store |
+------------+    +------------------+    +--------------+
      |                    |                     |
      v                    v                     v
 Probe Runner  ---->  Retrieval Engine  ---->  Reporter

4.2 Key Components

Component         Responsibility           Key Decisions
Schema Validator  Enforce required fields  Strict vs. permissive validation
Store             Persist memory events    SQLite schema and indexing
Probe Runner      Execute recall probes    Deterministic retrieval rules
Reporter          Aggregate metrics        Pass/fail scoring format

4.3 Data Structures (No Full Code)

MemoryEvent:
  id: string
  type: enum
  text: string
  source: string
  timestamp: iso8601
  confidence: float
  sensitivity: enum
  consent: bool

4.4 Algorithm Overview

Key Algorithm: Probe Execution

  1. Load probe query and expected ID.
  2. Retrieve candidates using fixed policy.
  3. Verify expected ID is within top-k.
  4. Record latency and result.

Complexity Analysis:

  • Time: O(n) to score all stored memories against the query, plus O(k) to check the top-k results
  • Space: O(n) for the memory store

5. Implementation Guide

5.1 Development Environment Setup

  • Install SQLite and verify the version
  • Create a virtual environment
  • Prepare a local config file for defaults (db path, top-k)

5.2 Project Structure

project-root/
├── src/
│   ├── cli/
│   ├── schema/
│   ├── store/
│   ├── probes/
│   └── report/
├── tests/
└── README.md

5.3 The Core Question You’re Answering

“What exactly counts as memory, and how do I prove the system retrieved it?”

5.4 Concepts You Must Understand First

  1. Schema validation
    • How do you enforce required fields?
    • How do you prevent invalid types?
  2. Probe determinism
    • How do you fix randomness? (a frozen-config sketch follows this list)
    • How do you define pass/fail rules?
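
One way to fix the randomness, sketched under the assumption that every knob lives in a single frozen config object (all values illustrative):

from dataclasses import dataclass

@dataclass(frozen=True)
class RetrievalConfig:
    """Freeze everything that could vary between probe runs."""
    top_k: int = 3
    temperature: float = 0.0   # no sampling randomness
    seed: int = 42             # fixed seed for any stochastic component
    tie_break: str = "id_asc"  # stable ordering for equal similarity scores

FROZEN = RetrievalConfig()     # one config, versioned alongside the probe suite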

5.5 Questions to Guide Your Design

  1. Schema Enforcement
    • Should invalid memories be rejected or quarantined?
    • How will you evolve the schema?
  2. Probe Design
    • What is the minimum number of probes for coverage?
    • How do you detect false positives?

5.6 Thinking Exercise

Draw a pipeline showing how a memory event becomes a probe result, including all intermediate components.

5.7 The Interview Questions They’ll Ask

  1. “Why is memory schema design critical?”
  2. “How do you test memory retrieval correctness?”
  3. “What fields are non-negotiable in memory records?”
  4. “How do you handle schema evolution?”
  5. “What is recall@k and why does it matter?”

5.8 Hints in Layers

Hint 1: Start with validation Reject invalid memory events early.

Hint 2: Use fixed top-k Keep retrieval size constant for probes.

Hint 3: Add a trace log Store a retrieval trace for each probe.

Hint 4: Build a summary report Aggregate pass rates by memory type.

5.9 Books That Will Help

Topic          Book                                                         Chapter
Evaluation     "AI Engineering" by Chip Huyen                               Ch. 3-4
Data modeling  "Designing Data-Intensive Applications" by Martin Kleppmann  Ch. 2-3

5.10 Implementation Phases

Phase 1: Foundation (4-6 hours)

Goals:

  • Define schema and validation rules
  • Create SQLite tables

Tasks:

  1. Design and document the schema.
  2. Build validation logic and error reporting.

Checkpoint: Adding an invalid memory fails with a clear error.

Phase 2: Core Functionality (6-8 hours)

Goals:

  • Add memory events
  • Run probes and record results

Tasks:

  1. Implement event ingestion.
  2. Implement probe runner with deterministic settings.

Checkpoint: A probe can pass/fail reliably.

Phase 3: Polish & Edge Cases (4-6 hours)

Goals:

  • Reports and edge cases
  • Error codes and CLI UX

Tasks:

  1. Add report summaries.
  2. Add failure modes and error codes.

Checkpoint: Report output matches expected stats.

5.11 Key Implementation Decisions

Decision           Options                Recommendation  Rationale
Schema strictness  Reject / Quarantine    Reject          Keeps memory clean and predictable
Probe evaluation   Text match / ID match  ID match        Deterministic and auditable
Storage            SQLite / JSON file     SQLite          Enables indexing and querying

6. Testing Strategy

6.1 Test Categories

Category           Purpose                Examples
Unit Tests         Validate schema rules  Missing field, invalid type
Integration Tests  Probe workflow         Ingest + probe + report
Edge Case Tests    Invalid inputs         Duplicate ID, null fields

6.2 Critical Test Cases

  1. Missing Type: Memory without type must be rejected.
  2. Probe with Missing Memory: Should fail with exit code 2.
  3. Determinism: Running the same probe twice must yield the same result (a pytest sketch follows below).
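
A pytest sketch of these three cases; the module name, helpers, and SchemaError are assumed APIs, not given by the spec:

import pytest

from memory_log import SchemaError, add_memory, run_probe  # assumed module API

def test_missing_type_rejected():
    with pytest.raises(SchemaError):
        add_memory({"text": "no type field", "source": "chat"})

def test_probe_with_missing_memory_fails():
    result = run_probe(query="How should I answer?", expected_id="EPI-99999")
    assert result["passed"] is False  # the CLI maps this to exit code 2

def test_probe_is_deterministic():
    first = run_probe(query="How should I answer?", expected_id="EPI-00017")
    second = run_probe(query="How should I answer?", expected_id="EPI-00017")
    assert (first["rank"], first["passed"]) == (second["rank"], second["passed"])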

6.3 Test Data

probe:
  query: "How should I answer?"
  expected: "EPI-00017"

7. Common Pitfalls & Debugging

7.1 Frequent Mistakes

Pitfall                   Symptom              Solution
Missing validation        Garbage memory data  Enforce schema at ingestion
Non-deterministic probes  Flaky test results   Fix random seeds and retrieval params
Overly strict filters     Low recall           Relax type or recency filters

7.2 Debugging Strategies

  • Trace logging: Keep a retrieval trace for each probe.
  • Schema diffing: Compare memory records against expected schema.

7.3 Performance Traps

  • Storing all memory in a single table without indexing.

8. Extensions & Challenges

8.1 Beginner Extensions

  • Add a source field for memory provenance.
  • Add a simple CSV export.

8.2 Intermediate Extensions

  • Add sensitivity-based filters.
  • Add probe tags and grouped reports.

8.3 Advanced Extensions

  • Add a replay mode that simulates memory retrieval over time.
  • Add a UI dashboard for probe analytics.

9. Real-World Connections

9.1 Industry Applications

  • Support bots verifying that preferences are retrieved.
  • Enterprise assistants with audit trails.

9.2 Tools & Frameworks

  • LangChain Memory - https://python.langchain.com/docs/how_to/memory/
  • LlamaIndex Memory - https://docs.llamaindex.ai/en/latest/module_guides/deploying/agents/memory/

9.3 Interview Relevance

  • Memory schema design and evaluation probes are common topics in agent system interviews.

10. Resources

10.1 Essential Reading

  • “AI Engineering” by Chip Huyen - Ch. 3-4 (evaluation and metrics)
  • “Designing Data-Intensive Applications” by Martin Kleppmann - Ch. 2-3 (data modeling)

10.2 Video Resources

  • Conference talks on RAG evaluation (search within recent AI engineering talks)

10.3 Tools & Documentation

  • SQLite docs (schema and indexing)
  • LangChain Memory docs

11. Self-Assessment Checklist

11.1 Understanding

  • I can explain why schema design is critical for memory systems.
  • I understand how recall probes measure retrieval correctness.
  • I can explain why traceability matters.

11.2 Implementation

  • All functional requirements are met.
  • All probes are deterministic.
  • Edge cases are handled.

11.3 Growth

  • I can describe at least one improvement to the schema.
  • I can explain probe results in an interview.

12. Submission / Completion Criteria

Minimum Viable Completion:

  • Schema validation and memory storage implemented
  • At least 5 recall probes defined and passing
  • Report output includes pass rate

Full Completion:

  • All minimum criteria plus:
  • Sensitivity filtering implemented
  • Probe trace logging implemented

Excellence (Going Above & Beyond):

  • Replay mode with historical probe comparison
  • Dashboard or visualization of recall trends