Project 8: Metadata Filtering with Vector Search

Add structured metadata filtering to vector search without breaking recall or performance.

Quick Reference

Attribute       Value
Difficulty      Level 3: Advanced
Time Estimate   10-16 hours
Language        Python
Prerequisites   Vector search basics, indexing
Key Topics      filtering, hybrid queries, index design

1. Learning Objectives

By completing this project, you will:

  1. Store metadata alongside vectors.
  2. Implement boolean and range filters.
  3. Combine pre-filter (filter-first) and post-filter (filter-after) strategies.
  4. Measure impact on recall and latency.
  5. Build a query planner for filters.

2. Theoretical Foundation

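Filtering shrinks the candidate set, but where the filter runs determines what it costs. Pre-filtering (filter-first) restricts the search to items that satisfy the metadata constraints, so every result is valid, but searching a large matching subset can be slow. Post-filtering (filter-after) runs the vector search first and then discards non-matching results. That is fast, but when few items match it can return fewer than k results or miss closer matching items, which lowers recall.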
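The difference between filtering before the search (pre-filter) and after it (post-filter) shows up even on a toy example. The sketch below uses brute-force search in plain Python as a stand-in for a real index; the function names and the five-item corpus are illustrative assumptions, not part of the project spec.

```python
from math import dist

# Toy corpus: (id, vector, metadata). All values are made up for illustration.
ITEMS = [
    (0, (0.0, 0.0), {"category": "news"}),
    (1, (0.1, 0.0), {"category": "blog"}),
    (2, (0.2, 0.0), {"category": "blog"}),
    (3, (0.3, 0.0), {"category": "news"}),
    (4, (0.9, 0.0), {"category": "news"}),
]


def top_k(query, items, k):
    """Brute-force nearest neighbours by Euclidean distance."""
    return sorted(items, key=lambda it: dist(query, it[1]))[:k]


def pre_filter_search(query, items, predicate, k):
    """Filter first, then search only the matching subset."""
    subset = [it for it in items if predicate(it[2])]
    return [it[0] for it in top_k(query, subset, k)]


def post_filter_search(query, items, predicate, k):
    """Search first, then drop results that fail the filter."""
    return [it[0] for it in top_k(query, items, k) if predicate(it[2])]


def is_news(meta):
    return meta["category"] == "news"


query = (0.0, 0.0)
pre = pre_filter_search(query, ITEMS, is_news, k=3)    # [0, 3, 4]
post = post_filter_search(query, ITEMS, is_news, k=3)  # [0]
print(pre, post)
```

With k=3, the post-filter path returns a single result because two of the three nearest neighbours are blog items. That shortfall is the recall loss this project asks you to measure.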


3. Project Specification

3.1 What You Will Build

A vector search system that supports metadata filters (e.g., date ranges, categories) with performance benchmarks.

3.2 Functional Requirements

  1. Metadata schema stored with vectors.
  2. Filter parser for queries.
  3. Filter strategies (pre-filter vs post-filter).
  4. Evaluation of recall and latency.
  5. Query planner to choose strategy.
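One possible shape for the filter parser in requirement 2 is a function that turns a nested dictionary into a predicate over a metadata record. The sketch below is only an illustration: the operator names (`gte`, `lt`, `and`, `or`, `not`) and the `parse_filter` function are assumptions chosen for this example, and real vector databases each define their own filter syntax.

```python
from typing import Any, Callable, Dict

Metadata = Dict[str, Any]
Predicate = Callable[[Metadata], bool]

# Hypothetical range operator names for this sketch.
_RANGE_OPS = {
    "gte": lambda value, bound: value >= bound,
    "gt": lambda value, bound: value > bound,
    "lte": lambda value, bound: value <= bound,
    "lt": lambda value, bound: value < bound,
}


def parse_filter(spec: Dict[str, Any]) -> Predicate:
    """Turn a filter spec into a predicate over one metadata record.

    Supported forms (assumed for this sketch):
      {"field": value}                 equality
      {"field": {"gte": a, "lt": b}}   range
      {"and": [...]}, {"or": [...]}, {"not": spec}
    Multiple keys in one spec are combined with AND. The keys "and", "or",
    and "not" are reserved, so metadata fields cannot use those names.
    """
    checks = []
    for key, value in spec.items():
        if key == "and":
            parts = [parse_filter(s) for s in value]
            checks.append(lambda m, parts=parts: all(p(m) for p in parts))
        elif key == "or":
            parts = [parse_filter(s) for s in value]
            checks.append(lambda m, parts=parts: any(p(m) for p in parts))
        elif key == "not":
            inner = parse_filter(value)
            checks.append(lambda m, inner=inner: not inner(m))
        elif isinstance(value, dict):
            unknown = set(value) - set(_RANGE_OPS)
            if unknown:
                raise ValueError(f"unknown range operators: {sorted(unknown)}")
            # A record with the field missing never matches a range filter.
            checks.append(
                lambda m, key=key, value=value: key in m
                and all(_RANGE_OPS[op](m[key], bound) for op, bound in value.items())
            )
        else:
            checks.append(lambda m, key=key, value=value: m.get(key) == value)
    return lambda m: all(check(m) for check in checks)


matches = parse_filter({"category": "news", "year": {"gte": 2020, "lt": 2023}})
print(matches({"category": "news", "year": 2021}))  # True
print(matches({"category": "news", "year": 2023}))  # False
```

Compiling the spec into a closure once keeps the per-record check cheap, which matters when the same filter is applied to every candidate.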

3.3 Non-Functional Requirements

  • Deterministic tests for filters.
  • Clear logs for filtering decisions.
  • Configurable filter policies.

4. Solution Architecture

4.1 Components

Component        Responsibility
Metadata Store   Store structured fields
Filter Engine    Apply filters
Vector Search    Retrieve candidates
Planner          Choose filter strategy
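As a rough illustration of the Planner component, the sketch below picks a strategy from the fraction of records that match the filter and logs the decision. The `PlannerPolicy` class, the 10% threshold, and the exact-count estimate are assumptions made for this example; a real planner might sample the data or use index statistics instead.

```python
import logging
from dataclasses import dataclass
from typing import Any, Callable, Dict, Sequence

logger = logging.getLogger("planner")

Metadata = Dict[str, Any]


@dataclass(frozen=True)
class PlannerPolicy:
    # Hypothetical default: pre-filter when at most 10% of records match.
    pre_filter_max_fraction: float = 0.10


def matching_fraction(
    records: Sequence[Metadata], predicate: Callable[[Metadata], bool]
) -> float:
    """Exact fraction of records that pass the filter (0.0 for an empty store)."""
    if not records:
        return 0.0
    return sum(1 for m in records if predicate(m)) / len(records)


def choose_strategy(
    records: Sequence[Metadata],
    predicate: Callable[[Metadata], bool],
    policy: PlannerPolicy = PlannerPolicy(),
) -> str:
    """Return "pre_filter" for selective filters, "post_filter" otherwise."""
    fraction = matching_fraction(records, predicate)
    if fraction <= policy.pre_filter_max_fraction:
        strategy = "pre_filter"
    else:
        strategy = "post_filter"
    logger.info(
        "matching_fraction=%.3f threshold=%.3f strategy=%s",
        fraction,
        policy.pre_filter_max_fraction,
        strategy,
    )
    return strategy


records = [{"category": "legal"}] * 5 + [{"category": "general"}] * 95
rare = choose_strategy(records, lambda m: m["category"] == "legal")
common = choose_strategy(records, lambda m: m["category"] == "general")
print(rare, common)  # pre_filter post_filter
```

Keeping the threshold in a policy object makes the filter policy configurable, and the log line records why each strategy was chosen.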

5. Implementation Guide

5.1 Project Structure

LEARN_VECTOR_DATABASES/P08-metadata-filtering/
├── src/
│   ├── metadata.py
│   ├── filters.py
│   ├── search.py
│   ├── planner.py
│   └── eval.py

5.2 Implementation Phases

Phase 1: Metadata + filters (4-6h)

  • Store metadata and parse filters.
  • Checkpoint: filters return correct subset.
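A minimal sketch of what Phase 1 could look like: an in-memory store that validates metadata against a schema on insert and returns the ids that pass a filter. The `VectorStore` class, its methods, and the two-field schema are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Sequence

# Hypothetical schema: field name -> required Python type.
SCHEMA = {"category": str, "year": int}


@dataclass
class VectorStore:
    schema: Dict[str, type]
    vectors: List[Sequence[float]] = field(default_factory=list)
    metadata: List[Dict[str, Any]] = field(default_factory=list)

    def add(self, vector: Sequence[float], meta: Dict[str, Any]) -> int:
        """Validate metadata against the schema, store it, return the row id."""
        missing = set(self.schema) - set(meta)
        if missing:
            raise ValueError(f"missing metadata fields: {sorted(missing)}")
        for name, expected in self.schema.items():
            if not isinstance(meta[name], expected):
                raise TypeError(f"{name!r} must be {expected.__name__}")
        self.vectors.append(tuple(vector))
        self.metadata.append(dict(meta))
        return len(self.vectors) - 1

    def matching_ids(self, predicate: Callable[[Dict[str, Any]], bool]) -> List[int]:
        """Ids of all rows whose metadata passes the predicate."""
        return [i for i, m in enumerate(self.metadata) if predicate(m)]


store = VectorStore(SCHEMA)
store.add((0.1, 0.2), {"category": "news", "year": 2021})
store.add((0.3, 0.4), {"category": "blog", "year": 2019})
store.add((0.5, 0.6), {"category": "news", "year": 2023})
news_ids = store.matching_ids(lambda m: m["category"] == "news")
print(news_ids)  # [0, 2]
```

Storing vectors and metadata in parallel lists keeps row ids aligned, so the id list from `matching_ids` can be handed directly to the search step.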

Phase 2: Search integration (4-6h)

  • Apply filters with vector search.
  • Checkpoint: filtered queries return correct results.
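One common way to integrate post-filtering is to over-fetch: request more than k neighbours and widen the request until enough of them pass the filter. The sketch below assumes a brute-force `search` function standing in for the real index; the function names and the doubling factor are illustrative choices, not a prescribed design.

```python
from math import dist
from typing import Any, Callable, Dict, List, Sequence, Tuple

Item = Tuple[int, Sequence[float], Dict[str, Any]]


def search(query: Sequence[float], items: Sequence[Item], k: int) -> List[Item]:
    """Stand-in for an ANN index: exact brute-force top-k."""
    return sorted(items, key=lambda it: dist(query, it[1]))[:k]


def post_filter_with_overfetch(
    query: Sequence[float],
    items: Sequence[Item],
    predicate: Callable[[Dict[str, Any]], bool],
    k: int,
    factor: int = 2,
) -> List[int]:
    """Fetch k neighbours, filter, and grow the fetch size until k hits are found."""
    if k < 0 or factor < 2:
        raise ValueError("k must be >= 0 and factor must be >= 2")
    fetch = k
    while True:
        candidates = search(query, items, fetch)
        hits = [it[0] for it in candidates if predicate(it[2])]
        # Stop when enough results pass the filter or the corpus is exhausted.
        if len(hits) >= k or fetch >= len(items):
            return hits[:k]
        fetch = min(fetch * factor, len(items))


items = [
    (0, (0.0, 0.0), {"category": "news"}),
    (1, (0.1, 0.0), {"category": "blog"}),
    (2, (0.2, 0.0), {"category": "blog"}),
    (3, (0.3, 0.0), {"category": "news"}),
    (4, (0.9, 0.0), {"category": "news"}),
]
result = post_filter_with_overfetch(
    (0.0, 0.0), items, lambda m: m["category"] == "news", k=3
)
print(result)  # [0, 3, 4]
```

Each retry repeats the search with a larger fetch size, so latency grows when the filter is selective. That cost is what makes pre-filtering attractive for rare matches.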

Phase 3: Evaluation (3-5h)

  • Measure recall and latency impact.
  • Checkpoint: report shows tradeoffs.
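The two measurements for this phase can be sketched with the standard library alone. The helper names below are assumptions for this example, and the ground truth is assumed to be the exact filtered top-k (for instance, from a brute-force pre-filter search over the same corpus).

```python
import time
from statistics import median
from typing import Callable, Sequence


def recall_at_k(retrieved: Sequence[int], truth: Sequence[int], k: int) -> float:
    """Share of ground-truth ids that appear in the first k retrieved ids."""
    if not truth:
        return 1.0  # convention for this sketch: nothing to find
    found = set(retrieved[:k]) & set(truth)
    return len(found) / len(truth)


def median_latency(fn: Callable[[], object], repeats: int = 5) -> float:
    """Median wall-clock seconds over several calls of a zero-argument function."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return median(samples)


# Illustrative ids only: the exact filtered top-3 vs. a post-filter result.
truth = [0, 3, 4]
post_filter_ids = [0]
recall = recall_at_k(post_filter_ids, truth, k=3)

# Any zero-argument callable can be timed; here a sort stands in for a query.
latency = median_latency(lambda: sorted(range(10_000), reverse=True))
print(f"recall@3={recall:.2f} median_latency={latency * 1e3:.3f} ms")
```

Reporting the median rather than the mean makes the latency figure less sensitive to a single slow run.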

6. Testing Strategy

6.1 Test Categories

Category      Purpose   Examples
Unit          filters   range and boolean logic
Integration   search    filtered top-k results
Regression    planner   consistent strategy choices

6.2 Critical Test Cases

  1. Filtered results respect metadata constraints.
  2. Planner chooses pre-filter for highly selective filters (few matching items).
  3. Recall drop measured and reported.
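The three cases above could be written as deterministic, pytest-style tests along the following lines. Everything here is a self-contained stand-in: the seeded corpus, the brute-force search functions, and the 10% planner threshold are assumptions for illustration, and in the real project the tests would import from `src/` instead.

```python
import random
from math import dist


def make_items(n, seed=0):
    """Deterministic toy corpus: (id, 2-D vector, metadata)."""
    rng = random.Random(seed)
    return [
        (i, (rng.random(), rng.random()), {"year": 2005 + i % 20})
        for i in range(n)
    ]


def pre_filter_search(query, items, predicate, k):
    subset = [it for it in items if predicate(it[2])]
    nearest = sorted(subset, key=lambda it: dist(query, it[1]))[:k]
    return [it[0] for it in nearest]


def post_filter_search(query, items, predicate, k):
    nearest = sorted(items, key=lambda it: dist(query, it[1]))[:k]
    return [it[0] for it in nearest if predicate(it[2])]


def choose_strategy(items, predicate, threshold=0.10):
    fraction = sum(predicate(it[2]) for it in items) / len(items)
    return "pre_filter" if fraction <= threshold else "post_filter"


ITEMS = make_items(200)
QUERY = (0.5, 0.5)


def recent(meta):
    return meta["year"] >= 2024  # matches 10 of the 200 items


def test_results_respect_constraints():
    by_id = {it[0]: it[2] for it in ITEMS}
    for search in (pre_filter_search, post_filter_search):
        ids = search(QUERY, ITEMS, recent, k=10)
        assert all(recent(by_id[i]) for i in ids)


def test_planner_prefers_pre_filter_for_selective_filters():
    assert choose_strategy(ITEMS, recent) == "pre_filter"
    assert choose_strategy(ITEMS, lambda m: m["year"] >= 2006) == "post_filter"


def test_recall_drop_is_measured():
    truth = pre_filter_search(QUERY, ITEMS, recent, k=10)
    got = post_filter_search(QUERY, ITEMS, recent, k=10)
    recall = len(set(got) & set(truth)) / len(truth)
    assert 0.0 <= recall <= 1.0
    # Post-filter hits are always a subset of the exact filtered top-k.
    assert set(got) <= set(truth)
```

Seeding the random generator keeps the corpus identical across runs, which is what makes the filter tests deterministic.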

7. Common Pitfalls & Debugging

Pitfall                  Symptom               Fix
Over-filtering           empty results         relax filter thresholds
Slow queries             too many candidates   pre-filter selectively
Inconsistent filtering   wrong results         add filter unit tests

8. Extensions & Challenges

Beginner

  • Add keyword + metadata hybrid search.
  • Add filter presets.

Intermediate

  • Add filter indexes for speed.
  • Add query planner heuristics.

Advanced

  • Add cost-based planner.
  • Add multi-tenant filter policies.

9. Real-World Connections

  • Enterprise search relies on metadata filters.
  • Compliance systems need filtered retrieval.

10. Resources

  • Vector DB filter docs
  • Search system design references

11. Self-Assessment Checklist

  • I can store metadata and query it with filters.
  • I can combine filtering and vector search.
  • I can measure recall impacts.

12. Submission / Completion Criteria

Minimum Completion:

  • Metadata filtering with vector search

Full Completion:

  • Evaluation of filter impact

Excellence:

  • Cost-based planner
  • Filter indexes

This guide was generated from project_based_ideas/AI_AGENTS_LLM_RAG/LEARN_VECTOR_DATABASES.md.