LEARN FULL TEXT SEARCH DEEP DIVE

Learn Full-Text Search: From Zero to Search Engine Architect

Goal: Deeply understand full-text search—from the fundamental data structures (inverted indexes, tries, FSTs) to ranking algorithms (TF-IDF, BM25), text analysis pipelines (tokenization, stemming, lemmatization), and distributed search architectures. You’ll learn not just how to USE search engines like Elasticsearch, Meilisearch, and PostgreSQL full-text search, but HOW they work internally by building your own implementations from scratch.


Why Full-Text Search Matters

Every time you type in Google, search your email, find a product on Amazon, or use Ctrl+F in a document, you’re using full-text search. It’s one of the most fundamental problems in computer science, and the solutions are elegant, mathematically beautiful, and deeply practical.

The Scale of the Problem

Consider this: Google's index spans hundreds of billions of web pages (its crawlers know of trillions more). When you type a query, results come back in a fraction of a second. How is this even possible?

The naive approach—scanning every document for your search terms—would take years. Instead, search engines use a brilliant data structure called the inverted index, combined with sophisticated ranking algorithms, compression techniques, and distributed systems.

Historical Context

Timeline of Search Innovation:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1945 │ Vannevar Bush proposes "Memex" - conceptual search machine
     │
1960 │ Gerard Salton develops SMART system - first vector space model
     │
1976 │ Robertson & Spärck Jones develop probabilistic retrieval (BM25 roots)
     │
1994 │ Robertson et al. introduce the BM25 ranking function ("Okapi at TREC-3")
     │
1998 │ Google founded - PageRank revolutionizes web search
     │
1999 │ Doug Cutting starts Lucene project
     │
2005 │ Lucene becomes a top-level Apache project
     │
2010 │ Elasticsearch launched (built on Lucene)
     │
2018 │ Meilisearch founded (Rust-based alternative)
     │
2024 │ Vector search + LLMs merge with traditional search (RAG)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

What Understanding Search Unlocks

  1. Database Internals: Inverted indexes are used everywhere—not just search
  2. Distributed Systems: Search clusters are masterclasses in distributed computing
  3. Algorithm Design: Ranking algorithms balance precision, recall, and performance
  4. Data Structures: Tries, FSTs, skip lists, roaring bitmaps—all in one domain
  5. NLP Fundamentals: Tokenization, stemming, and language processing
  6. Compression: Real-world application of information theory

Core Concept Analysis

The Fundamental Problem

You have documents. Users have queries. How do you find matches FAST?

THE NAIVE APPROACH (Don't do this!)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Query: "quick brown fox"

┌─────────────────────────────────────────────────────────────┐
│  For each document (millions of them):                      │
│    For each word in document:                               │
│      If word matches any query term:                        │
│        Mark as potential match                              │
│                                                             │
│  Time Complexity: O(documents × avg_doc_length × query_len) │
│  For 1 million docs of 1000 words: ~1 BILLION operations    │
└─────────────────────────────────────────────────────────────┘

Result: UNUSABLY SLOW 🐢

THE INVERTED INDEX APPROACH
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

PRE-PROCESSING (done once at index time):
Build a map: term → [list of documents containing term]

┌─────────────────────────────────────────────────────────────┐
│  "quick" → [doc1, doc7, doc42, doc999, ...]                │
│  "brown" → [doc1, doc3, doc7, doc15, ...]                  │
│  "fox"   → [doc1, doc7, doc88, ...]                        │
└─────────────────────────────────────────────────────────────┘

QUERY TIME:
1. Look up "quick" → get list A
2. Look up "brown" → get list B
3. Look up "fox"   → get list C
4. Intersect A ∩ B ∩ C → [doc1, doc7]

Time Complexity: O(query_terms × log(vocabulary) + intersection_cost)
For the same query: thousands of operations instead of a billion

Result: BLAZING FAST 🚀
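
To make the lookup-then-intersect idea concrete, here is a minimal Python sketch (toy documents, naive whitespace tokenization, no ranking; nothing like a production engine):

# Minimal in-memory inverted index (illustrative sketch, not a real engine).
from collections import defaultdict

def build_index(docs):
    """docs: dict of doc_id -> text. Returns term -> sorted list of doc IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():          # naive whitespace tokenization
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def and_query(index, terms):
    """Intersect the postings lists of all query terms."""
    postings = [set(index.get(t.lower(), [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: "The quick brown fox jumps over the lazy dog",
    2: "The lazy cat sleeps all day",
    3: "Quick foxes are brown and clever",
}
index = build_index(docs)
print(and_query(index, ["quick", "brown", "fox"]))   # -> [1]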

The Inverted Index: Heart of Every Search Engine

FORWARD INDEX vs INVERTED INDEX
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

FORWARD INDEX (how documents are stored):
┌─────────────────────────────────────────────────────────┐
│ Doc1: "The quick brown fox jumps over the lazy dog"    │
│ Doc2: "The lazy cat sleeps all day"                    │
│ Doc3: "Quick foxes are brown and clever"               │
└─────────────────────────────────────────────────────────┘

INVERTED INDEX (how we search):
┌──────────────┬───────────────────────────────────────────┐
│    TERM      │           POSTINGS LIST                   │
├──────────────┼───────────────────────────────────────────┤
│ "brown"      │ → [Doc1:pos3, Doc3:pos4]                  │
│ "cat"        │ → [Doc2:pos3]                             │
│ "clever"     │ → [Doc3:pos6]                             │
│ "dog"        │ → [Doc1:pos9]                             │
│ "fox"        │ → [Doc1:pos4]                             │
│ "foxes"      │ → [Doc3:pos2]                             │
│ "jumps"      │ → [Doc1:pos5]                             │
│ "lazy"       │ → [Doc1:pos8, Doc2:pos2]                  │
│ "over"       │ → [Doc1:pos6]                             │
│ "quick"      │ → [Doc1:pos2, Doc3:pos1]                  │
│ "sleeps"     │ → [Doc2:pos4]                             │
│ "the"        │ → [Doc1:pos1,pos7, Doc2:pos1]             │
└──────────────┴───────────────────────────────────────────┘
         │                        │
         │                        └── Postings: doc IDs + positions
         └── Dictionary: sorted terms (enables binary search)

Anatomy of a Postings List

POSTINGS LIST STRUCTURE
━━━━━━━━━━━━━━━━━━━━━━━━

For term "quick":

┌─────────────────────────────────────────────────────────────────┐
│                         POSTINGS LIST                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐      │
│  │ Doc ID  │───▶│ Doc ID  │───▶│ Doc ID  │───▶│ Doc ID  │      │
│  │   1     │    │   7     │    │   42    │    │  999    │      │
│  ├─────────┤    ├─────────┤    ├─────────┤    ├─────────┤      │
│  │ TF: 2   │    │ TF: 1   │    │ TF: 5   │    │ TF: 1   │      │
│  ├─────────┤    ├─────────┤    ├─────────┤    ├─────────┤      │
│  │Pos:[2,8]│    │Pos:[15] │    │Pos:[1,3,│    │Pos:[99] │      │
│  │         │    │         │    │ 7,12,20]│    │         │      │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘      │
│                                                                 │
│  TF = Term Frequency (how many times term appears in doc)      │
│  Pos = Positions (where in the document the term appears)      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

COMPRESSION: Delta Encoding
━━━━━━━━━━━━━━━━━━━━━━━━━━━

Original Doc IDs:    [1, 7, 42, 999]
Delta Encoded:       [1, 6, 35, 957]  ← Store differences!
                      │  │   │   │
                      │  7-1 42-7 999-42
                      │
                      First value stored as-is

Why? Smaller numbers = fewer bits = massive space savings!
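
In code, the encode/decode pair is tiny (a sketch; a real index would then compress the gaps further with variable-byte or bit-packed encoding):

# Delta (gap) encoding of a sorted postings list -- illustrative sketch.
def delta_encode(doc_ids):
    """[1, 7, 42, 999] -> [1, 6, 35, 957]: store gaps instead of absolute IDs."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(gaps):
    """Reverse the encoding by accumulating the gaps."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

ids = [1, 7, 42, 999]
gaps = delta_encode(ids)          # [1, 6, 35, 957]
assert delta_decode(gaps) == ids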

The Text Analysis Pipeline

Before text can be indexed, it must be processed:

TEXT ANALYSIS PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━

Original Text: "The Quick Brown Foxes are RUNNING!"
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  1. CHARACTER FILTERING                                     │
│     Remove/replace special characters, handle encoding      │
│     Result: "The Quick Brown Foxes are RUNNING"            │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  2. TOKENIZATION                                            │
│     Split text into individual tokens                       │
│     Result: ["The", "Quick", "Brown", "Foxes", "are",      │
│              "RUNNING"]                                     │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  3. LOWERCASING                                             │
│     Normalize case for consistent matching                  │
│     Result: ["the", "quick", "brown", "foxes", "are",      │
│              "running"]                                     │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  4. STOP WORD REMOVAL                                       │
│     Remove common words that don't help search              │
│     Result: ["quick", "brown", "foxes", "running"]         │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  5. STEMMING / LEMMATIZATION                                │
│     Reduce words to root form                               │
│                                                             │
│     STEMMING (Porter Algorithm - fast, approximate):        │
│     "foxes" → "fox", "running" → "run"                     │
│     Result: ["quick", "brown", "fox", "run"]               │
│                                                             │
│     LEMMATIZATION (dictionary lookup - accurate, slower):   │
│     "foxes" → "fox", "running" → "run", "better" → "good"  │
│     Considers part-of-speech and context                    │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  6. ADDITIONAL FILTERS (optional)                           │
│     - Synonym expansion: "quick" → ["quick", "fast"]       │
│     - N-gram generation: "fox" → ["fo", "ox", "fox"]       │
│     - Phonetic encoding: "smith" → "S530" (Soundex)        │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
              FINAL TOKENS: ["quick", "brown", "fox", "run"]
              (Ready to be added to inverted index)
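
A toy version of the pipeline in Python (the stop list and the crude suffix-stripping "stemmer" below are invented placeholders, not the real Porter algorithm):

# Toy analysis pipeline: character filter -> tokenize -> lowercase -> stop words -> stem.
import re

STOP_WORDS = {"the", "a", "an", "are", "is", "and", "or", "of", "to", "in"}

def crude_stem(token):
    for suffix in ("ing", "es", "s"):           # nothing like real Porter, just the idea
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    text = re.sub(r"[^\w\s]", " ", text)        # 1. character filtering
    tokens = text.split()                       # 2. tokenization
    tokens = [t.lower() for t in tokens]        # 3. lowercasing
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 4. stop word removal
    return [crude_stem(t) for t in tokens]      # 5. stemming (very rough)

print(analyze("The Quick Brown Foxes are RUNNING!"))
# ['quick', 'brown', 'fox', 'runn']  <- crude stripping is imperfect; Porter gives 'run'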

Stemming Algorithms: Porter Stemmer

PORTER STEMMER RULES (Simplified)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The algorithm applies rules in steps. Example rules:

STEP 1a: Plural forms
  ─────────────────────
  SSES → SS    (caresses → caress)
  IES  → I     (ponies → poni)
  SS   → SS    (caress → caress)  [no change]
  S    → ∅     (cats → cat)

STEP 1b: Past tense / gerund
  ─────────────────────────────
  (m>0) EED → EE   (agreed → agree, where m = the word's "measure")
  (*v*) ED  → ∅    (plastered → plaster, where *v* = stem contains a vowel)
  (*v*) ING → ∅    (motoring → motor)

STEP 2: Derivational suffixes
  ─────────────────────────────
  ATIONAL → ATE  (relational → relate)
  TIONAL  → TION (conditional → condition)
  IZER    → IZE  (digitizer → digitize)
  FULNESS → FUL  (hopefulness → hopeful)

Example walkthrough:
  "generalizations"
      │
      ▼ Step 1a: Remove S
  "generalization"
      │
      ▼ Step 2: IZATION → IZE
  "generalize"
      │
      ▼ Step 4: Remove IZE (if long enough)
  "general"

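Step 1a on its own is a handful of ordered suffix rules; a sketch:

# Porter step 1a only, as a sketch of how the rules are ordered and applied.
def porter_step_1a(word):
    if word.endswith("sses"):
        return word[:-2]          # SSES -> SS   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]          # IES  -> I    (ponies -> poni)
    if word.endswith("ss"):
        return word               # SS   -> SS   (caress unchanged)
    if word.endswith("s"):
        return word[:-1]          # S    -> none (cats -> cat)
    return word

for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", porter_step_1a(w))
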
Ranking Algorithms: TF-IDF and BM25

Finding documents is only half the battle. You need to RANK them by relevance.

TF-IDF: TERM FREQUENCY × INVERSE DOCUMENT FREQUENCY
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

INTUITION:
┌─────────────────────────────────────────────────────────────────┐
│  A word is important to a document if:                         │
│                                                                 │
│  1. It appears FREQUENTLY in that document (TF ↑)              │
│     "elephant" appears 10 times → probably about elephants     │
│                                                                 │
│  2. It appears RARELY across all documents (IDF ↑)             │
│     "the" appears everywhere → not useful for ranking          │
│     "elephant" is rare → very useful for ranking               │
└─────────────────────────────────────────────────────────────────┘

FORMULA:
                                        Total Documents
TF-IDF(t,d) = TF(t,d) × log( ─────────────────────────────── )
                              Documents containing term t

EXAMPLE:
━━━━━━━━
Collection: 10,000 documents
Document D: contains "elephant" 5 times, "the" 50 times
"elephant" appears in 10 documents total
"the" appears in 9,500 documents total

TF-IDF("elephant", D) = 5 × log(10000/10) = 5 × 3 = 15.0
TF-IDF("the", D)      = 50 × log(10000/9500) = 50 × 0.02 = 1.0

Despite "the" appearing 10× more, "elephant" scores 15× higher!
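
The same arithmetic as a quick runnable check (a sketch; it assumes a base-10 logarithm, which only rescales every score by a constant):

# Reproduce the example numbers above.
import math

def tf_idf(tf, df, n_docs=10_000):
    return tf * math.log10(n_docs / df)

print(tf_idf(tf=5, df=10))                  # "elephant": 5 * log10(1000) = 15.0
print(round(tf_idf(tf=50, df=9_500), 2))    # "the": 50 * log10(10000/9500) ≈ 1.11
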
BM25: THE MODERN STANDARD
━━━━━━━━━━━━━━━━━━━━━━━━━━

BM25 improves on TF-IDF with two key innovations:

1. TERM FREQUENCY SATURATION
   ──────────────────────────
   TF-IDF Problem: If "elephant" appears 100 times, is the doc
   really 10× more relevant than if it appears 10 times?

   BM25 Solution: Diminishing returns via saturation function

                     TF
   Saturated TF = ─────────
                  TF + k₁

   ┌────────────────────────────────────────────────────┐
   │                                                    │
   │  Score │     ............................          │
   │   ↑    │   ..                                      │
   │        │  .                                        │
   │        │ .     ← Asymptotic limit (saturation)    │
   │        │.                                          │
   │        │                                           │
   │        └──────────────────────────────────▶ TF     │
   │                                                    │
   │  After ~5-10 occurrences, additional mentions      │
   │  barely increase the score                         │
   └────────────────────────────────────────────────────┘

2. DOCUMENT LENGTH NORMALIZATION
   ─────────────────────────────
   TF-IDF Problem: Longer documents have more term occurrences
   just by having more words—unfair advantage!

   BM25 Solution: Normalize by document length

                     doc_length
   Length Factor = ──────────────────
                   avg_doc_length

   Short docs get a boost, long docs get penalized


BM25 FULL FORMULA:
━━━━━━━━━━━━━━━━━━

                                   TF(t,d) × (k₁ + 1)
Score(t,d) = IDF(t) × ─────────────────────────────────────────────
                      TF(t,d) + k₁ × (1 - b + b × |d|/avgdl)

Where:
  k₁ = 1.2 to 2.0 (term frequency saturation parameter)
  b  = 0.75 (length normalization parameter, 0=none, 1=full)
  |d| = length of document d
  avgdl = average document length in collection
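
A direct transcription of the formula (sketch; the IDF below is the smoothed variant Lucene uses, but plain log(N/df) also works for experiments):

# BM25 score of a single term in a single document.
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))   # smoothed IDF (Lucene-style)
    norm = 1 - b + b * (doc_len / avg_doc_len)             # document length normalization
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)

# A term appearing 5 times in a 1000-token doc, corpus of 10,000 docs, avgdl 450:
print(bm25_term_score(tf=5, df=100, n_docs=10_000, doc_len=1_000, avg_doc_len=450))  # ≈ 6.9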

Term Dictionary: Finite State Transducers (FST)

How do you look up terms efficiently when your vocabulary has millions of unique terms?

THE PROBLEM: TERM LOOKUP
━━━━━━━━━━━━━━━━━━━━━━━━

You have 10 million unique terms. For each query term, you need
to find its postings list. Options:

1. Hash Table: O(1) lookup, but HUGE memory footprint
   10M terms × 50 bytes avg = 500 MB just for keys!

2. Sorted Array + Binary Search: O(log n), still large
   Need to store every term fully expanded

3. Trie: Shares prefixes, but pointer overhead is massive

4. FST (Finite State Transducer): BEST OF ALL WORLDS
   - Shares prefixes AND suffixes
   - Maps terms to values (postings file offsets)
   - Incredibly compact
   - O(term_length) lookup

FST STRUCTURE
━━━━━━━━━━━━━

Terms to index: cat, car, card, care, careful, carefully

TRIE (shares prefixes only):
─────────────────────────────

     (root)
      └── c
           └── a
                ├── t ∅                         "cat"
                └── r ∅                         "car"
                     ├── d ∅                    "card"
                     └── e ∅                    "care"
                          └── f
                               └── u
                                    └── l ∅     "careful"
                                         └── l
                                              └── y ∅   "carefully"

∅ = a word ends at this node (six words, six separate endings)

FST (shares prefixes AND suffixes):
─────────────────────────────────────

     (S) ──c──▶ (1) ──a──▶ (2)
     (2) ──t──▶ ((F))                                 "cat"
     (2) ──r──▶ ((3))                                 "car"
     ((3)) ──d──▶ ((F))                               "card"
     ((3)) ──e──▶ ((4))                               "care"
     ((4)) ──f──▶ (5) ──u──▶ (6) ──l──▶ ((7))         "careful"
     ((7)) ──l──▶ (8) ──y──▶ ((F))                    "carefully"

(( )) = accepting state (a valid word ends here)

"cat", "card", and "carefully" all terminate in the SAME final state
((F)). A trie needs a separate leaf for every word ending; the FST
merges identical tails of the structure. On real vocabularies this
suffix sharing (thousands of words ending in "ing", "tion", "ly", ...)
is where most of the savings beyond a trie come from.

MEMORY SAVINGS: 38-52% smaller than trie for large vocabularies

FST AS A MAP: term → postings_offset
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Each arc can have an OUTPUT value. Lucene uses this to map
terms to file offsets in the postings file:

     c/0      a/0      t/100
(S) ─────▶ (1) ─────▶ (2) ─────▶ ((3))
                       │
                       │ r/200
                       ▼
                     ((4)) ←── "car" maps to offset 200

                       │ d/50
                       ▼
                     ((5)) ←── "card" maps to offset 200+50=250

Outputs accumulate along the path!
"cat" = 0 + 0 + 100 = 100
"car" = 0 + 0 + 200 = 200
"card" = 0 + 0 + 200 + 50 = 250

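The accumulation can be mimicked with an ordinary dict standing in for the FST's arcs (a sketch of output accumulation only; ARCS, FINAL_STATES and the offsets are invented to mirror the diagram above, and real FST construction and minimization are far more involved):

# state -> {char: (next_state, output)}
ARCS = {
    "S": {"c": (1, 0)},
    1:   {"a": (2, 0)},
    2:   {"t": (3, 100), "r": (4, 200)},
    4:   {"d": (5, 50)},
}
FINAL_STATES = {3, 4, 5}

def lookup(term):
    """Walk the arcs, summing outputs; the sum is the postings-file offset."""
    state, total = "S", 0
    for ch in term:
        arc = ARCS.get(state, {}).get(ch)
        if arc is None:
            return None                    # term not in the dictionary
        state, out = arc
        total += out
    return total if state in FINAL_STATES else None

print(lookup("cat"), lookup("car"), lookup("card"), lookup("cab"))
# 100 200 250 None
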
Skip Lists: Fast Postings Intersection

When you search for “quick AND brown”, you need to intersect two postings lists.

POSTINGS LIST INTERSECTION
━━━━━━━━━━━━━━━━━━━━━━━━━━━

Query: "quick" AND "brown"

Postings for "quick": [1, 5, 12, 45, 67, 89, 102, 156, 234, 567]
Postings for "brown": [3, 5, 22, 45, 89, 234, 345, 678]

NAIVE APPROACH: Compare every pair
Time: O(n × m) where n, m are list lengths

MERGE APPROACH: Walk both sorted lists together
Time: O(n + m)

  quick: [1, 5, 12, 45, 67, 89, 102, 156, 234, 567]
          ↑
  brown: [3, 5, 22, 45, 89, 234, 345, 678]
          ↑

  1 < 3: advance quick pointer
  5 = 5: MATCH! Add to result, advance both
  12 < 22: advance quick pointer
  ...

  Result: [5, 45, 89, 234]

SKIP LISTS: DO EVEN BETTER
━━━━━━━━━━━━━━━━━━━━━━━━━━━

Add "skip pointers" to jump ahead in long lists:

Postings for "the" (very common word):
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ 1 │ 2 │ 3 │ 5 │ 7 │ 8 │12 │15 │18 │22 │25 │...│
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
  │           │           │           │
  └───────────┴───────────┴───────────┘
       Skip pointers (every √n elements)

When intersecting with a rare term's list [50, 100, 150]:
- Looking for 50
- At position 1, skip pointer says "jump to 12"
- 12 < 50, skip says "jump to 25"
- 25 < 50, skip says "jump to ..."
- Much faster than linear scan!

Optimal skip distance: √n for list of length n
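
A sketch of the intersection, using the "quick"/"brown" postings from above: the plain two-pointer merge plus a √n jump when it is safe (real engines store explicit skip pointers inside the compressed postings rather than recomputing them like this):

# Sorted-list intersection with a simple sqrt(n) skip.
import math

def intersect(a, b):
    i = j = 0
    out = []
    skip = max(1, int(math.sqrt(len(a))))          # skip distance for list a
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            # jump ahead in list a while everything skipped stays below b[j]
            while i + skip < len(a) and a[i + skip] < b[j]:
                i += skip
            i += 1
        else:
            j += 1
    return out

quick = [1, 5, 12, 45, 67, 89, 102, 156, 234, 567]
brown = [3, 5, 22, 45, 89, 234, 345, 678]
print(intersect(quick, brown))    # [5, 45, 89, 234]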

Compression: Roaring Bitmaps

Modern search engines use sophisticated compression for postings:

ROARING BITMAPS
━━━━━━━━━━━━━━━

Problem: Storing doc IDs [1, 2, 3, 1000000, 1000001, 1000002]

Naive: Store each as 32-bit int = 24 bytes
Bitmap: 1 million bits = 125 KB (wasteful!)

ROARING: Hybrid approach
─────────────────────────

Divide 32-bit space into 65,536 "containers" of 65,536 values each:
- Container 0: values 0-65535
- Container 1: values 65536-131071
- Container 15: values 983040-1048575
- ...

Each container uses the best representation:

┌─────────────────────────────────────────────────────────────┐
│  If container has < 4096 values:                           │
│    Use ARRAY CONTAINER (sorted list of 16-bit offsets)     │
│                                                             │
│  If container has >= 4096 values:                          │
│    Use BITMAP CONTAINER (8KB bitmap)                       │
│                                                             │
│  If container has runs (1,2,3,4,5):                        │
│    Use RUN CONTAINER (store start + length)                │
└─────────────────────────────────────────────────────────────┘

Our example [1, 2, 3, 1000000, 1000001, 1000002]:

Container 0 (values 0-65535):
  Array: [1, 2, 3] → 6 bytes

Container 15 (values 983040-1048575):
  Array: [16960, 16961, 16962] → 6 bytes
  (These are 1000000-983040, etc.)

Total: ~12 bytes + overhead ≈ 20 bytes
vs 125 KB for naive bitmap!
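
The split is plain bit arithmetic on the 32-bit doc ID (a sketch that only builds array containers and ignores the bitmap and run cases):

# How a doc ID is split into (container key, low 16 bits) -- the core of Roaring.
from collections import defaultdict

def to_containers(doc_ids):
    containers = defaultdict(list)
    for v in sorted(doc_ids):
        high, low = v >> 16, v & 0xFFFF        # container key, 16-bit offset
        containers[high].append(low)
    return dict(containers)

print(to_containers([1, 2, 3, 1_000_000, 1_000_001, 1_000_002]))
# {0: [1, 2, 3], 15: [16960, 16961, 16962]}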

Distributed Search Architecture

ELASTICSEARCH / SOLRCLOUD ARCHITECTURE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

                         ┌─────────────┐
                         │   Client    │
                         └──────┬──────┘
                                │
                                ▼
                    ┌───────────────────────┐
                    │   Coordinating Node   │
                    │  (Routes & Aggregates)│
                    └───────────┬───────────┘
                                │
           ┌────────────────────┼────────────────────┐
           │                    │                    │
           ▼                    ▼                    ▼
    ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
    │   Node 1    │      │   Node 2    │      │   Node 3    │
    │             │      │             │      │             │
    │ ┌─────────┐ │      │ ┌─────────┐ │      │ ┌─────────┐ │
    │ │Shard 0P │ │      │ │Shard 1P │ │      │ │Shard 2P │ │
    │ │(primary)│ │      │ │(primary)│ │      │ │(primary)│ │
    │ └─────────┘ │      │ └─────────┘ │      │ └─────────┘ │
    │ ┌─────────┐ │      │ ┌─────────┐ │      │ ┌─────────┐ │
    │ │Shard 1R │ │      │ │Shard 2R │ │      │ │Shard 0R │ │
    │ │(replica)│ │      │ │(replica)│ │      │ │(replica)│ │
    │ └─────────┘ │      │ └─────────┘ │      │ └─────────┘ │
    └─────────────┘      └─────────────┘      └─────────────┘

P = Primary Shard (accepts writes)
R = Replica Shard (serves reads, provides redundancy)

SEARCH FLOW:
1. Query arrives at coordinating node
2. Coordinator sends the query to one copy (primary or replica) of each shard
3. Each shard executes query locally, returns top-N results
4. Coordinator merges results, re-ranks, returns final top-N


INDEXING FLOW:
                                ┌─────────────────────────┐
Document ───▶ hash(doc_id) ───▶│ shard = hash % n_shards │
                                └───────────┬─────────────┘
                                            │
                                            ▼
                              Route to correct primary shard
                                            │
                                            ▼
                              Primary indexes, then replicates
                              to all replica shards
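
A miniature of the routing and scatter-gather flow (a sketch: dicts stand in for shards, and raw term frequency stands in for a real relevance score):

N_SHARDS = 3
shards = [dict() for _ in range(N_SHARDS)]        # each shard: doc_id -> text

def index_doc(doc_id, text):
    # hash-based routing to one primary shard
    # (Python's hash() is randomized per process; real engines use a stable hash)
    shards[hash(doc_id) % N_SHARDS][doc_id] = text

def shard_search(shard, term, top_n):
    scored = [(text.lower().split().count(term), doc_id) for doc_id, text in shard.items()]
    return sorted((s for s in scored if s[0] > 0), reverse=True)[:top_n]

def search(term, top_n=2):
    hits = []
    for shard in shards:                          # scatter: every shard runs the query
        hits.extend(shard_search(shard, term, top_n))
    return sorted(hits, reverse=True)[:top_n]     # gather: merge local top-N globally

index_doc("a", "search engines use inverted indexes")
index_doc("b", "search is fast")
index_doc("c", "ranking and search and search quality")
print(search("search"))                           # [(2, 'c'), (1, 'b')]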

Concept Summary Table

Concept Cluster | What You Need to Internalize
Inverted Index | The fundamental data structure: maps terms → documents. Everything else is optimization on top of this.
Text Analysis | Raw text must be tokenized, normalized, stemmed. The same pipeline runs at index AND query time.
Postings Lists | Sorted lists of doc IDs with optional positions and term frequencies. Compression is critical.
Term Dictionary | How to look up millions of terms fast: tries, FSTs, or sorted arrays with binary search.
Ranking (TF-IDF/BM25) | Term frequency matters, but with diminishing returns. Rare terms are more valuable than common ones.
Boolean Operations | AND = intersection, OR = union. Skip lists accelerate intersection of long lists.
Compression | Delta encoding, variable-byte encoding, roaring bitmaps. Smaller = faster (less I/O).
Segments & Merging | Indexes are immutable segments. Updates create new segments; background merge consolidates.
Distributed Search | Sharding partitions data; replication provides fault tolerance. Query scatter-gather pattern.
Fuzzy Matching | Levenshtein distance, n-grams, and phonetic encoding handle typos and variations.

Deep Dive Reading by Concept

This section maps each concept to specific book chapters for deeper understanding. Read these before or alongside the projects.

Foundational Concepts

Concept | Book & Chapter
Inverted Index basics | “Introduction to Information Retrieval” by Manning, Raghavan, Schütze — Ch. 1-2 (free online at Stanford NLP)
Postings compression | “Introduction to Information Retrieval” — Ch. 5: “Index Compression”
Building an index | “Introduction to Information Retrieval” — Ch. 4: “Index Construction”
Text processing | “Introduction to Information Retrieval” — Ch. 2.2: “Tokenization” and Ch. 2.3: “Normalization”

Ranking and Scoring

Concept | Book & Chapter
TF-IDF fundamentals | “Introduction to Information Retrieval” — Ch. 6: “Scoring, term weighting and the vector space model”
BM25 in depth | “Search Engines: Information Retrieval in Practice” by Croft, Metzler, Strohman — Ch. 7 (free PDF available)
Probabilistic retrieval | “Information Retrieval: Implementing and Evaluating Search Engines” by Büttcher, Clarke, Cormack — Ch. 7

Data Structures

Concept | Book & Chapter
Tries and prefix trees | “Algorithms, Fourth Edition” by Sedgewick & Wayne — Ch. 5.2: “Tries”
Skip lists | “Algorithms, Fourth Edition” by Sedgewick & Wayne — Ch. 3.5, or see the original Pugh paper
B-trees (for disk-based indexes) | “Introduction to Algorithms” (CLRS) — Ch. 18: “B-Trees”
Finite State Transducers | Mike McCandless’s blog series: “Using Finite State Transducers in Lucene”

Compression

Concept | Book & Chapter
Variable-byte encoding | “Introduction to Information Retrieval” — Ch. 5.3: “Variable byte codes”
Delta/Gap encoding | “Managing Gigabytes” by Witten, Moffat, Bell — Ch. 3
Roaring Bitmaps | Daniel Lemire’s paper: “Roaring Bitmaps: Implementation of an Optimized Software Library”

Distributed Systems

Concept | Book & Chapter
Distributed indexing | “Introduction to Information Retrieval” — Ch. 4.4: “Distributed indexing”
Sharding strategies | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 6: “Partitioning”
Replication | “Designing Data-Intensive Applications” — Ch. 5: “Replication”

Essential Reading Order

Phase 1: Foundations (Week 1-2)

  1. “Introduction to Information Retrieval” Ch. 1-2 (inverted index basics)
  2. “Introduction to Information Retrieval” Ch. 5 (compression)
  3. “Introduction to Information Retrieval” Ch. 6 (TF-IDF, vector space model)

Phase 2: Implementation Details (Week 3-4)

  1. “Search Engines: Information Retrieval in Practice” Ch. 4-5 (indexing)
  2. “Information Retrieval: Implementing and Evaluating Search Engines” Ch. 5-6 (query processing)
  3. “Algorithms” Ch. 5.2 (tries)

Phase 3: Production Systems (Week 5-6)

  1. “Designing Data-Intensive Applications” Ch. 5-6 (replication, sharding)
  2. Elasticsearch or Solr documentation (practical application)
  3. Lucene source code exploration

Projects

The following projects will take you from understanding the basics to being able to build production-grade search systems.


Project 1: Basic Inverted Index from Scratch

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: C, Go, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Data Structures / Information Retrieval
  • Software or Tool: Custom search library
  • Main Book: “Introduction to Information Retrieval” by Manning, Raghavan, Schütze

What you’ll build: A command-line tool that indexes a directory of text files and allows boolean queries (AND, OR, NOT) to find matching documents.

Why it teaches full-text search: This is the absolute foundation. You cannot understand Elasticsearch or Lucene without first building the core data structure yourself. You’ll feel the pain of naive approaches and appreciate why the inverted index exists.

Core challenges you’ll face:

  • Designing the index data structure → maps to understanding hash maps vs sorted arrays for term lookup
  • Tokenizing text correctly → maps to handling punctuation, case, whitespace edge cases
  • Implementing boolean query logic → maps to set operations on postings lists
  • Persisting the index to disk → maps to serialization and file format design

Key Concepts:

  • Inverted Index Structure: “Introduction to Information Retrieval” Ch. 1 - Manning et al.
  • Tokenization: “Introduction to Information Retrieval” Ch. 2.2
  • Boolean Retrieval: “Introduction to Information Retrieval” Ch. 1.3

Difficulty: Beginner | Time estimate: Weekend | Prerequisites: Basic programming, understanding of hash maps/dictionaries


Real World Outcome

You’ll have a working search engine that can index thousands of files and answer queries instantly. Running it will feel like magic the first time:

Example Output:

$ ./minisearch index ~/Documents/books/
Indexing 1,247 files...
Found 89,432 unique terms
Index written to ./search.idx (2.3 MB)
Indexing complete in 3.2 seconds

$ ./minisearch query "python AND machine AND learning"
Found 23 matching documents:

  1. ~/Documents/books/ml_intro.txt
     "...using Python for machine learning applications..."

  2. ~/Documents/books/deep_learning_guide.txt
     "...Python libraries for machine learning include..."

  3. ~/Documents/books/data_science.txt
     "...machine learning pipelines in Python..."

Query executed in 0.003 seconds

$ ./minisearch query "rust OR golang"
Found 156 matching documents...

$ ./minisearch query "javascript NOT nodejs"
Found 42 matching documents...

The Core Question You’re Answering

“Why is searching through millions of documents nearly instant, while grep takes forever?”

Before you write any code, sit with this question. The naive approach (scan every document) works but is painfully slow. The inverted index flips the problem: instead of “for each document, check all terms”, it becomes “for each query term, look up matching documents directly.”


Concepts You Must Understand First

Stop and research these before coding:

  1. Hash Tables vs Sorted Arrays
    • What is the lookup time for each?
    • When would you prefer one over the other?
    • How do you handle collisions in a hash table?
    • Book Reference: “Algorithms” Ch. 3.4 - Sedgewick & Wayne
  2. Set Operations
    • How do you compute intersection of two sorted lists?
    • What’s the time complexity of union vs intersection?
    • Why do postings lists need to be sorted?
    • Book Reference: “Introduction to Information Retrieval” Ch. 1.3
  3. File I/O and Serialization
    • How do you efficiently read large files?
    • What format should your index use on disk?
    • Book Reference: Language-specific documentation

Questions to Guide Your Design

Before implementing, think through these:

  1. Data Structure Choice
    • Should the term dictionary be a HashMap or a BTreeMap?
    • How will you store document IDs in postings lists?
    • What metadata do you need per document?
  2. Tokenization Decisions
    • How do you handle “don’t” - one token or two?
    • Should numbers be indexed?
    • What about email addresses and URLs?
  3. Query Parsing
    • How will you parse “python AND (web OR api)”?
    • Do you need operator precedence?
    • How do you handle malformed queries?

Thinking Exercise

Trace an Index Operation

Before coding, manually trace indexing these two documents:

Doc 1: "The quick brown fox"
Doc 2: "The lazy brown dog"

Draw the inverted index that results. For each term, list the document IDs.

Questions while tracing:

  • Which terms appear in both documents?
  • If you search for “brown AND quick”, which document(s) match?
  • If you search for “brown AND lazy”, which document(s) match?
  • What happens if you search for “cat”?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What is an inverted index and why is it faster than linear scan?”
  2. “How would you implement AND queries on two postings lists?”
  3. “What’s the time complexity of intersecting two sorted lists?”
  4. “How does your index handle documents being added or deleted?”
  5. “What’s the difference between a term dictionary and a postings list?”

Hints in Layers

Hint 1: Starting Point Start with in-memory only. Don’t worry about disk persistence until boolean queries work perfectly in memory.

Hint 2: Data Structure Use HashMap<String, Vec<u32>> where the key is a term and the value is a sorted list of document IDs.

Hint 3: Intersection Algorithm To intersect two sorted lists, use two pointers starting at the beginning. Advance the pointer pointing to the smaller value. When they’re equal, you found a match.

Hint 4: Testing Create a small test corpus of 5-10 documents with known content. Write queries where you know the expected results.


Books That Will Help

Topic | Book | Chapter
Inverted index fundamentals | “Introduction to Information Retrieval” | Ch. 1-2
Hash tables | “Algorithms” by Sedgewick | Ch. 3.4
File I/O in Rust | “The Rust Programming Language” | Ch. 12

Learning milestones:

  1. Index builds correctly → You understand the inverted index structure
  2. Boolean queries return correct results → You understand set operations
  3. Index persists to disk and reloads → You understand serialization

Project 2: Text Analyzer with Stemming and Stop Words

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Rust, Go, Java
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: NLP / Text Processing
  • Software or Tool: Text processing library
  • Main Book: “Introduction to Information Retrieval” by Manning, Raghavan, Schütze

What you’ll build: A text analysis pipeline that tokenizes text, removes stop words, and applies stemming—then visualize how it transforms documents and improves search recall.

Why it teaches full-text search: Search quality depends heavily on text normalization. Without stemming, a search for “running” won’t find documents containing “run” or “runs”. This project makes you understand why Elasticsearch has analyzers.

Core challenges you’ll face:

  • Implementing the Porter Stemmer → maps to understanding rule-based string transformations
  • Building a configurable stop word list → maps to balancing precision vs recall
  • Handling edge cases in tokenization → maps to real-world text messiness
  • Measuring the impact on search quality → maps to understanding why we do this

Key Concepts:

  • Tokenization: “Introduction to Information Retrieval” Ch. 2.2
  • Stemming Algorithms: “Introduction to Information Retrieval” Ch. 2.3
  • Stop Words: “Introduction to Information Retrieval” Ch. 2.2.2

Difficulty: Intermediate | Time estimate: 1 week | Prerequisites: Project 1 (Basic Inverted Index)


Real World Outcome

You’ll see exactly how text analysis transforms raw text, and measure the impact on search:

Example Output:

$ ./analyzer analyze "The cats are running quickly through the gardens!"

=== TEXT ANALYSIS PIPELINE ===

Original: "The cats are running quickly through the gardens!"

Step 1 - Tokenization:
  ["The", "cats", "are", "running", "quickly", "through", "the", "gardens", "!"]

Step 2 - Lowercasing:
  ["the", "cats", "are", "running", "quickly", "through", "the", "gardens", "!"]

Step 3 - Punctuation Removal:
  ["the", "cats", "are", "running", "quickly", "through", "the", "gardens"]

Step 4 - Stop Word Removal:
  Removed: ["the", "are", "through", "the"]
  Remaining: ["cats", "running", "quickly", "gardens"]

Step 5 - Stemming (Porter):
  "cats" → "cat"
  "running" → "run"
  "quickly" → "quick"
  "gardens" → "garden"

Final Tokens: ["cat", "run", "quick", "garden"]

$ ./analyzer compare-search "run" --with-stemming --without-stemming

=== SEARCH COMPARISON ===

Query: "run"

WITHOUT Stemming:
  Matches: 12 documents
  - doc1: "...to run the program..."
  - doc2: "...we run tests..."

WITH Stemming:
  Matches: 47 documents (291% more!)
  - doc1: "...to run the program..."
  - doc2: "...we run tests..."
  - doc3: "...he was running late..."    ← NEW MATCH
  - doc4: "...the runner crossed..."     ← NEW MATCH
  - doc5: "...runs every morning..."     ← NEW MATCH

The Core Question You’re Answering

“Why does searching for ‘run’ also find documents containing ‘running’, ‘runs’, and ‘runner’?”

The answer is stemming—reducing words to their root form so that morphological variants match. But be careful: over-aggressive stemming can hurt precision (e.g., “universal” and “university” both stem to “univers”).


Concepts You Must Understand First

Stop and research these before coding:

  1. The Porter Stemmer Algorithm
    • What are the five steps of the algorithm?
    • What is a “measure” of a word?
    • Why does suffix order matter?
    • Book Reference: Original Porter paper (1980) or “Introduction to IR” Ch. 2.3
  2. Stop Words Trade-offs
    • Why are stop words removed?
    • What happens if you search for “to be or not to be”?
    • When should you NOT remove stop words?
    • Book Reference: “Introduction to Information Retrieval” Ch. 2.2.2
  3. Unicode and Text Encoding
    • How do you handle accented characters?
    • What about CJK (Chinese, Japanese, Korean) text?
    • Book Reference: “Fluent Python” Ch. 4 (for Python) or equivalent

Questions to Guide Your Design

Before implementing, think through these:

  1. Pipeline Architecture
    • Should each step be a separate function that chains?
    • How do you make the pipeline configurable?
    • Should you preserve the original token for debugging?
  2. Stemmer Implementation
    • Should you implement Porter from scratch or use a library?
    • How do you test correctness?
    • What about words that shouldn’t be stemmed?
  3. Language Support
    • Is your tokenizer English-only?
    • How would you handle German compound words?
    • What about right-to-left languages?

Thinking Exercise

Manual Stemming Practice

Apply the Porter Stemmer rules to these words (trace each step):

"generalizations"
"relational"
"conditional"
"engineering"
"adjustable"

Questions while tracing:

  • Which suffix is removed first?
  • Does the word meet the “measure” requirement for each rule?
  • What is the final stem for each word?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What’s the difference between stemming and lemmatization?”
  2. “Why might you choose not to use stemming?”
  3. “How do stop words affect search precision and recall?”
  4. “What problems does aggressive tokenization cause?”
  5. “How would you handle multilingual documents?”

Hints in Layers

Hint 1: Start Simple Implement tokenization first with just whitespace splitting. Add complexity incrementally.

Hint 2: Use Reference Data The NLTK library has stop word lists and a Porter stemmer implementation you can use to verify your results.

Hint 3: Porter Stemmer Don’t implement Porter from scratch on your first try. Use a library first, then implement your own and compare outputs.

Hint 4: Measure Impact Build a test corpus, run queries with and without each pipeline step, and measure precision/recall changes.


Books That Will Help

Topic | Book | Chapter
Text normalization | “Introduction to Information Retrieval” | Ch. 2
Porter Stemmer | Original Porter paper (1980) | Full paper
NLP fundamentals | “Speech and Language Processing” by Jurafsky | Ch. 2

Learning milestones:

  1. Tokenizer handles edge cases correctly → You understand text segmentation
  2. Stemmer produces correct outputs → You understand morphological analysis
  3. Search recall improves measurably → You understand why analysis matters

Project 3: TF-IDF Ranking Engine

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Rust, Go, Julia
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Information Retrieval / Linear Algebra
  • Software or Tool: Search ranking library
  • Main Book: “Introduction to Information Retrieval” by Manning, Raghavan, Schütze

What you’ll build: A search engine that ranks results by relevance using TF-IDF scoring, showing why some documents match better than others.

Why it teaches full-text search: Boolean search gives you matching documents, but in what ORDER? TF-IDF is the foundation of relevance ranking—understanding it unlocks understanding of BM25, which powers Elasticsearch and most modern search engines.

Core challenges you’ll face:

  • Computing term frequencies efficiently → maps to understanding index augmentation
  • Calculating IDF across the corpus → maps to global vs local statistics
  • Ranking multi-term queries → maps to vector space model fundamentals
  • Handling the “IDF explosion” for rare terms → maps to practical scoring issues

Key Concepts:

  • TF-IDF Formula: “Introduction to Information Retrieval” Ch. 6.2
  • Vector Space Model: “Introduction to Information Retrieval” Ch. 6.3
  • Cosine Similarity: “Introduction to Information Retrieval” Ch. 6.3.3

Difficulty: Intermediate | Time estimate: 1 week | Prerequisites: Project 1 (Basic Inverted Index), basic linear algebra


Real World Outcome

You’ll see exactly WHY one document ranks higher than another, with full scoring breakdowns:

Example Output:

$ ./tfidf-search query "python machine learning tutorial"

=== QUERY: "python machine learning tutorial" ===

IDF Values (across 10,000 documents, log base 10):
  "python"   → IDF: 1.30 (appears in 500 docs)
  "machine"  → IDF: 2.46 (appears in 35 docs)
  "learning" → IDF: 1.89 (appears in 130 docs)
  "tutorial" → IDF: 3.10 (appears in 8 docs)   ← Most distinctive!

Top 5 Results:

#1. python_ml_tutorial.txt (Score: 43.34)
    ├── TF("python")=8   × IDF(1.30) = 10.40
    ├── TF("machine")=5  × IDF(2.46) = 12.30
    ├── TF("learning")=6 × IDF(1.89) = 11.34
    └── TF("tutorial")=3 × IDF(3.10) = 9.30

#2. ml_basics.txt (Score: 37.97)
    ├── TF("python")=2   × IDF(1.30) = 2.60
    ├── TF("machine")=9  × IDF(2.46) = 22.14  ← High TF!
    ├── TF("learning")=7 × IDF(1.89) = 13.23
    └── TF("tutorial")=0 × IDF(3.10) = 0.00   ← Missing term

#3. python_intro.txt (Score: 9.87)
    ...

Note: "tutorial" has highest IDF because it's rare in the corpus.
      Documents mentioning "tutorial" get a significant boost!

The Core Question You’re Answering

“If two documents both contain my search term, which one is MORE relevant?”

TF-IDF answers this by combining two intuitions: (1) the more times a term appears in a document, the more relevant it is, and (2) the rarer a term is across all documents, the more distinctive and valuable it is for ranking.


Concepts You Must Understand First

Stop and research these before coding:

  1. Logarithms in IDF
    • Why do we use log in the IDF formula?
    • What happens if we don’t use log?
    • What’s the base of the logarithm?
    • Book Reference: “Introduction to Information Retrieval” Ch. 6.2.1
  2. Vector Space Model
    • How is a document represented as a vector?
    • What is cosine similarity and why use it?
    • Why is Euclidean distance not ideal for documents?
    • Book Reference: “Introduction to Information Retrieval” Ch. 6.3
  3. Term Frequency Variants
    • What is raw TF vs log TF?
    • What is augmented TF?
    • When should you use each?
    • Book Reference: “Introduction to Information Retrieval” Ch. 6.4

Questions to Guide Your Design

Before implementing, think through these:

  1. Storage Decisions
    • Where do you store term frequencies?
    • Do you precompute IDF or calculate on the fly?
    • How do you handle documents of very different lengths?
  2. Multi-term Queries
    • How do you combine scores for multiple query terms?
    • Should all terms be required or just some?
    • How do you handle query terms not in the index?
  3. Efficiency
    • Can you score documents incrementally?
    • How do you avoid scoring every document in the corpus?
    • What’s your top-K extraction strategy?

Thinking Exercise

Manual TF-IDF Calculation

Given this tiny corpus:

Doc1: "machine learning machine learning algorithms"
Doc2: "deep learning neural networks"
Doc3: "machine learning for beginners"

Calculate the TF-IDF score for the query “machine learning” for each document.

Questions while calculating:

  • What is TF(“machine”) for each document?
  • What is IDF(“machine”) and IDF(“learning”)?
  • Which document scores highest and why?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Explain TF-IDF. Why the log in IDF?”
  2. “What are the limitations of TF-IDF?”
  3. “How does document length affect TF-IDF scores?”
  4. “What is the vector space model?”
  5. “How would you efficiently find the top-10 results from a million documents?”

Hints in Layers

Hint 1: Precompute What You Can IDF values don’t change unless the corpus changes. Compute them once and store them.

Hint 2: Augment Your Index Store term frequencies in your postings list alongside document IDs: (doc_id, term_frequency).

Hint 3: Top-K Efficiency Use a min-heap of size K. Only keep the K highest-scoring documents as you scan.

Hint 4: Score Decomposition For debugging, output the score contribution of each term separately.
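
The top-K idea from Hint 3 as a sketch (the scores and doc IDs below are invented):

# Top-K selection with a fixed-size min-heap.
import heapq

def top_k(scored_docs, k):
    """scored_docs: iterable of (score, doc_id). Keeps only the k best seen so far."""
    heap = []
    for score, doc_id in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:               # better than the current worst kept
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)          # best first

print(top_k([(0.4, "a"), (2.1, "b"), (1.3, "c"), (0.9, "d")], k=2))
# [(2.1, 'b'), (1.3, 'c')]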


Books That Will Help

Topic | Book | Chapter
TF-IDF fundamentals | “Introduction to Information Retrieval” | Ch. 6
Vector space model | “Introduction to Information Retrieval” | Ch. 6.3
Math foundations | “Linear Algebra Done Right” by Axler | Ch. 1-3

Learning milestones:

  1. Scores match manual calculation → You understand the TF-IDF formula
  2. Rankings feel intuitive → You understand the intuition behind TF-IDF
  3. You can explain why doc A ranks above doc B → You truly get it

Project 4: BM25 Scoring Implementation

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: Go, Python, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Information Retrieval / Probabilistic Models
  • Software or Tool: Production search library
  • Main Book: “Search Engines: Information Retrieval in Practice” by Croft, Metzler, Strohman

What you’ll build: A BM25 scoring engine that demonstrates the improvements over TF-IDF—term frequency saturation and document length normalization—with tunable parameters.

Why it teaches full-text search: BM25 is what Elasticsearch, Lucene, and most modern search engines actually use. Understanding the k1 and b parameters gives you power to tune search quality in production systems.

Core challenges you’ll face:

  • Implementing term frequency saturation → maps to understanding asymptotic behavior
  • Computing average document length → maps to corpus statistics management
  • Tuning k1 and b parameters → maps to understanding their effects on ranking
  • Comparing BM25 vs TF-IDF empirically → maps to evaluation methodology

Key Concepts:

  • BM25 Formula: “Search Engines: Information Retrieval in Practice” Ch. 7.1
  • Probabilistic Retrieval: “Introduction to Information Retrieval” Ch. 11
  • Parameter Tuning: Elastic blog: “Practical BM25”

Difficulty: Advanced | Time estimate: 1-2 weeks | Prerequisites: Project 3 (TF-IDF Ranking Engine)


Real World Outcome

You’ll see exactly how BM25 differs from TF-IDF and be able to tune parameters:

Example Output:

$ ./bm25-search query "python programming" --show-comparison

=== BM25 vs TF-IDF Comparison ===

Corpus Stats:
  Total documents: 10,000
  Average doc length: 450 tokens

BM25 Parameters: k1=1.2, b=0.75

Query: "python programming"

Document: python_guide.txt
  Length: 2,000 tokens (4.4x average - LONG document)
  TF("python"): 50 occurrences, IDF("python"): 2.30

  TF-IDF Score: 115.0
    └── Problem: Long doc has high TF, unfair advantage!

  BM25 Score: 4.66
    ├── Length penalty: 1 - 0.75 + 0.75 × (2000/450) = 3.58
    └── 2.30 × (50 × 2.2) / (50 + 1.2 × 3.58) = 4.66   ← TF is saturated

Document: python_tips.txt
  Length: 200 tokens (0.44x average - SHORT document)
  TF("python"): 10 occurrences, IDF("python"): 2.30

  TF-IDF Score: 23.0
    └── Lower because fewer occurrences

  BM25 Score: 4.73
    ├── Length BOOST: 1 - 0.75 + 0.75 × (200/450) = 0.58
    └── 2.30 × (10 × 2.2) / (10 + 1.2 × 0.58) = 4.73

Result: BM25 ranks the shorter, more focused document higher!

$ ./bm25-search tune --corpus ./test-data --queries ./test-queries

=== PARAMETER TUNING ===

Testing k1 values: [0.5, 1.0, 1.2, 1.5, 2.0]
Testing b values: [0.0, 0.25, 0.5, 0.75, 1.0]

Best parameters found:
  k1 = 1.2 (Mean Average Precision: 0.723)
  b  = 0.75 (Mean Average Precision: 0.723)

Parameter sensitivity analysis:
  k1: Low sensitivity (±0.02 MAP change per 0.5 k1 change)
  b:  High sensitivity (±0.08 MAP change per 0.25 b change)

The Core Question You’re Answering

“Why does Elasticsearch use BM25 instead of TF-IDF, and what do those k1 and b parameters actually control?”

BM25 fixes two TF-IDF problems: (1) TF grows linearly forever, but a word appearing 100 times isn’t 10x more relevant than appearing 10 times, and (2) longer documents naturally have higher TF, giving them an unfair advantage.


Concepts You Must Understand First

Stop and research these before coding:

  1. Term Frequency Saturation
    • What is the saturation curve for TF/(TF + k1)?
    • What happens as k1 → 0 or k1 → ∞?
    • How does k1 affect short vs long queries?
    • Book Reference: “Practical BM25” blog series by Elastic
  2. Length Normalization
    • Why is document length normalization important?
    • What does b=0 mean? What about b=1?
    • When would you want less length normalization?
    • Book Reference: “Search Engines: Information Retrieval in Practice” Ch. 7
  3. Probabilistic Retrieval Foundations
    • Where does BM25 come from theoretically?
    • What is the Robertson-Spärck Jones framework?
    • Book Reference: “Introduction to Information Retrieval” Ch. 11

Questions to Guide Your Design

Before implementing, think through these:

  1. Parameter Storage
    • Are k1 and b per-field or global?
    • How do you handle different fields with different characteristics?
    • Should parameters be tunable at query time?
  2. Corpus Statistics
    • Where do you store average document length?
    • What happens when you add/remove documents?
    • Do you recompute on every update?
  3. Evaluation
    • How do you know BM25 is “better” than TF-IDF?
    • What test queries and relevance judgments do you need?
    • How do you compute Mean Average Precision?

Thinking Exercise

Parameter Effect Analysis

Given these BM25 parameters and document:

  • k1 = 1.5
  • b = 0.75
  • avgdl = 500
  • Document length = 1000 (2x average)
  • TF = 10

Calculate the BM25 term score step by step.

Now recalculate with:

  • Same document but length = 250 (0.5x average)

Questions while calculating:

  • How much does length normalization change the score?
  • What if b = 0 (no length normalization)?
  • At what TF does saturation become significant?
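
A tiny calculator to check your arithmetic afterwards (a sketch; it computes only the TF part of the score, with IDF left out):

def bm25_tf_part(tf, doc_len, avgdl, k1=1.5, b=0.75):
    norm = 1 - b + b * (doc_len / avgdl)
    return (tf * (k1 + 1)) / (tf + k1 * norm)

print(round(bm25_tf_part(tf=10, doc_len=1000, avgdl=500), 2))          # 1.98 (2x average length)
print(round(bm25_tf_part(tf=10, doc_len=250,  avgdl=500), 2))          # 2.29 (0.5x average length)
print(round(bm25_tf_part(tf=10, doc_len=1000, avgdl=500, b=0.0), 2))   # 2.17 (no length normalization)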

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What are the k1 and b parameters in BM25 and what do they control?”
  2. “Why does BM25 use term frequency saturation?”
  3. “When would you increase or decrease the b parameter?”
  4. “How do you tune BM25 parameters for a specific corpus?”
  5. “What’s the relationship between BM25 and TF-IDF?”

Hints in Layers

Hint 1: Start with TF-IDF Build on your TF-IDF implementation. BM25 is a modification, not a rewrite.

Hint 2: Compute avgdl Once Store average document length when building the index. Update incrementally if possible.

Hint 3: Visualize Saturation Plot the TF saturation curve for different k1 values. It builds intuition.

Hint 4: Use TREC Data The TREC conference provides standard test collections with relevance judgments for evaluation.


Books That Will Help

Topic | Book | Chapter
BM25 derivation | “Search Engines: Information Retrieval in Practice” | Ch. 7
Probabilistic IR | “Introduction to Information Retrieval” | Ch. 11
Evaluation metrics | “Introduction to Information Retrieval” | Ch. 8

Learning milestones:

  1. BM25 formula implemented correctly → You understand the math
  2. You can explain what k1 and b do → You have intuition
  3. You can tune parameters on a test corpus → You’re ready for production

Project 5: Positional Index and Phrase Queries

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Data Structures / Information Retrieval
  • Software or Tool: Search engine library
  • Main Book: “Introduction to Information Retrieval” by Manning, Raghavan, Schütze

What you’ll build: An inverted index that stores word positions, enabling phrase queries (“machine learning”), proximity queries (machine NEAR learning), and highlighting of matching text.

Why it teaches full-text search: Basic inverted indexes only tell you which documents contain a word—not WHERE in the document. Positional indexes unlock phrase search, which is essential for precise queries. This is how Google knows “New York” is different from documents containing “new” and “York” separately.

Core challenges you’ll face:

  • Storing positions efficiently → maps to understanding variable-length encoding
  • Implementing phrase matching → maps to positional intersection algorithms
  • Handling near/proximity queries → maps to slop and distance calculations
  • Highlighting matched text → maps to returning position metadata to UI

Key Concepts:

  • Positional Indexes: “Introduction to Information Retrieval” Ch. 2.4
  • Phrase Queries: “Introduction to Information Retrieval” Ch. 2.4.2
  • Position Compression: “Introduction to Information Retrieval” Ch. 5.3

Difficulty: Advanced | Time estimate: 1-2 weeks | Prerequisites: Project 1 (Basic Inverted Index), Project 4 (BM25)


Real World Outcome

You’ll be able to search for exact phrases and see exactly where they appear:

Example Output:

$ ./phrase-search index ~/Documents/books/

Indexing with positions enabled...
1,247 files indexed
89,432 unique terms
Average positions per term: 4.7
Index size: 8.9 MB (with positions) vs 2.3 MB (without)

$ ./phrase-search query '"machine learning"'

=== PHRASE QUERY: "machine learning" ===

Found in 47 documents (exact phrase match):

#1. ml_intro.txt (Score: 12.4)
    Position: offset 234, line 15
    Context: "...the field of [machine learning] has grown..."

#2. deep_learning.txt (Score: 11.2)
    Position: offset 89, line 5
    Context: "...[machine learning] algorithms can..."

$ ./phrase-search query "machine NEAR/3 learning"

=== PROXIMITY QUERY: machine within 3 words of learning ===

Found in 89 documents:

#1. ai_basics.txt
    "machine-based learning" (distance: 1)
    "machine and deep learning" (distance: 2)

#2. tutorial.txt
    "learning about machine" (distance: 2, reversed)

The Core Question You’re Answering

“How does Google know that searching for ‘machine learning’ shouldn’t return every document that has ‘machine’ somewhere and ‘learning’ somewhere else?”

The answer is positional indexing. By storing WHERE each term appears in each document, you can verify that terms appear consecutively (for phrases) or within a specified distance (for proximity queries).


Concepts You Must Understand First

Stop and research these before coding:

  1. Positional Postings Structure
    • What does a positional posting look like?
    • How much larger is a positional index vs non-positional?
    • When are positions worth the space cost?
    • Book Reference: “Introduction to Information Retrieval” Ch. 2.4
  2. Phrase Matching Algorithm
    • How do you check if positions are consecutive?
    • What’s the time complexity of phrase matching?
    • How do you handle multi-word phrases efficiently?
    • Book Reference: “Introduction to Information Retrieval” Ch. 2.4.2
  3. Variable-Length Integer Encoding
    • Why use variable-length encoding for positions?
    • What is delta encoding for positions?
    • How much space does this save?
    • Book Reference: “Introduction to Information Retrieval” Ch. 5.3

Questions to Guide Your Design

Before implementing, think through these:

  1. Data Structure Design
    • How do you represent positions: linked list, array, or something else?
    • Do you store absolute positions or deltas?
    • What about positions across different fields?
  2. Query Processing
    • How do you parse phrase queries (with quotes)?
    • How do you handle NEAR/n queries?
    • What about mixed boolean and phrase queries?
  3. Memory and Disk
    • Positional indexes are much larger—how do you handle memory pressure?
    • Should you compress positions differently than doc IDs?

Thinking Exercise

Manual Phrase Matching

Given these positional postings:

"machine" in Doc1: positions [5, 23, 89]
"learning" in Doc1: positions [6, 45, 90]

Find all positions where “machine learning” appears as a phrase.

Questions while tracing:

  • Which position pairs satisfy pos_learning = pos_machine + 1?
  • What about “learning machine” (reversed)?
  • What about “machine … learning” with distance ≤ 3?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “How do positional indexes differ from regular inverted indexes?”
  2. “What’s the time complexity of phrase query matching?”
  3. “How much larger is a positional index, typically?”
  4. “How would you implement proximity/NEAR queries?”
  5. “How do you handle phrase queries that span field boundaries?”

Hints in Layers

Hint 1: Start Simple First implement phrases for single-document testing before scaling to a corpus.

Hint 2: Position Representation Store positions as: HashMap<Term, HashMap<DocId, Vec<Position>>>.

Hint 3: Phrase Algorithm For phrase “A B C”: find docs containing all three, then for each doc, find position triplets where pos_B = pos_A + 1 AND pos_C = pos_B + 1.

Hint 4: Delta Encoding Store position deltas: [5, 23, 89] → [5, 18, 66]. Smaller numbers compress better.


Books That Will Help

Topic Book Chapter
Positional indexes “Introduction to Information Retrieval” Ch. 2.4
Position compression “Introduction to Information Retrieval” Ch. 5.3
Phrase queries “Information Retrieval: Implementing and Evaluating” Ch. 6

Learning milestones:

  1. Phrase queries work correctly → You understand positional matching
  2. Proximity queries work → You understand distance calculations
  3. Highlighting shows exact positions → You understand full position propagation

Project 6: Index Compression with Variable-Byte and Delta Encoding

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, Zig
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Compression / Information Theory
  • Software or Tool: Compression library
  • Main Book: “Managing Gigabytes” by Witten, Moffat, Bell

What you’ll build: A compression layer for your inverted index that reduces size by 50-80% while maintaining fast decompression, using variable-byte encoding, delta encoding, and optionally roaring bitmaps.

Why it teaches full-text search: Real search engines don’t store raw integers—they’d run out of memory! Understanding compression is understanding how Lucene, Elasticsearch, and PostgreSQL GIN indexes achieve their performance. Compression isn’t just about space—smaller data means faster I/O.

Core challenges you’ll face:

  • Implementing variable-byte encoding → maps to bit manipulation and byte-level operations
  • Applying delta encoding to sorted lists → maps to understanding why sorted data compresses well
  • Benchmarking compression vs decompression speed → maps to understanding real-world trade-offs
  • Integrating with your index → maps to API design for transparent compression

Key Concepts:

  • Variable-Byte Encoding: “Introduction to Information Retrieval” Ch. 5.3.1
  • Delta/Gap Encoding: “Introduction to Information Retrieval” Ch. 5.3
  • Gamma and Golomb Codes: “Managing Gigabytes” Ch. 3

Difficulty: Expert Time estimate: 2 weeks Prerequisites: Project 1, understanding of binary representations


Real World Outcome

You’ll see dramatic space savings and understand the speed trade-offs:

Example Output:

$ ./index-compress analyze ./my_index

=== INDEX COMPRESSION ANALYSIS ===

Original index: 45.2 MB
Vocabulary: 125,000 terms
Total postings: 8.3 million

Compression Results:

Method 1: Variable-Byte Encoding
  Compressed size: 23.1 MB (48.9% reduction)
  Encode speed: 45 MB/s
  Decode speed: 380 MB/s

Method 2: Delta + Variable-Byte
  Compressed size: 15.8 MB (65.1% reduction)
  Encode speed: 42 MB/s
  Decode speed: 350 MB/s

Method 3: Delta + Simple-9
  Compressed size: 12.3 MB (72.8% reduction)
  Encode speed: 38 MB/s
  Decode speed: 520 MB/s  ← Faster decode!

Method 4: Roaring Bitmaps
  Compressed size: 11.9 MB (73.7% reduction)
  Encode speed: 52 MB/s
  Decode speed: 890 MB/s  ← Fastest!
  (Best for intersection operations)

Sample postings list compression:
  Term: "the" (appears in 95% of docs)
  Original: 450 KB
  Delta+VByte: 28 KB (93.8% reduction!)

  Term: "xylophone" (appears in 3 docs)
  Original: 12 bytes
  Delta+VByte: 8 bytes (33% reduction)

$ ./index-compress benchmark

Query: "machine learning"
  Uncompressed index: 2.3 ms (disk read: 1.8 ms)
  Compressed index:   1.1 ms (disk read: 0.4 ms, decompress: 0.3 ms)

Result: Compression makes queries FASTER due to less I/O!

The Core Question You’re Answering

“How can search engines store billions of document IDs without running out of memory?”

The answer is compression. By exploiting the sorted nature of postings lists (using delta encoding) and the fact that small numbers are more common (using variable-length encoding), you can reduce index size by 70-90%.
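
As a minimal sketch of those two ideas combined (Python, illustrative names): convert the sorted postings list to gaps, then encode each gap with 7 data bits per byte, using the high bit as a continuation flag. Real implementations differ in details (for example, which byte carries the flag); this is just one variant.

# Minimal sketch: delta (gap) encoding + variable-byte encoding of a postings list.

def vbyte_encode(n: int) -> bytes:
    """Encode one non-negative integer, 7 data bits per byte.
    High bit set = more bytes follow; high bit clear = last byte."""
    out = []
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # continuation
        else:
            out.append(byte)          # final byte
            return bytes(out)

def vbyte_decode(data: bytes):
    """Decode a stream of vbyte-encoded integers."""
    values, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(n)
            n, shift = 0, 0
    return values

def compress_postings(doc_ids):
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return b"".join(vbyte_encode(g) for g in gaps)

def decompress_postings(data):
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids

postings = [5, 23, 89, 1000, 1003, 100000]
blob = compress_postings(postings)
print(len(blob), "bytes instead of", len(postings) * 4)   # 9 bytes vs 24
assert decompress_postings(blob) == postings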


Concepts You Must Understand First

Stop and research these before coding:

  1. Variable-Byte (VByte) Encoding
    • How do you encode an integer using continuation bits?
    • What’s the maximum overhead for small vs large numbers?
    • Why use the high bit as a continuation flag?
    • Book Reference: “Introduction to Information Retrieval” Ch. 5.3.1
  2. Delta/Gap Encoding
    • Why does delta encoding help sorted lists?
    • What’s the distribution of gap sizes in real postings?
    • When does delta encoding hurt?
    • Book Reference: “Introduction to Information Retrieval” Ch. 5.3
  3. Block-Based Compression
    • What are Simple-9 and PForDelta?
    • Why compress in blocks rather than one-at-a-time?
    • How does SIMD acceleration help?
    • Book Reference: “Managing Gigabytes” Ch. 3 or Lemire’s papers

Questions to Guide Your Design

Before implementing, think through these:

  1. API Design
    • Should compression be transparent to the index user?
    • How do you handle random access to compressed data?
    • Do you compress in memory, on disk, or both?
  2. Block Size
    • What’s the trade-off for different block sizes?
    • How do you handle the last partial block?
    • What metadata do you need per block?
  3. Benchmarking
    • How do you measure compression ratio fairly?
    • What’s the right metric: space, encode speed, or decode speed?
    • How do you test with realistic data?

Thinking Exercise

Manual VByte Encoding

Encode these integers using variable-byte encoding (7 bits of data per byte, high bit = continuation):

5, 127, 128, 16384, 2097152

Questions while encoding:

  • How many bytes does each number require?
  • What’s the pattern for numbers crossing byte boundaries?
  • What’s the maximum number encodable in 4 bytes?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Explain variable-byte encoding and when you’d use it.”
  2. “Why does delta encoding help compress postings lists?”
  3. “What’s the trade-off between compression ratio and decompression speed?”
  4. “How do roaring bitmaps work and when are they useful?”
  5. “How would you handle compressed random access?”

Hints in Layers

Hint 1: Start with VByte Implement variable-byte encoding first. It’s simpler than block-based methods.

Hint 2: Test Incrementally Encode and decode single integers before moving to arrays.

Hint 3: Use Benchmarks Create benchmark data with known characteristics (uniform, skewed, etc.).

Hint 4: Consider Libraries After building your own, compare against streamvbyte or roaring libraries.


Books That Will Help

Topic Book Chapter
Variable-byte encoding “Introduction to Information Retrieval” Ch. 5.3
Advanced compression “Managing Gigabytes” Ch. 3
Roaring bitmaps Lemire paper “Roaring Bitmaps”

Learning milestones:

  1. VByte encoding/decoding works → You understand variable-length encoding
  2. Delta + VByte compresses significantly → You understand gap encoding
  3. Compressed queries are faster → You understand the I/O trade-off

Project 7: Fuzzy Search with Levenshtein Automata

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: C++, Go, Java
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Automata Theory / String Algorithms
  • Software or Tool: Fuzzy search library
  • Main Book: “Flexible Pattern Matching in Strings” by Navarro & Raffinot

What you’ll build: A fuzzy search engine that finds matches with spelling errors using Levenshtein distance, efficiently via Levenshtein automata that integrate with your FST term dictionary.

Why it teaches full-text search: Typo tolerance is essential for good search UX. The naive approach (computing edit distance against every term) scales linearly with vocabulary size, which is far too slow for large dictionaries. A Levenshtein automaton accepts every string within the edit-distance budget, so you can intersect it with the term dictionary in a single traversal and visit only the small fraction of terms that could possibly match.

Core challenges you’ll face:

  • Understanding Levenshtein distance → maps to dynamic programming foundations
  • Building a Levenshtein automaton → maps to finite automata construction
  • Integrating with FST → maps to automaton intersection algorithms
  • Ranking fuzzy matches → maps to combining edit distance with relevance

Key Concepts:

  • Levenshtein Distance: Classic dynamic programming algorithm
  • Levenshtein Automata: Schulz & Mihov paper (2002)
  • FST Intersection: Lucene’s FuzzyQuery implementation

Difficulty: Expert Time estimate: 2-3 weeks Prerequisites: Project 1, Project 6, automata theory basics


Real World Outcome

Your search engine will gracefully handle typos:

Example Output:

$ ./fuzzy-search query "elefant" --max-edits 2

=== FUZZY QUERY: "elefant" (max 2 edits) ===

Automaton constructed: 847 states, 2,341 transitions
FST traversal: visited 1,247 of 125,000 terms (1.0%)

Matches found (sorted by edit distance, then score):

Edit Distance 1:
  "elegant" → 23 documents (Score: 8.7)
    └── 1 substitution: f→g

Edit Distance 2:
  "elephant" → 45 documents (Score: 12.3)
    └── 1 substitution (f→p) + 1 insertion (h)

Query time: 0.8 ms (vs 340 ms for naive approach)

$ ./fuzzy-search query "progamming" --auto-correct

=== AUTO-CORRECT SUGGESTION ===

Did you mean: "programming"? (1 edit: insert 'r')

Proceeding with corrected query...
Found 1,247 documents for "programming"

The Core Question You’re Answering

“How can a search engine instantly find ‘elephant’ when I type ‘elefant’, without checking every word in its vocabulary?”

Levenshtein automata are the answer. Instead of computing edit distance to each of 100,000 terms, you build a finite automaton that accepts all strings within edit distance N of your query, then intersect it with your term dictionary in a single traversal.
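
The automaton construction itself is in the Schulz & Mihov paper; the prerequisite (see Hint 1 below) is the classic dynamic-programming edit distance. A minimal Python sketch:

# Minimal sketch: classic Levenshtein distance via dynamic programming.
# dp[i][j] = edits needed to turn the first i chars of a into the first j chars of b.

def levenshtein(a: str, b: str) -> int:
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i                      # delete all of a[:i]
    for j in range(len(b) + 1):
        dp[0][j] = j                      # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # match or substitution
    return dp[len(a)][len(b)]

print(levenshtein("elefant", "elephant"))   # 2
print(levenshtein("kitten", "sitting"))     # 3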


Concepts You Must Understand First

Stop and research these before coding:

  1. Levenshtein Distance
    • What operations does edit distance count?
    • What’s the dynamic programming algorithm?
    • What’s the time/space complexity?
    • Book Reference: “Introduction to Algorithms” or any algorithms textbook
  2. Finite Automata (NFAs and DFAs)
    • What’s the difference between NFA and DFA?
    • How do you convert NFA to DFA?
    • What is automaton intersection?
    • Book Reference: “Introduction to the Theory of Computation” by Sipser
  3. Levenshtein Automata Construction
    • How do you build an automaton that accepts all strings within distance N?
    • What’s the state structure?
    • Why is this more efficient than pairwise comparison?
    • Book Reference: Schulz & Mihov (2002) paper

Questions to Guide Your Design

Before implementing, think through these:

  1. Automaton Size
    • How does automaton size grow with query length and edit distance?
    • Should you limit max edit distance?
    • How do you handle very long queries?
  2. Integration Strategy
    • How do you traverse FST and automaton simultaneously?
    • What’s the intersection algorithm?
    • How do you collect matching terms?
  3. Ranking Fuzzy Matches
    • Should exact matches always rank first?
    • How do you combine edit distance with BM25 score?
    • What weight does each edit add?

Thinking Exercise

Manual Edit Distance Calculation

Calculate the Levenshtein distance between:

  • “kitten” and “sitting”
  • “algorithm” and “altruistic”

Fill in the DP table for the first pair.

Questions while calculating:

  • Which cells represent substitutions vs insertions vs deletions?
  • What’s the minimum sequence of operations?
  • How would an automaton represent “all strings within distance 2 of kitten”?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Explain Levenshtein distance and the DP algorithm.”
  2. “Why are Levenshtein automata more efficient than pairwise comparison?”
  3. “How would you implement typo tolerance in a search engine?”
  4. “What’s the relationship between edit distance and search quality?”
  5. “How do you balance fuzzy matching with relevance ranking?”

Hints in Layers

Hint 1: Start with DP Implement the basic Levenshtein DP algorithm first. Understand it deeply.

Hint 2: Use Existing Libraries The fst crate in Rust has Levenshtein automata built in. Study its API.

Hint 3: Limit Edit Distance Edit distance 2 catches most typos. Beyond that, you get too many false positives.

Hint 4: Study Lucene Lucene’s FuzzyQuery is a masterclass implementation. Read the source.


Books That Will Help

Topic Book Chapter
Edit distance DP “Introduction to Algorithms” (CLRS) Ch. 15
Automata theory “Introduction to Theory of Computation” Ch. 1-2
Fuzzy matching Schulz & Mihov paper “Fast String Correction”

Learning milestones:

  1. DP edit distance works → You understand the basic algorithm
  2. Automaton traversal works → You understand NFA/DFA
  3. FST intersection finds matches → You’ve implemented production-quality fuzzy search

Project 8: Build a Trie and FST from Scratch

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: C, C++, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Data Structures / Automata
  • Software or Tool: Prefix data structure library
  • Main Book: “Algorithms, Fourth Edition” by Sedgewick & Wayne

What you’ll build: A trie data structure for prefix matching, then evolve it into a Finite State Transducer (FST) that shares both prefixes AND suffixes—the data structure Lucene uses for its term dictionary.

Why it teaches full-text search: The term dictionary is consulted on every single query. It must be FAST and SMALL. Understanding tries and FSTs teaches you how Lucene achieves sub-millisecond term lookups with million-term vocabularies while using minimal RAM.

Core challenges you’ll face:

  • Implementing a basic trie → maps to recursive tree structures and pointer manipulation
  • Adding value outputs (trie → transducer) → maps to understanding FST output semantics
  • Sharing suffixes (trie → FST) → maps to understanding automaton minimization
  • Persisting to disk efficiently → maps to compact binary representation

Key Concepts:

  • Tries: “Algorithms” Ch. 5.2 by Sedgewick
  • FST Construction: Mihov & Maurel paper (2001)
  • Lucene’s FST: Mike McCandless blog series

Difficulty: Expert Time estimate: 2-3 weeks Prerequisites: Basic data structures, recursion, pointers/references


Real World Outcome

You’ll have your own FST implementation that rivals Lucene’s:

Example Output:

$ ./fst-builder build --input words.txt --output words.fst

=== FST BUILD ===

Input: 235,886 words
Building trie...
  Nodes: 1,847,342
  Memory: 89.3 MB

Minimizing to FST (sharing suffixes)...
  States: 328,451 (82% reduction!)
  Arcs: 412,893
  Memory: 4.2 MB (95% reduction from naive trie!)

Writing to disk...
  File size: 3.8 MB

$ ./fst-builder lookup --fst words.fst "algorithm"
Found: "algorithm" → output: 42891 (postings offset)
Lookup time: 0.4 μs

$ ./fst-builder prefix --fst words.fst "algo"
Prefix matches (5 found):
  "algo" → 42890
  "algorithm" → 42891
  "algorithms" → 42892
  "algorithmic" → 42893
  "algorithmically" → 42894

$ ./fst-builder stats --fst words.fst

=== FST STATISTICS ===

Total states: 328,451
Accepting states: 235,886
Total arcs: 412,893
Avg arcs per state: 1.26

Compression analysis:
  Raw data (term + offset): 4.7 MB
  Trie representation: 89.3 MB (19x larger!)
  FST representation: 3.8 MB (19% smaller than raw!)

How is FST smaller than raw data?
  → Prefix sharing: "algorithm", "algorithms" share "algorithm"
  → Suffix sharing: "running", "jumping" share "ing"
  → Variable-length arc encoding

The Core Question You’re Answering

“How does Lucene look up any of its millions of terms in microseconds while using only a few MB of RAM?”

The Finite State Transducer is the answer. It’s like a trie (shares prefixes) but also shares suffixes, making it incredibly compact. And it outputs values (like file offsets) as you traverse, making it a complete term → offset map.
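
Suffix sharing and outputs are what make the FST special, but the starting point (see Hint 1 below) is a plain trie. A minimal Python sketch of insert, exact lookup, and prefix enumeration, with illustrative values standing in for postings offsets:

# Minimal trie sketch: insert, lookup, and prefix enumeration.
# An FST adds per-arc outputs and merges equivalent suffix states on top of this.

class TrieNode:
    def __init__(self):
        self.children = {}      # char -> TrieNode
        self.value = None       # payload for complete words (e.g. postings offset)
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, value):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word, node.value = True, value

    def lookup(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.value if node.is_word else None

    def with_prefix(self, prefix):
        """Yield (word, value) for every word starting with prefix."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return
        stack = [(prefix, node)]
        while stack:
            word, n = stack.pop()
            if n.is_word:
                yield word, n.value
            for ch, child in sorted(n.children.items()):
                stack.append((word + ch, child))

t = Trie()
for i, w in enumerate(["algo", "algorithm", "algorithms", "algorithmic"]):
    t.insert(w, 42890 + i)
print(t.lookup("algorithm"))          # 42891
print(list(t.with_prefix("algo")))    # all four entries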


Concepts You Must Understand First

Stop and research these before coding:

  1. Trie Basics
    • What is the structure of a trie node?
    • How do you insert, lookup, and delete?
    • What’s the space overhead of a naive trie?
    • Book Reference: “Algorithms” Ch. 5.2 by Sedgewick
  2. Automaton Minimization
    • What makes two automaton states equivalent?
    • What’s the algorithm for minimization?
    • How much does minimization reduce size?
    • Book Reference: “Introduction to Theory of Computation” by Sipser
  3. FST Output Semantics
    • How do outputs accumulate along paths?
    • What’s the difference between FST and DAWG (FSA)?
    • How do you output shared prefixes efficiently?
    • Book Reference: Mike McCandless blog: “Using FSTs in Lucene”

Questions to Guide Your Design

Before implementing, think through these:

  1. Node Representation
    • How do you represent children: array, hashmap, or sorted list?
    • What’s the trade-off between memory and speed?
    • How do you represent arc labels efficiently?
  2. Construction Algorithm
    • Can you build incrementally or must you have all inputs upfront?
    • Lucene requires sorted input—why?
    • How do you minimize on-the-fly vs post-construction?
  3. Persistence
    • What binary format should you use?
    • How do you memory-map the FST?
    • What metadata do you need in the file header?

Thinking Exercise

Build a Mini-FST by Hand

Given these words: cat, car, card, care, careful, carefully

  1. Draw the trie
  2. Identify shared suffixes
  3. Draw the minimized FST
  4. Assign outputs (values 1-6) and show output distribution

Questions while building:

  • Which states can be merged in the trie?
  • How do you split the output between shared arcs?
  • How many states/arcs in trie vs FST?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What is a trie and what are its time/space characteristics?”
  2. “How does an FST differ from a trie?”
  3. “Why does Lucene require sorted input for FST construction?”
  4. “How do FST outputs work and how are they distributed across arcs?”
  5. “What’s the difference between an FSA and an FST?”

Hints in Layers

Hint 1: Start with Trie Build a working trie first. Don’t worry about minimization yet.

Hint 2: Study the fst Crate Rust’s fst crate by BurntSushi is an excellent reference implementation.

Hint 3: Sorted Input Build from sorted input. This enables single-pass construction with minimization.

Hint 4: Test with Small Vocabularies Debug with 5-10 words where you can verify by hand.


Books That Will Help

Topic Book Chapter
Tries “Algorithms” by Sedgewick Ch. 5.2
Automata minimization “Theory of Computation” by Sipser Ch. 1
FST implementation McCandless blog series “Using FSTs in Lucene”

Learning milestones:

  1. Trie works for insert/lookup → You understand prefix trees
  2. FST minimizes and shares suffixes → You understand automaton minimization
  3. FST persists and memory-maps → You’re ready to build Lucene-scale systems

Project 9: Segment-Based Index with Background Merging

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: Go, Java, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Systems Programming / Concurrency
  • Software or Tool: Search engine core
  • Main Book: “Designing Data-Intensive Applications” by Kleppmann

What you’ll build: A Lucene-style segmented index where new documents go to in-memory buffers, periodically flush to immutable segments, and background threads merge segments together.

Why it teaches full-text search: Real search engines never update documents in place—they use append-only segments. Understanding this teaches you how Elasticsearch achieves both fast indexing AND fast search, and why “near-real-time” search exists.

Core challenges you’ll face:

  • Managing concurrent reads and writes → maps to understanding lockless data structures
  • Implementing segment merging → maps to understanding merge policies
  • Handling deletes with tombstones → maps to understanding soft deletes and compaction
  • Making search see new documents → maps to understanding refresh and commit

Key Concepts:

  • LSM Trees: “Designing Data-Intensive Applications” Ch. 3
  • Lucene Segments: Lucene documentation
  • Merge Policies: “Information Retrieval: Implementing and Evaluating” Ch. 5

Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Projects 1-6, concurrency experience


Real World Outcome

Example Output:

$ ./segment-index start --data-dir ./index

=== SEGMENT-BASED INDEX ===

In-memory buffer: 0 documents
Segments on disk: 0
Background merger: running

$ ./segment-index add-batch ./docs/*.txt
Adding 10,000 documents...
Buffer filled, flushing to segment_001.seg
Buffer filled, flushing to segment_002.seg
10,000 documents indexed in 2.3 seconds

$ ./segment-index status

Current state:
  In-memory buffer: 423 documents (not yet searchable)
  Segments:
    segment_001.seg: 4,500 docs, 12.3 MB
    segment_002.seg: 4,500 docs, 11.9 MB
    segment_003.seg: 577 docs, 1.4 MB

  Background merge in progress:
    Merging segment_001 + segment_002 → segment_004
    Progress: 67%

$ ./segment-index refresh
Refreshing search view...
423 documents now searchable (flushed to segment_005.seg)

$ ./segment-index delete --query "status:obsolete"
Marked 234 documents as deleted (tombstoned)
Documents still in segments but filtered from results

The Core Question You’re Answering

“How does Elasticsearch add new documents instantly without rebuilding the entire index?”

The answer is segments. New documents go to an in-memory buffer, which periodically flushes to a new immutable segment on disk. Searches query ALL segments and merge results. Background merging consolidates small segments into larger ones.
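
A minimal Python sketch of that write path (structures are illustrative; real segments live on disk and are merged in the background): documents accumulate in a buffer, a refresh freezes the buffer into an immutable segment, deletes are tombstones, and a search fans out over every segment.

# Minimal sketch of segment-based indexing: in-memory buffer -> immutable segments.
# Here segments are frozen in-memory inverted indexes instead of on-disk files.

class Segment:
    def __init__(self, postings):
        self.postings = postings            # term -> set of doc_ids (immutable)

    def search(self, term):
        return self.postings.get(term, set())

class SegmentedIndex:
    def __init__(self, flush_every=3):
        self.flush_every = flush_every
        self.buffer = []                    # (doc_id, text), not yet searchable
        self.segments = []
        self.deleted = set()                # tombstoned doc_ids

    def add(self, doc_id, text):
        self.buffer.append((doc_id, text))
        if len(self.buffer) >= self.flush_every:
            self.refresh()

    def refresh(self):
        """Flush the buffer into a new immutable segment (makes docs searchable)."""
        if not self.buffer:
            return
        postings = {}
        for doc_id, text in self.buffer:
            for term in text.lower().split():
                postings.setdefault(term, set()).add(doc_id)
        self.segments.append(Segment(postings))
        self.buffer = []

    def delete(self, doc_id):
        self.deleted.add(doc_id)            # segments stay untouched (soft delete)

    def search(self, term):
        hits = set()
        for seg in self.segments:           # scatter across all segments
            hits |= seg.search(term)
        return hits - self.deleted          # filter tombstones at query time

idx = SegmentedIndex()
idx.add(1, "rust search engine")
idx.add(2, "segment merging in lucene")
idx.add(3, "near real time search")
idx.add(4, "search clusters")              # sits in the buffer until refresh
print(idx.search("search"))                # {1, 3} -- doc 4 not yet visible
idx.refresh()
idx.delete(1)
print(idx.search("search"))                # {3, 4}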


Concepts You Must Understand First

  1. Log-Structured Merge Trees (LSM)
    • Why are writes faster than updates?
    • What is write amplification?
    • Book Reference: “Designing Data-Intensive Applications” Ch. 3
  2. Segment Merging
    • Why merge segments?
    • What’s a merge policy?
    • Book Reference: Lucene documentation on TieredMergePolicy
  3. Concurrency Control
    • How do readers see consistent snapshots?
    • How do you delete without blocking readers?
    • Book Reference: “Designing Data-Intensive Applications” Ch. 7

The Interview Questions They’ll Ask

  1. “What is a segment in Lucene and why are they immutable?”
  2. “How does Elasticsearch achieve near-real-time search?”
  3. “What is a merge policy and why does it matter?”
  4. “How are deletes handled in a segment-based index?”
  5. “What’s the difference between refresh and commit?”

Learning milestones:

  1. Multi-segment search works → You understand segment architecture
  2. Background merging runs without blocking → You understand concurrency
  3. Deletes filter correctly → You understand tombstones

Project 10: PostgreSQL Full-Text Search Integration

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: SQL / Python
  • Alternative Programming Languages: Go, Rust (with postgres libs)
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Databases / SQL
  • Software or Tool: PostgreSQL
  • Main Book: PostgreSQL Documentation (Ch. 12: Full Text Search)

What you’ll build: A search application using PostgreSQL’s built-in full-text search with tsvector, tsquery, GIN indexes, and custom ranking—understanding when PostgreSQL is “good enough” vs when you need Elasticsearch.

Why it teaches full-text search: PostgreSQL’s FTS implementation is elegant and powerful. Understanding it teaches you that you don’t always need a separate search engine, and shows how GIN indexes implement inverted indexes inside a relational database.

Core challenges you’ll face:

  • Understanding tsvector and tsquery → maps to tokenization and query parsing
  • Creating GIN indexes → maps to understanding inverted index storage
  • Tuning text search configuration → maps to analyzer configuration
  • Combining FTS with SQL queries → maps to practical search integration

Key Concepts:

  • tsvector/tsquery: PostgreSQL Ch. 12.1-12.2
  • GIN Indexes: PostgreSQL Ch. 12.9
  • Ranking: PostgreSQL Ch. 12.3

Difficulty: Intermediate Time estimate: 1 week Prerequisites: Basic SQL, Project 1 understanding


Real World Outcome

Example Output:

-- Create a table with full-text search
CREATE TABLE articles (
    id SERIAL PRIMARY KEY,
    title TEXT,
    body TEXT,
    tsv tsvector GENERATED ALWAYS AS (
        setweight(to_tsvector('english', title), 'A') ||
        setweight(to_tsvector('english', body), 'B')
    ) STORED
);

CREATE INDEX articles_tsv_idx ON articles USING GIN (tsv);

-- Search with ranking
SELECT
    id,
    title,
    ts_rank(tsv, query) AS rank,
    ts_headline('english', body, query) AS snippet
FROM
    articles,
    to_tsquery('english', 'python & machine & learning') query
WHERE
    tsv @@ query
ORDER BY
    rank DESC
LIMIT 10;

-- Results:
 id |              title              | rank  |         snippet
----+---------------------------------+-------+---------------------------
 42 | Python ML Tutorial              | 0.875 | ...using <b>Python</b>...
 17 | Machine Learning Basics         | 0.654 | ...<b>machine learning</b>...

The Core Question You’re Answering

“When do I need Elasticsearch, and when is PostgreSQL full-text search good enough?”

PostgreSQL FTS is sufficient for many applications: blogs, documentation sites, e-commerce with <1M products. Understanding its capabilities helps you make informed decisions.


The Interview Questions They’ll Ask

  1. “What is a tsvector in PostgreSQL?”
  2. “When would you use PostgreSQL FTS vs Elasticsearch?”
  3. “How does a GIN index differ from a B-tree index?”
  4. “How do you handle multi-language search in PostgreSQL?”
  5. “What are the limitations of PostgreSQL full-text search?”

Learning milestones:

  1. Basic FTS queries work → You understand tsvector/tsquery
  2. GIN index speeds up queries → You understand indexing
  3. You can articulate PostgreSQL vs Elasticsearch trade-offs → You’re ready for architecture decisions

Project 11: Query Parser with Boolean Logic

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: Python, Go, Java
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Parsing / Language Theory
  • Software or Tool: Parser library
  • Main Book: “Writing a C Compiler” by Nora Sandler (parsing chapters)

What you’ll build: A query parser that handles boolean operators (AND, OR, NOT), parentheses for grouping, phrase queries, field-specific searches (title:python), and wildcards—like Lucene’s query syntax.

Why it teaches full-text search: The gap between user input and executable query is huge. Users type python AND (web OR api) title:"tutorial", and your system must parse this into a query plan that efficiently executes against the index.

Core challenges you’ll face:

  • Lexical analysis (tokenizing queries) → maps to understanding tokenizers
  • Parsing with operator precedence → maps to recursive descent or Pratt parsing
  • Building a query AST → maps to abstract syntax tree construction
  • Optimizing query execution → maps to query planning

Key Concepts:

  • Recursive Descent Parsing: “Writing a C Compiler” Ch. 2
  • Pratt Parsing: Pratt’s original paper or Crafting Interpreters Ch. 17
  • Query Optimization: “Introduction to Information Retrieval” Ch. 7

Difficulty: Advanced Time estimate: 2 weeks Prerequisites: Basic parsing knowledge, Project 1


Real World Outcome

Example Output:

$ ./query-parser parse 'python AND (web OR api) title:"machine learning"'

=== QUERY PARSE TREE ===

Input: python AND (web OR api) title:"machine learning"

Tokens:
  [TERM:python] [AND] [LPAREN] [TERM:web] [OR] [TERM:api] [RPAREN]
  [FIELD:title] [PHRASE:"machine learning"]

AST:
  AND
  ├── TERM("python", field=*)
  ├── OR
  │   ├── TERM("web", field=*)
  │   └── TERM("api", field=*)
  └── PHRASE("machine learning", field=title)

Optimized Plan:
  1. Evaluate PHRASE on title field (most selective)
  2. Intersect with "python" postings
  3. Union "web" and "api" postings
  4. Intersect result of step 2 with step 3

Estimated cost: 0.34 (fast query)

$ ./query-parser parse 'foo AND NOT bar OR baz'

=== OPERATOR PRECEDENCE ===

Without parentheses, parsed as:
  (foo AND (NOT bar)) OR baz

NOT binds tighter than AND, which binds tighter than OR.

If you meant: foo AND NOT (bar OR baz)
Use explicit parentheses!

The Core Question You’re Answering

“How does a search engine turn ‘python AND (web OR api)’ into an efficient query plan?”

Parsing. You must tokenize the query string, build an abstract syntax tree respecting operator precedence, and then optimize execution order based on selectivity estimates.
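
A minimal recursive-descent sketch in Python for the boolean core (terms, AND, OR, NOT, parentheses); fields, phrases, and wildcards would be layered on top. The grammar and names are illustrative:

# Minimal sketch: recursive-descent parser for boolean queries.
# Grammar (lowest to highest precedence):
#   or_expr  := and_expr ("OR" and_expr)*
#   and_expr := not_expr ("AND" not_expr)*
#   not_expr := "NOT" not_expr | "(" or_expr ")" | TERM
import re

def tokenize(query):
    return re.findall(r'\(|\)|\w+', query)

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected=None):
        tok = self.tokens[self.pos]
        if expected and tok != expected:
            raise SyntaxError(f"expected {expected}, got {tok}")
        self.pos += 1
        return tok

    def parse(self):
        node = self.or_expr()
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()}")
        return node

    def or_expr(self):
        node = self.and_expr()
        while self.peek() == "OR":
            self.eat("OR")
            node = ("OR", node, self.and_expr())
        return node

    def and_expr(self):
        node = self.not_expr()
        while self.peek() == "AND":
            self.eat("AND")
            node = ("AND", node, self.not_expr())
        return node

    def not_expr(self):
        if self.peek() == "NOT":
            self.eat("NOT")
            return ("NOT", self.not_expr())
        if self.peek() == "(":
            self.eat("(")
            node = self.or_expr()
            self.eat(")")
            return node
        return ("TERM", self.eat())

print(Parser(tokenize("foo AND NOT bar OR baz")).parse())
# ('OR', ('AND', ('TERM', 'foo'), ('NOT', ('TERM', 'bar'))), ('TERM', 'baz'))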


The Interview Questions They’ll Ask

  1. “How would you implement a query parser for boolean search?”
  2. “What is operator precedence and how do you handle it?”
  3. “How would you extend the parser for new syntax (e.g., boosting)?”
  4. “What optimizations can you do on a query AST before execution?”
  5. “How do you handle user errors in queries gracefully?”

Learning milestones:

  1. Boolean queries parse correctly → You understand recursive descent
  2. Operator precedence works → You understand grammar design
  3. Query execution is optimized → You understand query planning

Project 12: Autocomplete with Prefix Matching

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: TypeScript
  • Alternative Programming Languages: Rust, Python, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Data Structures / UX
  • Software or Tool: Autocomplete service
  • Main Book: “Algorithms” by Sedgewick (Tries chapter)

What you’ll build: A fast autocomplete system that suggests completions as users type, using tries or FSTs for prefix matching, with popularity-weighted ranking.

Why it teaches full-text search: Autocomplete is the first thing users interact with. It must be FAST (<50ms) and SMART (suggest popular/relevant completions). This teaches you prefix data structures and the importance of precomputation.

Core challenges you’ll face:

  • Fast prefix lookup → maps to trie/FST traversal
  • Ranking suggestions → maps to popularity and recency scoring
  • Handling typos in prefix → maps to fuzzy prefix matching
  • Scaling to millions of terms → maps to memory-efficient data structures

Key Concepts:

  • Prefix Trees: “Algorithms” Ch. 5.2
  • Completion Scoring: Meilisearch’s approach
  • Edge N-grams: Elasticsearch documentation

Difficulty: Intermediate Time estimate: 1 week Prerequisites: Project 8 (Tries/FST)


Real World Outcome

Example Output:

$ ./autocomplete serve --data products.json --port 8080
Loading 1.2M products...
Building prefix index...
Ready! Listening on :8080

$ curl "localhost:8080/suggest?q=iph"

{
  "query": "iph",
  "suggestions": [
    {"text": "iphone 15 pro", "score": 0.95, "category": "Electronics"},
    {"text": "iphone 15", "score": 0.92, "category": "Electronics"},
    {"text": "iphone case", "score": 0.78, "category": "Accessories"},
    {"text": "iphone charger", "score": 0.71, "category": "Accessories"}
  ],
  "time_ms": 2.3
}

$ curl "localhost:8080/suggest?q=iphon&fuzzy=true"

{
  "query": "iphon",
  "suggestions": [
    {"text": "iphone 15 pro", "score": 0.93, "fuzzy_match": true},
    {"text": "iphone 15", "score": 0.90, "fuzzy_match": true}
  ]
}

The Core Question You’re Answering

“How does Google show suggestions before you even finish typing?”

Prefix data structures (tries/FSTs) enable instant prefix lookup. Combined with popularity scoring and precomputation, you can suggest completions in under 10ms.
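
A minimal sketch of popularity-weighted prefix suggestion (shown in Python for brevity, though the project targets TypeScript); a production system would back this with the trie/FST from Project 8 rather than a sorted list:

# Minimal sketch: prefix autocomplete over a sorted term list with popularity scores.
# Because the list is sorted, all completions sharing a prefix form a contiguous run.
import bisect

class Autocomplete:
    def __init__(self, term_scores):
        # term_scores: {completion: popularity score}, illustrative data
        self.terms = sorted(term_scores)
        self.scores = term_scores

    def suggest(self, prefix, limit=4):
        prefix = prefix.lower()
        start = bisect.bisect_left(self.terms, prefix)
        matches = []
        for term in self.terms[start:]:
            if not term.startswith(prefix):
                break
            matches.append(term)
        matches.sort(key=lambda t: self.scores[t], reverse=True)   # popularity ranking
        return matches[:limit]

ac = Autocomplete({
    "iphone 15 pro": 0.95,
    "iphone 15": 0.92,
    "iphone case": 0.78,
    "iphone charger": 0.71,
    "ipad": 0.80,
})
print(ac.suggest("iph"))   # ['iphone 15 pro', 'iphone 15', 'iphone case', 'iphone charger']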


The Interview Questions They’ll Ask

  1. “How would you design an autocomplete system for 1 billion queries?”
  2. “What data structure would you use and why?”
  3. “How do you handle typos in autocomplete?”
  4. “How do you rank suggestions—just popularity, or something smarter?”
  5. “How do you update the suggestion index in real-time?”

Learning milestones:

  1. Prefix suggestions work fast → You understand tries
  2. Ranking feels natural → You understand scoring
  3. System handles typos → You understand fuzzy prefix matching

Project 13: Faceted Search and Aggregations

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Python, Java
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: E-commerce / Analytics
  • Software or Tool: Search with facets
  • Main Book: “Solr in Action” by Grainger & Potter

What you’ll build: A search engine that computes facet counts (e.g., “Color: Red (42), Blue (38)”) and aggregations (e.g., “Average price: $45.99”) alongside search results—like every e-commerce site.

Why it teaches full-text search: Facets transform search from “find things” to “explore a catalog.” Understanding how to efficiently compute facet counts teaches you about doc values, field data, and the trade-offs between indexing and query-time computation.

Core challenges you’ll face:

  • Computing facet counts efficiently → maps to understanding doc values vs inverted index
  • Hierarchical facets (Category > Subcategory) → maps to tree aggregation
  • Range facets (price ranges) → maps to numeric indexing
  • Combining facets with search filters → maps to filter caching

Difficulty: Advanced Time estimate: 2 weeks Prerequisites: Project 1, Project 4


Real World Outcome

Example Output:

$ ./facet-search query "laptop" --facets category,brand,price_range

=== SEARCH: "laptop" ===

Results: 1,247 laptops found

FACETS:

Category:
  ├── Electronics (1,247)
  │   ├── Computers (1,180)
  │   │   ├── Laptops (1,180)
  │   │   └── Accessories (67)

Brand:
  ├── Apple (234)
  ├── Dell (189)
  ├── Lenovo (156)
  ├── HP (142)
  └── [+12 more...]

Price Range:
  ├── $0-500 (312)
  ├── $500-1000 (456)
  ├── $1000-2000 (367)
  └── $2000+ (112)

AGGREGATIONS:
  ├── Avg Price: $1,234.56
  ├── Min Price: $299.00
  └── Max Price: $4,999.00

$ ./facet-search query "laptop" --filter "brand:Apple" --filter "price:[1000 TO 2000]"

=== FILTERED SEARCH ===

Results: 87 Apple laptops between $1000-$2000

Facet counts updated to reflect filters!

The Core Question You’re Answering

“How does Amazon show ‘1,234 results’ with facet breakdowns by category, brand, and price—all in milliseconds?”

The answer is doc values (columnar storage for field values) and careful pre-computation. Facet counts are computed by scanning a compact column of values, not by re-querying the inverted index.
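
A minimal Python sketch of query-time facet counting over a columnar layout (field names and values are illustrative): given the doc IDs that matched, plain-value and range facets are computed by scanning the value columns, not by re-querying the inverted index.

# Minimal sketch: facet counts from columnar per-document values ("doc values").
from collections import Counter

# Columnar storage: one array per field, indexed by doc_id (illustrative layout).
doc_values = {
    "brand": ["Apple", "Dell", "Apple", "Lenovo", "Apple", "Dell"],
    "price": [1299, 899, 1999, 650, 2399, 1100],
}

def facet_counts(matching_doc_ids, field):
    """Count field values over just the documents that matched the query."""
    return Counter(doc_values[field][doc_id] for doc_id in matching_doc_ids)

def range_facets(matching_doc_ids, field, edges):
    """Bucket numeric values into [edges[i], edges[i+1]) ranges."""
    buckets = Counter()
    for doc_id in matching_doc_ids:
        v = doc_values[field][doc_id]
        for lo, hi in zip(edges, edges[1:]):
            if lo <= v < hi:
                buckets[f"${lo}-{hi}"] += 1
                break
    return buckets

matches = [0, 1, 2, 4, 5]                      # doc_ids returned by the search
print(facet_counts(matches, "brand"))          # Counter({'Apple': 3, 'Dell': 2})
print(range_facets(matches, "price", [0, 1000, 2000, 5000]))
# Counter({'$1000-2000': 3, '$0-1000': 1, '$2000-5000': 1})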


The Interview Questions They’ll Ask

  1. “How would you implement faceted search efficiently?”
  2. “What’s the difference between doc values and the inverted index?”
  3. “How do you handle hierarchical facets?”
  4. “How do facet counts update when filters are applied?”
  5. “What’s the performance trade-off for real-time vs cached facets?”

Learning milestones:

  1. Facet counts are accurate → You understand doc values
  2. Filtering updates facets correctly → You understand filter interaction
  3. Performance is acceptable → You understand caching strategies

Project 14: Distributed Search Cluster

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Distributed Systems
  • Software or Tool: Distributed search engine
  • Main Book: “Designing Data-Intensive Applications” by Kleppmann

What you’ll build: A multi-node search cluster with sharding, replication, query routing, and result merging—a mini Elasticsearch.

Why it teaches full-text search: Single-node search is easy. Distributed search is HARD. This project teaches you the scatter-gather pattern, consistency trade-offs, and why Elasticsearch configuration is so complex.

Core challenges you’ll face:

  • Sharding data across nodes → maps to partition strategies
  • Replicating for fault tolerance → maps to consistency models
  • Routing queries to correct shards → maps to coordinator pattern
  • Merging ranked results from shards → maps to distributed top-K

Difficulty: Master Time estimate: 1 month+ Prerequisites: All previous projects, distributed systems experience


Real World Outcome

Example Output:

$ ./cluster-search start --nodes 3 --shards 6 --replicas 1

=== DISTRIBUTED SEARCH CLUSTER ===

Starting node-1 on :9200
Starting node-2 on :9201
Starting node-3 on :9202

Shard allocation:
  node-1: shard-0 (P), shard-1 (P), shard-3 (R), shard-5 (R)
  node-2: shard-2 (P), shard-3 (P), shard-0 (R), shard-4 (R)
  node-3: shard-4 (P), shard-5 (P), shard-1 (R), shard-2 (R)

Cluster health: GREEN (all shards allocated)

$ ./cluster-search index --bulk ./million_docs.json
Indexing 1,000,000 documents...
Routing to shards by hash(doc_id) % 6
  Shard 0: 166,823 docs
  Shard 1: 166,412 docs
  ...
Complete! 1M docs indexed in 45.3 seconds

$ ./cluster-search query "distributed systems" --explain

=== DISTRIBUTED QUERY ===

Coordinator: node-1
Query broadcast to: shard-0, shard-1, shard-2, shard-3, shard-4, shard-5

Shard responses:
  shard-0 (node-1): 234 hits, top score 12.4, took 23ms
  shard-1 (node-1): 189 hits, top score 11.2, took 21ms
  shard-2 (node-2): 201 hits, top score 13.1, took 25ms
  ...

Merging top-10 from all shards...
Final results: 1,247 total hits

$ ./cluster-search kill node-2
Node-2 killed!

$ ./cluster-search status
Cluster health: YELLOW
  shard-2: Promoting replica on node-3 to primary
  shard-3: Promoting replica on node-1 to primary

Rebalancing...
Cluster health: GREEN

The Core Question You’re Answering

“How does Elasticsearch search across billions of documents on hundreds of nodes in milliseconds?”

Sharding divides the data; replication provides fault tolerance; the scatter-gather pattern parallelizes search; and careful result merging produces globally correct rankings.
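
A minimal Python sketch of the gather step (shard contents and scores are illustrative): the coordinator collects each shard's local top-k and merges by score; the global top-k is correct because every shard returned at least k candidates.

# Minimal sketch: scatter-gather with distributed top-k merging.
import heapq

# Each "shard" returns its local top-k as (score, doc_id) pairs (illustrative data).
shard_results = {
    "shard-0": [(12.4, "doc-17"), (9.8, "doc-3"), (7.1, "doc-44")],
    "shard-1": [(11.2, "doc-90"), (10.5, "doc-12"), (6.0, "doc-8")],
    "shard-2": [(13.1, "doc-55"), (8.2, "doc-61"), (5.9, "doc-70")],
}

def scatter_gather(shard_results, k):
    """Merge each shard's local top-k into a global top-k by score."""
    all_hits = []
    for shard, hits in shard_results.items():        # the "scatter" already happened
        for score, doc_id in hits[:k]:               # each shard sends at most k hits
            all_hits.append((score, doc_id, shard))
    return heapq.nlargest(k, all_hits)               # "gather": global merge by score

for score, doc_id, shard in scatter_gather(shard_results, 3):
    print(f"{doc_id}  score={score}  (from {shard})")
# doc-55 (13.1), doc-17 (12.4), doc-90 (11.2)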


The Interview Questions They’ll Ask

  1. “How would you design a distributed search system?”
  2. “What happens when a node fails during a query?”
  3. “How do you ensure globally correct rankings across shards?”
  4. “What’s the trade-off between more shards and query latency?”
  5. “How does replication interact with consistency?”

Learning milestones:

  1. Multi-node queries work → You understand scatter-gather
  2. Node failure is handled → You understand replication
  3. Rankings are globally correct → You understand distributed top-K

Project 15: Hybrid Search (Keyword + Vector)

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 4: Expert
  • Knowledge Area: ML / Embeddings / Modern Search
  • Software or Tool: Hybrid search engine
  • Main Book: “AI Engineering” by Chip Huyen

What you’ll build: A search engine that combines traditional keyword search (BM25) with semantic vector search (embeddings), using reciprocal rank fusion to merge results—the future of search.

Why it teaches full-text search: Modern search isn’t just keywords OR vectors—it’s both. Understanding how to combine them teaches you the limits of each approach and prepares you for RAG (Retrieval-Augmented Generation) systems.

Core challenges you’ll face:

  • Generating embeddings → maps to understanding transformer models
  • Approximate nearest neighbor search → maps to HNSW, IVF indexes
  • Fusion of keyword and vector results → maps to RRF, score normalization
  • Balancing precision and recall → maps to hybrid tuning

Difficulty: Expert Time estimate: 2-3 weeks Prerequisites: Projects 1-4, basic ML understanding


Real World Outcome

Example Output:

$ ./hybrid-search index --docs ./knowledge_base.json --embed-model all-MiniLM-L6-v2

=== HYBRID INDEX BUILD ===

Indexing 50,000 documents...
Building BM25 inverted index...
Generating embeddings (384 dimensions)...
Building HNSW vector index...

Index complete!
  BM25 index: 12.3 MB
  Vector index: 78.4 MB (50K × 384 × 4 bytes + overhead)

$ ./hybrid-search query "how to deploy machine learning models to production"

=== HYBRID SEARCH ===

BM25 Results (keyword match):
  1. "Deploying ML Models with Docker" (BM25: 18.4)
  2. "Production Machine Learning" (BM25: 16.2)
  3. "Model Deployment Guide" (BM25: 15.1)

Vector Results (semantic similarity):
  1. "Getting Your Models Ready for Real Users" (sim: 0.87)
     └── Note: No keyword "deploy" but semantically relevant!
  2. "From Jupyter to Production" (sim: 0.84)
  3. "MLOps Best Practices" (sim: 0.82)

Reciprocal Rank Fusion (k=60):
  1. "Deploying ML Models with Docker" (RRF: 0.032)
  2. "Getting Your Models Ready for Real Users" (RRF: 0.029)
  3. "Production Machine Learning" (RRF: 0.028)

Vector search found documents that keyword search missed!
Keyword search provided precision for exact terms!

The Core Question You’re Answering

“Why do modern search engines combine keywords and embeddings, and how do you merge their results?”

Keywords are precise but miss synonyms and paraphrases. Vectors capture semantics but struggle with exact matches. Combining them gives you the best of both worlds—and is essential for RAG systems.
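
A minimal Python sketch of reciprocal rank fusion, using the k=60 convention from the example above (document names are illustrative): each result list contributes 1/(k + rank) per document, and documents are re-sorted by the summed score.

# Minimal sketch: reciprocal rank fusion (RRF) of keyword and vector rankings.
# score(d) = sum over result lists of 1 / (k + rank_of_d_in_that_list)

def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

bm25_ranking = ["deploy-docker", "production-ml", "deployment-guide"]
vector_ranking = ["ready-for-users", "jupyter-to-prod", "deploy-docker"]

for doc, score in rrf([bm25_ranking, vector_ranking]):
    print(f"{doc}: {score:.4f}")
# deploy-docker tops the list: rank 1 in BM25 plus rank 3 in the vector list.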


The Interview Questions They’ll Ask

  1. “What are the pros and cons of keyword vs vector search?”
  2. “How does reciprocal rank fusion work?”
  3. “What embedding model would you use and why?”
  4. “How do you index vectors efficiently (HNSW, IVF)?”
  5. “How would you tune the balance between keyword and vector results?”

Learning milestones:

  1. Both search types work independently → You understand each approach
  2. Fusion produces better results than either alone → You understand complementarity
  3. You can tune the hybrid balance → You’re ready for production RAG

Project Comparison Table

# Project Difficulty Time Depth of Understanding Fun Factor
1 Basic Inverted Index Beginner Weekend ★★★★★ ★★★☆☆
2 Text Analyzer Intermediate 1 week ★★★★☆ ★★★☆☆
3 TF-IDF Ranking Intermediate 1 week ★★★★★ ★★★★☆
4 BM25 Scoring Advanced 1-2 weeks ★★★★★ ★★★★☆
5 Positional Index Advanced 1-2 weeks ★★★★☆ ★★★★☆
6 Index Compression Expert 2 weeks ★★★★★ ★★★☆☆
7 Fuzzy Search Expert 2-3 weeks ★★★★★ ★★★★★
8 Trie/FST Expert 2-3 weeks ★★★★★ ★★★★☆
9 Segment Merging Expert 3-4 weeks ★★★★★ ★★★☆☆
10 PostgreSQL FTS Intermediate 1 week ★★★☆☆ ★★★★☆
11 Query Parser Advanced 2 weeks ★★★★☆ ★★★★☆
12 Autocomplete Intermediate 1 week ★★★☆☆ ★★★★★
13 Faceted Search Advanced 2 weeks ★★★★☆ ★★★★☆
14 Distributed Cluster Master 1 month+ ★★★★★ ★★★★★
15 Hybrid Search Expert 2-3 weeks ★★★★★ ★★★★★

Recommendation

For beginners: Start with Project 1 (Basic Inverted Index), then Project 3 (TF-IDF), then Project 10 (PostgreSQL FTS). This gives you the core concepts plus practical database integration.

For intermediate developers: Do Projects 1-4 in order, then jump to Project 7 (Fuzzy Search) or Project 12 (Autocomplete) for visible, fun results.

For advanced/systems engineers: Focus on Projects 6, 8, 9, 14—the low-level data structures and distributed systems that power production search engines.

For ML/AI engineers: Start with Projects 1-4 for foundations, then jump to Project 15 (Hybrid Search) to understand how traditional search integrates with embeddings for RAG.


Final Overall Project: Build a Mini Meilisearch

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: Go, C++
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Full-Stack Search Engine
  • Software or Tool: Complete search engine
  • Main Book: All books from this learning path

What you’ll build: A complete, production-quality search engine with: inverted index with compression, BM25 ranking, typo tolerance, faceted search, real-time indexing with segments, a query parser, and an HTTP API—inspired by Meilisearch.

Why this is the capstone project: This combines EVERYTHING from the previous projects into a single, cohesive system. You’ll make architectural decisions about trade-offs, experience the joy of seeing it all work together, and have a project that demonstrates deep search expertise.

Core challenges you’ll face:

  • Integrating all components → maps to system design and API boundaries
  • Achieving sub-50ms query latency → maps to performance optimization
  • Handling concurrent reads and writes → maps to production systems
  • Building a usable HTTP API → maps to developer experience

What the final product looks like:

$ ./minimeili start --data-dir ./index --port 7700
🔎 MiniMeili starting...
   Index loaded: 0 documents
   HTTP API listening on :7700

$ curl -X POST localhost:7700/indexes/products/documents \
    -H "Content-Type: application/json" \
    -d '[{"id":1,"name":"iPhone 15 Pro","price":999},...]'

{"taskId": 1, "status": "enqueued"}

$ curl "localhost:7700/indexes/products/search?q=iphne&limit=10"

{
  "hits": [
    {
      "id": 1,
      "name": "iPhone 15 Pro",
      "price": 999,
      "_matchesInfo": {
        "name": [{"start": 0, "length": 6}]
      }
    }
  ],
  "query": "iphne",
  "processingTimeMs": 12,
  "estimatedTotalHits": 47,
  "facets": {
    "brand": {"Apple": 23, "Samsung": 12},
    "priceRange": {"$0-500": 15, "$500-1000": 32}
  }
}

Time estimate: 2-3 months Prerequisites: Complete at least projects 1-8


Summary

This learning path covers Full-Text Search through 15 hands-on projects plus one capstone. Here’s the complete list:

# Project Name Main Language Difficulty Time Estimate
1 Basic Inverted Index Rust Beginner Weekend
2 Text Analyzer with Stemming Python Intermediate 1 week
3 TF-IDF Ranking Engine Python Intermediate 1 week
4 BM25 Scoring Implementation Rust Advanced 1-2 weeks
5 Positional Index and Phrase Queries C Advanced 1-2 weeks
6 Index Compression C Expert 2 weeks
7 Fuzzy Search with Levenshtein Automata Rust Expert 2-3 weeks
8 Build a Trie and FST from Scratch Rust Expert 2-3 weeks
9 Segment-Based Index with Merging Rust Expert 3-4 weeks
10 PostgreSQL Full-Text Search SQL/Python Intermediate 1 week
11 Query Parser with Boolean Logic Rust Advanced 2 weeks
12 Autocomplete with Prefix Matching TypeScript Intermediate 1 week
13 Faceted Search and Aggregations Go Advanced 2 weeks
14 Distributed Search Cluster Go Master 1 month+
15 Hybrid Search (Keyword + Vector) Python Expert 2-3 weeks
Mini Meilisearch (Capstone) Rust Master 2-3 months

For beginners: Start with projects #1, #2, #3, #10
For intermediate: Focus on projects #1, #3, #4, #7, #12
For advanced: Deep dive into projects #6, #8, #9, #14
For ML/AI focus: Complete #1-4, then #15

Expected Outcomes

After completing these projects, you will:

  1. Understand inverted indexes deeply—the data structure at the heart of every search engine
  2. Know exactly how ranking algorithms work—TF-IDF, BM25, and why they matter
  3. Be able to read Lucene/Elasticsearch source code—because you’ve built the components yourself
  4. Make informed architecture decisions—PostgreSQL FTS vs Elasticsearch vs custom solution
  5. Implement production-ready search features—fuzzy search, autocomplete, facets, highlighting
  6. Understand distributed search—sharding, replication, consistency trade-offs
  7. Bridge keyword and semantic search—prepare for RAG and modern AI applications

You’ll have built 16 working projects that demonstrate deep understanding of full-text search from first principles to distributed production systems.


Resources

Essential Books

  • “Introduction to Information Retrieval” by Manning, Raghavan, Schütze — Free online
  • “Search Engines: Information Retrieval in Practice” by Croft, Metzler, Strohman — Free PDF
  • “Information Retrieval: Implementing and Evaluating Search Engines” by Büttcher, Clarke, Cormack
  • “Managing Gigabytes” by Witten, Moffat, Bell
  • “Designing Data-Intensive Applications” by Martin Kleppmann

Key Papers

  • Porter, M.F. (1980): “An algorithm for suffix stripping”
  • Robertson, S.E. & Walker, S. (1994): “Some simple effective approximations to the 2-Poisson model” (BM25)
  • Schulz, K.U. & Mihov, S. (2002): “Fast string correction with Levenshtein automata”
  • Lemire, D. et al. (2016): “Roaring bitmaps”

Blogs & Online Resources

  • Mike McCandless: “Using FSTs in Lucene” blog series
  • Lucene and Elasticsearch reference documentation (segments, merge policies, edge n-grams)
  • Meilisearch documentation and engine blog

Happy searching! 🔎