LEARN FULL TEXT SEARCH DEEP DIVE

Learn Full-Text Search: From Zero to Search Engine Architect

Goal: Deeply understand full-text search—from the fundamental data structures (inverted indexes, tries, FSTs) to ranking algorithms (TF-IDF, BM25), text analysis pipelines (tokenization, stemming, lemmatization), and distributed search architectures. You’ll learn not just how to USE search engines like Elasticsearch, Meilisearch, and PostgreSQL full-text search, but HOW they work internally by building your own implementations from scratch.


Why Full-Text Search Matters

Every time you type in Google, search your email, find a product on Amazon, or use Ctrl+F in a document, you’re using full-text search. It’s one of the most fundamental problems in computer science, and the solutions are elegant, mathematically beautiful, and deeply practical.

The Scale of the Problem

Consider this: Google's index spans hundreds of billions of web pages (its crawlers know of trillions more). When you type a query, results come back in a fraction of a second. How is this even possible?

The naive approach—scanning every document for your search terms—would take years. Instead, search engines use a brilliant data structure called the inverted index, combined with sophisticated ranking algorithms, compression techniques, and distributed systems.

Historical Context

Timeline of Search Innovation:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

1945 │ Vannevar Bush proposes "Memex" - conceptual search machine
     │
1960 │ Gerard Salton develops SMART system - first vector space model
     │
1976 │ Robertson & Spärck Jones develop probabilistic retrieval (BM25 roots)
     │
1994 │ Robertson et al. introduce the BM25 ranking function ("Okapi at TREC-3")
     │
1998 │ Google founded - PageRank revolutionizes web search
     │
1999 │ Doug Cutting starts Lucene project
     │
2005 │ Lucene becomes a top-level Apache project
     │
2010 │ Elasticsearch launched (built on Lucene)
     │
2018 │ Meilisearch founded (Rust-based alternative)
     │
2024 │ Vector search + LLMs merge with traditional search (RAG)

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

What Understanding Search Unlocks

  1. Database Internals: Inverted indexes are used everywhere—not just search
  2. Distributed Systems: Search clusters are masterclasses in distributed computing
  3. Algorithm Design: Ranking algorithms balance precision, recall, and performance
  4. Data Structures: Tries, FSTs, skip lists, roaring bitmaps—all in one domain
  5. NLP Fundamentals: Tokenization, stemming, and language processing
  6. Compression: Real-world application of information theory

Core Concept Analysis

The Fundamental Problem

You have documents. Users have queries. How do you find matches FAST?

THE NAIVE APPROACH (Don't do this!)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Query: "quick brown fox"

┌─────────────────────────────────────────────────────────────┐
│  For each document (millions of them):                      │
│    For each word in document:                               │
│      If word matches any query term:                        │
│        Mark as potential match                              │
│                                                             │
│  Time Complexity: O(documents × avg_doc_length × query_len) │
│  For 1 million docs of 1000 words: ~1 BILLION operations    │
└─────────────────────────────────────────────────────────────┘

Result: UNUSABLY SLOW 🐢

THE INVERTED INDEX APPROACH
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

PRE-PROCESSING (done once at index time):
Build a map: term → [list of documents containing term]

┌─────────────────────────────────────────────────────────────┐
│  "quick" → [doc1, doc7, doc42, doc999, ...]                │
│  "brown" → [doc1, doc3, doc7, doc15, ...]                  │
│  "fox"   → [doc1, doc7, doc88, ...]                        │
└─────────────────────────────────────────────────────────────┘

QUERY TIME:
1. Look up "quick" → get list A
2. Look up "brown" → get list B
3. Look up "fox"   → get list C
4. Intersect A ∩ B ∩ C → [doc1, doc7]

Time Complexity: O(query_terms × log(vocabulary) + intersection_cost)
For the same query: thousands of operations instead of a billion

Result: BLAZING FAST 🚀
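
To make the lookup-then-intersect idea concrete, here is a minimal Python sketch (toy documents, naive whitespace tokenization, no ranking; nothing like a production engine):

# Minimal in-memory inverted index (illustrative sketch, not a real engine).
from collections import defaultdict

def build_index(docs):
    """docs: dict of doc_id -> text. Returns term -> sorted list of doc IDs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():          # naive whitespace tokenization
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def and_query(index, terms):
    """Intersect the postings lists of all query terms."""
    postings = [set(index.get(t.lower(), [])) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: "The quick brown fox jumps over the lazy dog",
    2: "The lazy cat sleeps all day",
    3: "Quick foxes are brown and clever",
}
index = build_index(docs)
print(and_query(index, ["quick", "brown", "fox"]))   # -> [1]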

The Inverted Index: Heart of Every Search Engine

FORWARD INDEX vs INVERTED INDEX
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

FORWARD INDEX (how documents are stored):
┌─────────────────────────────────────────────────────────┐
│ Doc1: "The quick brown fox jumps over the lazy dog"    │
│ Doc2: "The lazy cat sleeps all day"                    │
│ Doc3: "Quick foxes are brown and clever"               │
└─────────────────────────────────────────────────────────┘

INVERTED INDEX (how we search):
┌──────────────┬───────────────────────────────────────────┐
│    TERM      │           POSTINGS LIST                   │
├──────────────┼───────────────────────────────────────────┤
│ "brown"      │ → [Doc1:pos3, Doc3:pos4]                  │
│ "cat"        │ → [Doc2:pos3]                             │
│ "clever"     │ → [Doc3:pos6]                             │
│ "dog"        │ → [Doc1:pos9]                             │
│ "fox"        │ → [Doc1:pos4]                             │
│ "foxes"      │ → [Doc3:pos2]                             │
│ "jumps"      │ → [Doc1:pos5]                             │
│ "lazy"       │ → [Doc1:pos8, Doc2:pos2]                  │
│ "over"       │ → [Doc1:pos6]                             │
│ "quick"      │ → [Doc1:pos2, Doc3:pos1]                  │
│ "sleeps"     │ → [Doc2:pos4]                             │
│ "the"        │ → [Doc1:pos1,pos7, Doc2:pos1]             │
└──────────────┴───────────────────────────────────────────┘
         │                        │
         │                        └── Postings: doc IDs + positions
         └── Dictionary: sorted terms (enables binary search)

Anatomy of a Postings List

POSTINGS LIST STRUCTURE
━━━━━━━━━━━━━━━━━━━━━━━━

For term "quick":

┌─────────────────────────────────────────────────────────────────┐
│                         POSTINGS LIST                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐      │
│  │ Doc ID  │───▶│ Doc ID  │───▶│ Doc ID  │───▶│ Doc ID  │      │
│  │   1     │    │   7     │    │   42    │    │  999    │      │
│  ├─────────┤    ├─────────┤    ├─────────┤    ├─────────┤      │
│  │ TF: 2   │    │ TF: 1   │    │ TF: 5   │    │ TF: 1   │      │
│  ├─────────┤    ├─────────┤    ├─────────┤    ├─────────┤      │
│  │Pos:[2,8]│    │Pos:[15] │    │Pos:[1,3,│    │Pos:[99] │      │
│  │         │    │         │    │ 7,12,20]│    │         │      │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘      │
│                                                                 │
│  TF = Term Frequency (how many times term appears in doc)      │
│  Pos = Positions (where in the document the term appears)      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

COMPRESSION: Delta Encoding
━━━━━━━━━━━━━━━━━━━━━━━━━━━

Original Doc IDs:    [1, 7, 42, 999]
Delta Encoded:       [1, 6, 35, 957]  ← Store differences!
                      │  │   │   │
                      │  7-1 42-7 999-42
                      │
                      First value stored as-is

Why? Smaller numbers = fewer bits = massive space savings!
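
In code, the encode/decode pair is tiny (a sketch; a real index would then compress the gaps further with variable-byte or bit-packed encoding):

# Delta (gap) encoding of a sorted postings list -- illustrative sketch.
def delta_encode(doc_ids):
    """[1, 7, 42, 999] -> [1, 6, 35, 957]: store gaps instead of absolute IDs."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(gaps):
    """Reverse the encoding by accumulating the gaps."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

ids = [1, 7, 42, 999]
gaps = delta_encode(ids)          # [1, 6, 35, 957]
assert delta_decode(gaps) == ids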

The Text Analysis Pipeline

Before text can be indexed, it must be processed:

TEXT ANALYSIS PIPELINE
━━━━━━━━━━━━━━━━━━━━━━━

Original Text: "The Quick Brown Foxes are RUNNING!"
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  1. CHARACTER FILTERING                                     │
│     Remove/replace special characters, handle encoding      │
│     Result: "The Quick Brown Foxes are RUNNING"            │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  2. TOKENIZATION                                            │
│     Split text into individual tokens                       │
│     Result: ["The", "Quick", "Brown", "Foxes", "are",      │
│              "RUNNING"]                                     │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  3. LOWERCASING                                             │
│     Normalize case for consistent matching                  │
│     Result: ["the", "quick", "brown", "foxes", "are",      │
│              "running"]                                     │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  4. STOP WORD REMOVAL                                       │
│     Remove common words that don't help search              │
│     Result: ["quick", "brown", "foxes", "running"]         │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  5. STEMMING / LEMMATIZATION                                │
│     Reduce words to root form                               │
│                                                             │
│     STEMMING (Porter Algorithm - fast, approximate):        │
│     "foxes" → "fox", "running" → "run"                     │
│     Result: ["quick", "brown", "fox", "run"]               │
│                                                             │
│     LEMMATIZATION (dictionary lookup - accurate, slower):   │
│     "foxes" → "fox", "running" → "run", "better" → "good"  │
│     Considers part-of-speech and context                    │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
┌─────────────────────────────────────────────────────────────┐
│  6. ADDITIONAL FILTERS (optional)                           │
│     - Synonym expansion: "quick" → ["quick", "fast"]       │
│     - N-gram generation: "fox" → ["fo", "ox", "fox"]       │
│     - Phonetic encoding: "smith" → "S530" (Soundex)        │
└─────────────────────────────────────────────────────────────┘
                           │
                           ▼
              FINAL TOKENS: ["quick", "brown", "fox", "run"]
              (Ready to be added to inverted index)
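
A toy version of the pipeline in Python (the stop list and the crude suffix-stripping "stemmer" below are invented placeholders, not the real Porter algorithm):

# Toy analysis pipeline: character filter -> tokenize -> lowercase -> stop words -> stem.
import re

STOP_WORDS = {"the", "a", "an", "are", "is", "and", "or", "of", "to", "in"}

def crude_stem(token):
    for suffix in ("ing", "es", "s"):           # nothing like real Porter, just the idea
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    text = re.sub(r"[^\w\s]", " ", text)        # 1. character filtering
    tokens = text.split()                       # 2. tokenization
    tokens = [t.lower() for t in tokens]        # 3. lowercasing
    tokens = [t for t in tokens if t not in STOP_WORDS]   # 4. stop word removal
    return [crude_stem(t) for t in tokens]      # 5. stemming (very rough)

print(analyze("The Quick Brown Foxes are RUNNING!"))
# ['quick', 'brown', 'fox', 'runn']  <- crude stripping is imperfect; Porter gives 'run'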

Stemming Algorithms: Porter Stemmer

PORTER STEMMER RULES (Simplified)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

The algorithm applies rules in steps. Example rules:

STEP 1a: Plural forms
  ─────────────────────
  SSES → SS    (caresses → caress)
  IES  → I     (ponies → poni)
  SS   → SS    (caress → caress)  [no change]
  S    → ∅     (cats → cat)

STEP 1b: Past tense / gerund
  ─────────────────────────────
  (m>0) EED → EE   (agreed → agree, where m = the word's "measure")
  (*v*) ED  → ∅    (plastered → plaster, where *v* = stem contains a vowel)
  (*v*) ING → ∅    (motoring → motor)

STEP 2: Derivational suffixes
  ─────────────────────────────
  ATIONAL → ATE  (relational → relate)
  TIONAL  → TION (conditional → condition)
  IZER    → IZE  (digitizer → digitize)
  FULNESS → FUL  (hopefulness → hopeful)

Example walkthrough:
  "generalizations"
      │
      ▼ Step 1a: Remove S
  "generalization"
      │
      ▼ Step 2: IZATION → IZE
  "generalize"
      │
      ▼ Step 4: Remove IZE (if long enough)
  "general"

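Step 1a on its own is a handful of ordered suffix rules; a sketch:

# Porter step 1a only, as a sketch of how the rules are ordered and applied.
def porter_step_1a(word):
    if word.endswith("sses"):
        return word[:-2]          # SSES -> SS   (caresses -> caress)
    if word.endswith("ies"):
        return word[:-2]          # IES  -> I    (ponies -> poni)
    if word.endswith("ss"):
        return word               # SS   -> SS   (caress unchanged)
    if word.endswith("s"):
        return word[:-1]          # S    -> none (cats -> cat)
    return word

for w in ("caresses", "ponies", "caress", "cats"):
    print(w, "->", porter_step_1a(w))
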
Ranking Algorithms: TF-IDF and BM25

Finding documents is only half the battle. You need to RANK them by relevance.

TF-IDF: TERM FREQUENCY × INVERSE DOCUMENT FREQUENCY
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

INTUITION:
┌─────────────────────────────────────────────────────────────────┐
│  A word is important to a document if:                         │
│                                                                 │
│  1. It appears FREQUENTLY in that document (TF ↑)              │
│     "elephant" appears 10 times → probably about elephants     │
│                                                                 │
│  2. It appears RARELY across all documents (IDF ↑)             │
│     "the" appears everywhere → not useful for ranking          │
│     "elephant" is rare → very useful for ranking               │
└─────────────────────────────────────────────────────────────────┘

FORMULA:
                                        Total Documents
TF-IDF(t,d) = TF(t,d) × log( ─────────────────────────────── )
                              Documents containing term t

EXAMPLE:
━━━━━━━━
Collection: 10,000 documents
Document D: contains "elephant" 5 times, "the" 50 times
"elephant" appears in 10 documents total
"the" appears in 9,500 documents total

TF-IDF("elephant", D) = 5 × log(10000/10) = 5 × 3 = 15.0
TF-IDF("the", D)      = 50 × log(10000/9500) = 50 × 0.02 = 1.0

Despite "the" appearing 10× more, "elephant" scores 15× higher!
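
The same arithmetic as a quick runnable check (a sketch; it assumes a base-10 logarithm, which only rescales every score by a constant):

# Reproduce the example numbers above.
import math

def tf_idf(tf, df, n_docs=10_000):
    return tf * math.log10(n_docs / df)

print(tf_idf(tf=5, df=10))                  # "elephant": 5 * log10(1000) = 15.0
print(round(tf_idf(tf=50, df=9_500), 2))    # "the": 50 * log10(10000/9500) ≈ 1.11
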
BM25: THE MODERN STANDARD
━━━━━━━━━━━━━━━━━━━━━━━━━━

BM25 improves on TF-IDF with two key innovations:

1. TERM FREQUENCY SATURATION
   ──────────────────────────
   TF-IDF Problem: If "elephant" appears 100 times, is the doc
   really 10× more relevant than if it appears 10 times?

   BM25 Solution: Diminishing returns via saturation function

                     TF
   Saturated TF = ─────────
                  TF + k₁

   ┌────────────────────────────────────────────────────┐
   │                                                    │
   │  Score │     ............................          │
   │   ↑    │   ..                                      │
   │        │  .                                        │
   │        │ .     ← Asymptotic limit (saturation)    │
   │        │.                                          │
   │        │                                           │
   │        └──────────────────────────────────▶ TF     │
   │                                                    │
   │  After ~5-10 occurrences, additional mentions      │
   │  barely increase the score                         │
   └────────────────────────────────────────────────────┘

2. DOCUMENT LENGTH NORMALIZATION
   ─────────────────────────────
   TF-IDF Problem: Longer documents have more term occurrences
   just by having more words—unfair advantage!

   BM25 Solution: Normalize by document length

                     doc_length
   Length Factor = ──────────────────
                   avg_doc_length

   Short docs get a boost, long docs get penalized


BM25 FULL FORMULA:
━━━━━━━━━━━━━━━━━━

                                   TF(t,d) × (k₁ + 1)
Score(t,d) = IDF(t) × ─────────────────────────────────────────────
                      TF(t,d) + k₁ × (1 - b + b × |d|/avgdl)

Where:
  k₁ = 1.2 to 2.0 (term frequency saturation parameter)
  b  = 0.75 (length normalization parameter, 0=none, 1=full)
  |d| = length of document d
  avgdl = average document length in collection
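
A direct transcription of the formula (sketch; the IDF below is the smoothed variant Lucene uses, but plain log(N/df) also works for experiments):

# BM25 score of a single term in a single document.
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))   # smoothed IDF (Lucene-style)
    norm = 1 - b + b * (doc_len / avg_doc_len)             # document length normalization
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)

# A term appearing 5 times in a 1000-token doc, corpus of 10,000 docs, avgdl 450:
print(bm25_term_score(tf=5, df=100, n_docs=10_000, doc_len=1_000, avg_doc_len=450))  # ≈ 6.9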

Term Dictionary: Finite State Transducers (FST)

How do you look up terms efficiently when your vocabulary has millions of unique terms?

THE PROBLEM: TERM LOOKUP
━━━━━━━━━━━━━━━━━━━━━━━━

You have 10 million unique terms. For each query term, you need
to find its postings list. Options:

1. Hash Table: O(1) lookup, but HUGE memory footprint
   10M terms × 50 bytes avg = 500 MB just for keys!

2. Sorted Array + Binary Search: O(log n), still large
   Need to store every term fully expanded

3. Trie: Shares prefixes, but pointer overhead is massive

4. FST (Finite State Transducer): BEST OF ALL WORLDS
   - Shares prefixes AND suffixes
   - Maps terms to values (postings file offsets)
   - Incredibly compact
   - O(term_length) lookup

FST STRUCTURE
━━━━━━━━━━━━━

Terms to index: cat, car, card, care, careful, carefully

TRIE (shares prefixes only):
─────────────────────────────

     (root)
      └── c
           └── a
                ├── t ∅                         "cat"
                └── r ∅                         "car"
                     ├── d ∅                    "card"
                     └── e ∅                    "care"
                          └── f
                               └── u
                                    └── l ∅     "careful"
                                         └── l
                                              └── y ∅   "carefully"

∅ = a word ends at this node (six words, six separate endings)

FST (shares prefixes AND suffixes):
─────────────────────────────────────

     (S) ──c──▶ (1) ──a──▶ (2)
     (2) ──t──▶ ((F))                                 "cat"
     (2) ──r──▶ ((3))                                 "car"
     ((3)) ──d──▶ ((F))                               "card"
     ((3)) ──e──▶ ((4))                               "care"
     ((4)) ──f──▶ (5) ──u──▶ (6) ──l──▶ ((7))         "careful"
     ((7)) ──l──▶ (8) ──y──▶ ((F))                    "carefully"

(( )) = accepting state (a valid word ends here)

"cat", "card", and "carefully" all terminate in the SAME final state
((F)). A trie needs a separate leaf for every word ending; the FST
merges identical tails of the structure. On real vocabularies this
suffix sharing (thousands of words ending in "ing", "tion", "ly", ...)
is where most of the savings beyond a trie come from.

MEMORY SAVINGS: 38-52% smaller than trie for large vocabularies

FST AS A MAP: term → postings_offset
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Each arc can have an OUTPUT value. Lucene uses this to map
terms to file offsets in the postings file:

     c/0      a/0      t/100
(S) ─────▶ (1) ─────▶ (2) ─────▶ ((3))
                       │
                       │ r/200
                       ▼
                     ((4)) ←── "car" maps to offset 200

                       │ d/50
                       ▼
                     ((5)) ←── "card" maps to offset 200+50=250

Outputs accumulate along the path!
"cat" = 0 + 0 + 100 = 100
"car" = 0 + 0 + 200 = 200
"card" = 0 + 0 + 200 + 50 = 250

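The accumulation can be mimicked with an ordinary dict standing in for the FST's arcs (a sketch of output accumulation only; ARCS, FINAL_STATES and the offsets are invented to mirror the diagram above, and real FST construction and minimization are far more involved):

# state -> {char: (next_state, output)}
ARCS = {
    "S": {"c": (1, 0)},
    1:   {"a": (2, 0)},
    2:   {"t": (3, 100), "r": (4, 200)},
    4:   {"d": (5, 50)},
}
FINAL_STATES = {3, 4, 5}

def lookup(term):
    """Walk the arcs, summing outputs; the sum is the postings-file offset."""
    state, total = "S", 0
    for ch in term:
        arc = ARCS.get(state, {}).get(ch)
        if arc is None:
            return None                    # term not in the dictionary
        state, out = arc
        total += out
    return total if state in FINAL_STATES else None

print(lookup("cat"), lookup("car"), lookup("card"), lookup("cab"))
# 100 200 250 None
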
Skip Lists: Fast Postings Intersection

When you search for “quick AND brown”, you need to intersect two postings lists.

POSTINGS LIST INTERSECTION
━━━━━━━━━━━━━━━━━━━━━━━━━━━

Query: "quick" AND "brown"

Postings for "quick": [1, 5, 12, 45, 67, 89, 102, 156, 234, 567]
Postings for "brown": [3, 5, 22, 45, 89, 234, 345, 678]

NAIVE APPROACH: Compare every pair
Time: O(n × m) where n, m are list lengths

MERGE APPROACH: Walk both sorted lists together
Time: O(n + m)

  quick: [1, 5, 12, 45, 67, 89, 102, 156, 234, 567]
          ↑
  brown: [3, 5, 22, 45, 89, 234, 345, 678]
          ↑

  1 < 3: advance quick pointer
  5 = 5: MATCH! Add to result, advance both
  12 < 22: advance quick pointer
  ...

  Result: [5, 45, 89, 234]

SKIP LISTS: DO EVEN BETTER
━━━━━━━━━━━━━━━━━━━━━━━━━━━

Add "skip pointers" to jump ahead in long lists:

Postings for "the" (very common word):
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│ 1 │ 2 │ 3 │ 5 │ 7 │ 8 │12 │15 │18 │22 │25 │...│
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
  │           │           │           │
  └───────────┴───────────┴───────────┘
       Skip pointers (every √n elements)

When intersecting with a rare term's list [50, 100, 150]:
- Looking for 50
- At position 1, skip pointer says "jump to 12"
- 12 < 50, skip says "jump to 25"
- 25 < 50, skip says "jump to ..."
- Much faster than linear scan!

Optimal skip distance: √n for list of length n
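
A sketch of the intersection, using the "quick"/"brown" postings from above: the plain two-pointer merge plus a √n jump when it is safe (real engines store explicit skip pointers inside the compressed postings rather than recomputing them like this):

# Sorted-list intersection with a simple sqrt(n) skip.
import math

def intersect(a, b):
    i = j = 0
    out = []
    skip = max(1, int(math.sqrt(len(a))))          # skip distance for list a
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            # jump ahead in list a while everything skipped stays below b[j]
            while i + skip < len(a) and a[i + skip] < b[j]:
                i += skip
            i += 1
        else:
            j += 1
    return out

quick = [1, 5, 12, 45, 67, 89, 102, 156, 234, 567]
brown = [3, 5, 22, 45, 89, 234, 345, 678]
print(intersect(quick, brown))    # [5, 45, 89, 234]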

Compression: Roaring Bitmaps

Modern search engines use sophisticated compression for postings:

ROARING BITMAPS
━━━━━━━━━━━━━━━

Problem: Storing doc IDs [1, 2, 3, 1000000, 1000001, 1000002]

Naive: Store each as 32-bit int = 24 bytes
Bitmap: 1 million bits = 125 KB (wasteful!)

ROARING: Hybrid approach
─────────────────────────

Divide 32-bit space into 65,536 "containers" of 65,536 values each:
- Container 0: values 0-65535
- Container 1: values 65536-131071
- Container 15: values 983040-1048575
- ...

Each container uses the best representation:

┌─────────────────────────────────────────────────────────────┐
│  If container has < 4096 values:                           │
│    Use ARRAY CONTAINER (sorted list of 16-bit offsets)     │
│                                                             │
│  If container has >= 4096 values:                          │
│    Use BITMAP CONTAINER (8KB bitmap)                       │
│                                                             │
│  If container has runs (1,2,3,4,5):                        │
│    Use RUN CONTAINER (store start + length)                │
└─────────────────────────────────────────────────────────────┘

Our example [1, 2, 3, 1000000, 1000001, 1000002]:

Container 0 (values 0-65535):
  Array: [1, 2, 3] → 6 bytes

Container 15 (values 983040-1048575):
  Array: [16960, 16961, 16962] → 6 bytes
  (These are 1000000-983040, etc.)

Total: ~12 bytes + overhead ≈ 20 bytes
vs 125 KB for naive bitmap!
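
The split is plain bit arithmetic on the 32-bit doc ID (a sketch that only builds array containers and ignores the bitmap and run cases):

# How a doc ID is split into (container key, low 16 bits) -- the core of Roaring.
from collections import defaultdict

def to_containers(doc_ids):
    containers = defaultdict(list)
    for v in sorted(doc_ids):
        high, low = v >> 16, v & 0xFFFF        # container key, 16-bit offset
        containers[high].append(low)
    return dict(containers)

print(to_containers([1, 2, 3, 1_000_000, 1_000_001, 1_000_002]))
# {0: [1, 2, 3], 15: [16960, 16961, 16962]}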

Distributed Search Architecture

ELASTICSEARCH / SOLRCLOUD ARCHITECTURE
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

                         ┌─────────────┐
                         │   Client    │
                         └──────┬──────┘
                                │
                                ▼
                    ┌───────────────────────┐
                    │   Coordinating Node   │
                    │  (Routes & Aggregates)│
                    └───────────┬───────────┘
                                │
           ┌────────────────────┼────────────────────┐
           │                    │                    │
           ▼                    ▼                    ▼
    ┌─────────────┐      ┌─────────────┐      ┌─────────────┐
    │   Node 1    │      │   Node 2    │      │   Node 3    │
    │             │      │             │      │             │
    │ ┌─────────┐ │      │ ┌─────────┐ │      │ ┌─────────┐ │
    │ │Shard 0P │ │      │ │Shard 1P │ │      │ │Shard 2P │ │
    │ │(primary)│ │      │ │(primary)│ │      │ │(primary)│ │
    │ └─────────┘ │      │ └─────────┘ │      │ └─────────┘ │
    │ ┌─────────┐ │      │ ┌─────────┐ │      │ ┌─────────┐ │
    │ │Shard 1R │ │      │ │Shard 2R │ │      │ │Shard 0R │ │
    │ │(replica)│ │      │ │(replica)│ │      │ │(replica)│ │
    │ └─────────┘ │      │ └─────────┘ │      │ └─────────┘ │
    └─────────────┘      └─────────────┘      └─────────────┘

P = Primary Shard (accepts writes)
R = Replica Shard (serves reads, provides redundancy)

SEARCH FLOW:
1. Query arrives at coordinating node
2. Coordinator sends the query to one copy (primary or replica) of each shard
3. Each shard executes query locally, returns top-N results
4. Coordinator merges results, re-ranks, returns final top-N


INDEXING FLOW:
                                ┌─────────────────────────┐
Document ───▶ hash(doc_id) ───▶│ shard = hash % n_shards │
                                └───────────┬─────────────┘
                                            │
                                            ▼
                              Route to correct primary shard
                                            │
                                            ▼
                              Primary indexes, then replicates
                              to all replica shards
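
A miniature of the routing and scatter-gather flow (a sketch: dicts stand in for shards, and raw term frequency stands in for a real relevance score):

N_SHARDS = 3
shards = [dict() for _ in range(N_SHARDS)]        # each shard: doc_id -> text

def index_doc(doc_id, text):
    # hash-based routing to one primary shard
    # (Python's hash() is randomized per process; real engines use a stable hash)
    shards[hash(doc_id) % N_SHARDS][doc_id] = text

def shard_search(shard, term, top_n):
    scored = [(text.lower().split().count(term), doc_id) for doc_id, text in shard.items()]
    return sorted((s for s in scored if s[0] > 0), reverse=True)[:top_n]

def search(term, top_n=2):
    hits = []
    for shard in shards:                          # scatter: every shard runs the query
        hits.extend(shard_search(shard, term, top_n))
    return sorted(hits, reverse=True)[:top_n]     # gather: merge local top-N globally

index_doc("a", "search engines use inverted indexes")
index_doc("b", "search is fast")
index_doc("c", "ranking and search and search quality")
print(search("search"))                           # [(2, 'c'), (1, 'b')]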

Concept Summary Table

Concept Cluster | What You Need to Internalize
Inverted Index | The fundamental data structure: maps terms → documents. Everything else is optimization on top of this.
Text Analysis | Raw text must be tokenized, normalized, stemmed. The same pipeline runs at index AND query time.
Postings Lists | Sorted lists of doc IDs with optional positions and term frequencies. Compression is critical.
Term Dictionary | How to look up millions of terms fast: tries, FSTs, or sorted arrays with binary search.
Ranking (TF-IDF/BM25) | Term frequency matters, but with diminishing returns. Rare terms are more valuable than common ones.
Boolean Operations | AND = intersection, OR = union. Skip lists accelerate intersection of long lists.
Compression | Delta encoding, variable-byte encoding, roaring bitmaps. Smaller = faster (less I/O).
Segments & Merging | Indexes are immutable segments. Updates create new segments; background merge consolidates.
Distributed Search | Sharding partitions data; replication provides fault tolerance. Query scatter-gather pattern.
Fuzzy Matching | Levenshtein distance, n-grams, and phonetic encoding handle typos and variations.

Deep Dive Reading by Concept

This section maps each concept to specific book chapters for deeper understanding. Read these before or alongside the projects.

Foundational Concepts

Concept | Book & Chapter
Inverted Index basics | “Introduction to Information Retrieval” by Manning, Raghavan, Schütze — Ch. 1-2 (free online at Stanford NLP)
Postings compression | “Introduction to Information Retrieval” — Ch. 5: “Index Compression”
Building an index | “Introduction to Information Retrieval” — Ch. 4: “Index Construction”
Text processing | “Introduction to Information Retrieval” — Ch. 2.2: “Tokenization” and Ch. 2.3: “Normalization”

Ranking and Scoring

Concept | Book & Chapter
TF-IDF fundamentals | “Introduction to Information Retrieval” — Ch. 6: “Scoring, term weighting and the vector space model”
BM25 in depth | “Search Engines: Information Retrieval in Practice” by Croft, Metzler, Strohman — Ch. 7 (free PDF available)
Probabilistic retrieval | “Information Retrieval: Implementing and Evaluating Search Engines” by Büttcher, Clarke, Cormack — Ch. 7

Data Structures

Concept | Book & Chapter
Tries and prefix trees | “Algorithms, Fourth Edition” by Sedgewick & Wayne — Ch. 5.2: “Tries”
Skip lists | “Algorithms, Fourth Edition” by Sedgewick & Wayne — Ch. 3.5, or see the original Pugh paper
B-trees (for disk-based indexes) | “Introduction to Algorithms” (CLRS) — Ch. 18: “B-Trees”
Finite State Transducers | Mike McCandless’s blog series: “Using Finite State Transducers in Lucene”

Compression

Concept | Book & Chapter
Variable-byte encoding | “Introduction to Information Retrieval” — Ch. 5.3: “Variable byte codes”
Delta/Gap encoding | “Managing Gigabytes” by Witten, Moffat, Bell — Ch. 3
Roaring Bitmaps | Daniel Lemire’s paper: “Roaring Bitmaps: Implementation of an Optimized Software Library”

Distributed Systems

Concept | Book & Chapter
Distributed indexing | “Introduction to Information Retrieval” — Ch. 4.4: “Distributed indexing”
Sharding strategies | “Designing Data-Intensive Applications” by Martin Kleppmann — Ch. 6: “Partitioning”
Replication | “Designing Data-Intensive Applications” — Ch. 5: “Replication”

Essential Reading Order

Phase 1: Foundations (Week 1-2)

  1. “Introduction to Information Retrieval” Ch. 1-2 (inverted index basics)
  2. “Introduction to Information Retrieval” Ch. 5 (compression)
  3. “Introduction to Information Retrieval” Ch. 6 (TF-IDF, vector space model)

Phase 2: Implementation Details (Week 3-4)

  1. “Search Engines: Information Retrieval in Practice” Ch. 4-5 (indexing)
  2. “Information Retrieval: Implementing and Evaluating Search Engines” Ch. 5-6 (query processing)
  3. “Algorithms” Ch. 5.2 (tries)

Phase 3: Production Systems (Week 5-6)

  1. “Designing Data-Intensive Applications” Ch. 5-6 (replication, sharding)
  2. Elasticsearch or Solr documentation (practical application)
  3. Lucene source code exploration

Projects

The following projects will take you from understanding the basics to being able to build production-grade search systems.


Project 1: Basic Inverted Index from Scratch

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: C, Go, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Data Structures / Information Retrieval
  • Software or Tool: Custom search library
  • Main Book: “Introduction to Information Retrieval” by Manning, Raghavan, Schütze

What you’ll build: A command-line tool that indexes a directory of text files and allows boolean queries (AND, OR, NOT) to find matching documents.

Why it teaches full-text search: This is the absolute foundation. You cannot understand Elasticsearch or Lucene without first building the core data structure yourself. You’ll feel the pain of naive approaches and appreciate why the inverted index exists.

Core challenges you’ll face:

  • Designing the index data structure → maps to understanding hash maps vs sorted arrays for term lookup
  • Tokenizing text correctly → maps to handling punctuation, case, whitespace edge cases
  • Implementing boolean query logic → maps to set operations on postings lists
  • Persisting the index to disk → maps to serialization and file format design

Key Concepts:

  • Inverted Index Structure: “Introduction to Information Retrieval” Ch. 1 - Manning et al.
  • Tokenization: “Introduction to Information Retrieval” Ch. 2.2
  • Boolean Retrieval: “Introduction to Information Retrieval” Ch. 1.3

Difficulty: Beginner | Time estimate: Weekend | Prerequisites: Basic programming, understanding of hash maps/dictionaries


Real World Outcome

You’ll have a working search engine that can index thousands of files and answer queries instantly. Running it will feel like magic the first time:

Example Output:

$ ./minisearch index ~/Documents/books/
Indexing 1,247 files...
Found 89,432 unique terms
Index written to ./search.idx (2.3 MB)
Indexing complete in 3.2 seconds

$ ./minisearch query "python AND machine AND learning"
Found 23 matching documents:

  1. ~/Documents/books/ml_intro.txt
     "...using Python for machine learning applications..."

  2. ~/Documents/books/deep_learning_guide.txt
     "...Python libraries for machine learning include..."

  3. ~/Documents/books/data_science.txt
     "...machine learning pipelines in Python..."

Query executed in 0.003 seconds

$ ./minisearch query "rust OR golang"
Found 156 matching documents...

$ ./minisearch query "javascript NOT nodejs"
Found 42 matching documents...

The Core Question You’re Answering

“Why is searching through millions of documents nearly instant, while grep takes forever?”

Before you write any code, sit with this question. The naive approach (scan every document) works but is painfully slow. The inverted index flips the problem: instead of “for each document, check all terms”, it becomes “for each query term, look up matching documents directly.”


Concepts You Must Understand First

Stop and research these before coding:

  1. Hash Tables vs Sorted Arrays
    • What is the lookup time for each?
    • When would you prefer one over the other?
    • How do you handle collisions in a hash table?
    • Book Reference: “Algorithms” Ch. 3.4 - Sedgewick & Wayne
  2. Set Operations
    • How do you compute intersection of two sorted lists?
    • What’s the time complexity of union vs intersection?
    • Why do postings lists need to be sorted?
    • Book Reference: “Introduction to Information Retrieval” Ch. 1.3
  3. File I/O and Serialization
    • How do you efficiently read large files?
    • What format should your index use on disk?
    • Book Reference: Language-specific documentation

Questions to Guide Your Design

Before implementing, think through these:

  1. Data Structure Choice
    • Should the term dictionary be a HashMap or a BTreeMap?
    • How will you store document IDs in postings lists?
    • What metadata do you need per document?
  2. Tokenization Decisions
    • How do you handle “don’t” - one token or two?
    • Should numbers be indexed?
    • What about email addresses and URLs?
  3. Query Parsing
    • How will you parse “python AND (web OR api)”?
    • Do you need operator precedence?
    • How do you handle malformed queries?

Thinking Exercise

Trace an Index Operation

Before coding, manually trace indexing these two documents:

Doc 1: "The quick brown fox"
Doc 2: "The lazy brown dog"

Draw the inverted index that results. For each term, list the document IDs.

Questions while tracing:

  • Which terms appear in both documents?
  • If you search for “brown AND quick”, which document(s) match?
  • If you search for “brown AND lazy”, which document(s) match?
  • What happens if you search for “cat”?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What is an inverted index and why is it faster than linear scan?”
  2. “How would you implement AND queries on two postings lists?”
  3. “What’s the time complexity of intersecting two sorted lists?”
  4. “How does your index handle documents being added or deleted?”
  5. “What’s the difference between a term dictionary and a postings list?”

Hints in Layers

Hint 1: Starting Point Start with in-memory only. Don’t worry about disk persistence until boolean queries work perfectly in memory.

Hint 2: Data Structure Use HashMap<String, Vec<u32>> where the key is a term and the value is a sorted list of document IDs.

Hint 3: Intersection Algorithm To intersect two sorted lists, use two pointers starting at the beginning. Advance the pointer pointing to the smaller value. When they’re equal, you found a match.

Hint 4: Testing Create a small test corpus of 5-10 documents with known content. Write queries where you know the expected results.


Books That Will Help

Topic | Book | Chapter
Inverted index fundamentals | “Introduction to Information Retrieval” | Ch. 1-2
Hash tables | “Algorithms” by Sedgewick | Ch. 3.4
File I/O in Rust | “The Rust Programming Language” | Ch. 12

Learning milestones:

  1. Index builds correctly → You understand the inverted index structure
  2. Boolean queries return correct results → You understand set operations
  3. Index persists to disk and reloads → You understand serialization

Project 2: Text Analyzer with Stemming and Stop Words

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Rust, Go, Java
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: NLP / Text Processing
  • Software or Tool: Text processing library
  • Main Book: “Introduction to Information Retrieval” by Manning, Raghavan, Schütze

What you’ll build: A text analysis pipeline that tokenizes text, removes stop words, and applies stemming—then visualize how it transforms documents and improves search recall.

Why it teaches full-text search: Search quality depends heavily on text normalization. Without stemming, a search for “running” won’t find documents containing “run” or “runs”. This project makes you understand why Elasticsearch has analyzers.

Core challenges you’ll face:

  • Implementing the Porter Stemmer → maps to understanding rule-based string transformations
  • Building a configurable stop word list → maps to balancing precision vs recall
  • Handling edge cases in tokenization → maps to real-world text messiness
  • Measuring the impact on search quality → maps to understanding why we do this

Key Concepts:

  • Tokenization: “Introduction to Information Retrieval” Ch. 2.2
  • Stemming Algorithms: “Introduction to Information Retrieval” Ch. 2.3
  • Stop Words: “Introduction to Information Retrieval” Ch. 2.2.2

Difficulty: Intermediate | Time estimate: 1 week | Prerequisites: Project 1 (Basic Inverted Index)


Real World Outcome

You’ll see exactly how text analysis transforms raw text, and measure the impact on search:

Example Output:

$ ./analyzer analyze "The cats are running quickly through the gardens!"

=== TEXT ANALYSIS PIPELINE ===

Original: "The cats are running quickly through the gardens!"

Step 1 - Tokenization:
  ["The", "cats", "are", "running", "quickly", "through", "the", "gardens", "!"]

Step 2 - Lowercasing:
  ["the", "cats", "are", "running", "quickly", "through", "the", "gardens", "!"]

Step 3 - Punctuation Removal:
  ["the", "cats", "are", "running", "quickly", "through", "the", "gardens"]

Step 4 - Stop Word Removal:
  Removed: ["the", "are", "through", "the"]
  Remaining: ["cats", "running", "quickly", "gardens"]

Step 5 - Stemming (Porter):
  "cats" → "cat"
  "running" → "run"
  "quickly" → "quick"
  "gardens" → "garden"

Final Tokens: ["cat", "run", "quick", "garden"]

$ ./analyzer compare-search "run" --with-stemming --without-stemming

=== SEARCH COMPARISON ===

Query: "run"

WITHOUT Stemming:
  Matches: 12 documents
  - doc1: "...to run the program..."
  - doc2: "...we run tests..."

WITH Stemming:
  Matches: 47 documents (291% more!)
  - doc1: "...to run the program..."
  - doc2: "...we run tests..."
  - doc3: "...he was running late..."    ← NEW MATCH
  - doc4: "...the runner crossed..."     ← NEW MATCH
  - doc5: "...runs every morning..."     ← NEW MATCH

The Core Question You’re Answering

“Why does searching for ‘run’ also find documents containing ‘running’, ‘runs’, and ‘runner’?”

The answer is stemming—reducing words to their root form so that morphological variants match. But be careful: over-aggressive stemming can hurt precision (e.g., “universal” and “university” both stem to “univers”).


Concepts You Must Understand First

Stop and research these before coding:

  1. The Porter Stemmer Algorithm
    • What are the five steps of the algorithm?
    • What is a “measure” of a word?
    • Why does suffix order matter?
    • Book Reference: Original Porter paper (1980) or “Introduction to IR” Ch. 2.3
  2. Stop Words Trade-offs
    • Why are stop words removed?
    • What happens if you search for “to be or not to be”?
    • When should you NOT remove stop words?
    • Book Reference: “Introduction to Information Retrieval” Ch. 2.2.2
  3. Unicode and Text Encoding
    • How do you handle accented characters?
    • What about CJK (Chinese, Japanese, Korean) text?
    • Book Reference: “Fluent Python” Ch. 4 (for Python) or equivalent

Questions to Guide Your Design

Before implementing, think through these:

  1. Pipeline Architecture
    • Should each step be a separate function that chains?
    • How do you make the pipeline configurable?
    • Should you preserve the original token for debugging?
  2. Stemmer Implementation
    • Should you implement Porter from scratch or use a library?
    • How do you test correctness?
    • What about words that shouldn’t be stemmed?
  3. Language Support
    • Is your tokenizer English-only?
    • How would you handle German compound words?
    • What about right-to-left languages?

Thinking Exercise

Manual Stemming Practice

Apply the Porter Stemmer rules to these words (trace each step):

"generalizations"
"relational"
"conditional"
"engineering"
"adjustable"

Questions while tracing:

  • Which suffix is removed first?
  • Does the word meet the “measure” requirement for each rule?
  • What is the final stem for each word?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What’s the difference between stemming and lemmatization?”
  2. “Why might you choose not to use stemming?”
  3. “How do stop words affect search precision and recall?”
  4. “What problems does aggressive tokenization cause?”
  5. “How would you handle multilingual documents?”

Hints in Layers

Hint 1: Start Simple Implement tokenization first with just whitespace splitting. Add complexity incrementally.

Hint 2: Use Reference Data The NLTK library has stop word lists and a Porter stemmer implementation you can use to verify your results.

Hint 3: Porter Stemmer Don’t implement Porter from scratch on your first try. Use a library first, then implement your own and compare outputs.

Hint 4: Measure Impact Build a test corpus, run queries with and without each pipeline step, and measure precision/recall changes.


Books That Will Help

Topic | Book | Chapter
Text normalization | “Introduction to Information Retrieval” | Ch. 2
Porter Stemmer | Original Porter paper (1980) | Full paper
NLP fundamentals | “Speech and Language Processing” by Jurafsky | Ch. 2

Learning milestones:

  1. Tokenizer handles edge cases correctly → You understand text segmentation
  2. Stemmer produces correct outputs → You understand morphological analysis
  3. Search recall improves measurably → You understand why analysis matters

Project 3: TF-IDF Ranking Engine

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Rust, Go, Julia
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Information Retrieval / Linear Algebra
  • Software or Tool: Search ranking library
  • Main Book: “Introduction to Information Retrieval” by Manning, Raghavan, Schütze

What you’ll build: A search engine that ranks results by relevance using TF-IDF scoring, showing why some documents match better than others.

Why it teaches full-text search: Boolean search gives you matching documents, but in what ORDER? TF-IDF is the foundation of relevance ranking—understanding it unlocks understanding of BM25, which powers Elasticsearch and most modern search engines.

Core challenges you’ll face:

  • Computing term frequencies efficiently → maps to understanding index augmentation
  • Calculating IDF across the corpus → maps to global vs local statistics
  • Ranking multi-term queries → maps to vector space model fundamentals
  • Handling the “IDF explosion” for rare terms → maps to practical scoring issues

Key Concepts:

  • TF-IDF Formula: “Introduction to Information Retrieval” Ch. 6.2
  • Vector Space Model: “Introduction to Information Retrieval” Ch. 6.3
  • Cosine Similarity: “Introduction to Information Retrieval” Ch. 6.3.3

Difficulty: Intermediate | Time estimate: 1 week | Prerequisites: Project 1 (Basic Inverted Index), basic linear algebra


Real World Outcome

You’ll see exactly WHY one document ranks higher than another, with full scoring breakdowns:

Example Output:

$ ./tfidf-search query "python machine learning tutorial"

=== QUERY: "python machine learning tutorial" ===

IDF Values (across 10,000 documents, log base 10):
  "python"   → IDF: 1.30 (appears in 500 docs)
  "machine"  → IDF: 2.46 (appears in 35 docs)
  "learning" → IDF: 1.89 (appears in 130 docs)
  "tutorial" → IDF: 3.10 (appears in 8 docs)   ← Most distinctive!

Top 5 Results:

#1. python_ml_tutorial.txt (Score: 43.34)
    ├── TF("python")=8   × IDF(1.30) = 10.40
    ├── TF("machine")=5  × IDF(2.46) = 12.30
    ├── TF("learning")=6 × IDF(1.89) = 11.34
    └── TF("tutorial")=3 × IDF(3.10) = 9.30

#2. ml_basics.txt (Score: 37.97)
    ├── TF("python")=2   × IDF(1.30) = 2.60
    ├── TF("machine")=9  × IDF(2.46) = 22.14  ← High TF!
    ├── TF("learning")=7 × IDF(1.89) = 13.23
    └── TF("tutorial")=0 × IDF(3.10) = 0.00   ← Missing term

#3. python_intro.txt (Score: 9.87)
    ...

Note: "tutorial" has highest IDF because it's rare in the corpus.
      Documents mentioning "tutorial" get a significant boost!

The Core Question You’re Answering

“If two documents both contain my search term, which one is MORE relevant?”

TF-IDF answers this by combining two intuitions: (1) the more times a term appears in a document, the more relevant it is, and (2) the rarer a term is across all documents, the more distinctive and valuable it is for ranking.


Concepts You Must Understand First

Stop and research these before coding:

  1. Logarithms in IDF
    • Why do we use log in the IDF formula?
    • What happens if we don’t use log?
    • What’s the base of the logarithm?
    • Book Reference: “Introduction to Information Retrieval” Ch. 6.2.1
  2. Vector Space Model
    • How is a document represented as a vector?
    • What is cosine similarity and why use it?
    • Why is Euclidean distance not ideal for documents?
    • Book Reference: “Introduction to Information Retrieval” Ch. 6.3
  3. Term Frequency Variants
    • What is raw TF vs log TF?
    • What is augmented TF?
    • When should you use each?
    • Book Reference: “Introduction to Information Retrieval” Ch. 6.4

Questions to Guide Your Design

Before implementing, think through these:

  1. Storage Decisions
    • Where do you store term frequencies?
    • Do you precompute IDF or calculate on the fly?
    • How do you handle documents of very different lengths?
  2. Multi-term Queries
    • How do you combine scores for multiple query terms?
    • Should all terms be required or just some?
    • How do you handle query terms not in the index?
  3. Efficiency
    • Can you score documents incrementally?
    • How do you avoid scoring every document in the corpus?
    • What’s your top-K extraction strategy?

Thinking Exercise

Manual TF-IDF Calculation

Given this tiny corpus:

Doc1: "machine learning machine learning algorithms"
Doc2: "deep learning neural networks"
Doc3: "machine learning for beginners"

Calculate the TF-IDF score for the query “machine learning” for each document.

Questions while calculating:

  • What is TF(“machine”) for each document?
  • What is IDF(“machine”) and IDF(“learning”)?
  • Which document scores highest and why?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Explain TF-IDF. Why the log in IDF?”
  2. “What are the limitations of TF-IDF?”
  3. “How does document length affect TF-IDF scores?”
  4. “What is the vector space model?”
  5. “How would you efficiently find the top-10 results from a million documents?”

Hints in Layers

Hint 1: Precompute What You Can IDF values don’t change unless the corpus changes. Compute them once and store them.

Hint 2: Augment Your Index Store term frequencies in your postings list alongside document IDs: (doc_id, term_frequency).

Hint 3: Top-K Efficiency Use a min-heap of size K. Only keep the K highest-scoring documents as you scan.

Hint 4: Score Decomposition For debugging, output the score contribution of each term separately.
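
The top-K idea from Hint 3 as a sketch (the scores and doc IDs below are invented):

# Top-K selection with a fixed-size min-heap.
import heapq

def top_k(scored_docs, k):
    """scored_docs: iterable of (score, doc_id). Keeps only the k best seen so far."""
    heap = []
    for score, doc_id in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:               # better than the current worst kept
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)          # best first

print(top_k([(0.4, "a"), (2.1, "b"), (1.3, "c"), (0.9, "d")], k=2))
# [(2.1, 'b'), (1.3, 'c')]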


Books That Will Help

Topic | Book | Chapter
TF-IDF fundamentals | “Introduction to Information Retrieval” | Ch. 6
Vector space model | “Introduction to Information Retrieval” | Ch. 6.3
Math foundations | “Linear Algebra Done Right” by Axler | Ch. 1-3

Learning milestones:

  1. Scores match manual calculation → You understand the TF-IDF formula
  2. Rankings feel intuitive → You understand the intuition behind TF-IDF
  3. You can explain why doc A ranks above doc B → You truly get it

Project 4: BM25 Scoring Implementation

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: Go, Python, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Information Retrieval / Probabilistic Models
  • Software or Tool: Production search library
  • Main Book: “Search Engines: Information Retrieval in Practice” by Croft, Metzler, Strohman

What you’ll build: A BM25 scoring engine that demonstrates the improvements over TF-IDF—term frequency saturation and document length normalization—with tunable parameters.

Why it teaches full-text search: BM25 is what Elasticsearch, Lucene, and most modern search engines actually use. Understanding the k1 and b parameters gives you power to tune search quality in production systems.

Core challenges you’ll face:

  • Implementing term frequency saturation → maps to understanding asymptotic behavior
  • Computing average document length → maps to corpus statistics management
  • Tuning k1 and b parameters → maps to understanding their effects on ranking
  • Comparing BM25 vs TF-IDF empirically → maps to evaluation methodology

Key Concepts:

  • BM25 Formula: “Search Engines: Information Retrieval in Practice” Ch. 7.1
  • Probabilistic Retrieval: “Introduction to Information Retrieval” Ch. 11
  • Parameter Tuning: Elastic blog: “Practical BM25”

Difficulty: Advanced | Time estimate: 1-2 weeks | Prerequisites: Project 3 (TF-IDF Ranking Engine)


Real World Outcome

You’ll see exactly how BM25 differs from TF-IDF and be able to tune parameters:

Example Output:

$ ./bm25-search query "python programming" --show-comparison

=== BM25 vs TF-IDF Comparison ===

Corpus Stats:
  Total documents: 10,000
  Average doc length: 450 tokens

BM25 Parameters: k1=1.2, b=0.75

Query: "python programming"

Document: python_guide.txt
  Length: 2,000 tokens (4.4x average - LONG document)
  TF("python"): 50 occurrences, IDF("python"): 2.30

  TF-IDF Score: 115.0
    └── Problem: Long doc has high TF, unfair advantage!

  BM25 Score: 4.66
    ├── Length penalty: 1 - 0.75 + 0.75 × (2000/450) = 3.58
    └── 2.30 × (50 × 2.2) / (50 + 1.2 × 3.58) = 4.66   ← TF is saturated

Document: python_tips.txt
  Length: 200 tokens (0.44x average - SHORT document)
  TF("python"): 10 occurrences, IDF("python"): 2.30

  TF-IDF Score: 23.0
    └── Lower because fewer occurrences

  BM25 Score: 4.73
    ├── Length BOOST: 1 - 0.75 + 0.75 × (200/450) = 0.58
    └── 2.30 × (10 × 2.2) / (10 + 1.2 × 0.58) = 4.73

Result: BM25 ranks the shorter, more focused document higher!

$ ./bm25-search tune --corpus ./test-data --queries ./test-queries

=== PARAMETER TUNING ===

Testing k1 values: [0.5, 1.0, 1.2, 1.5, 2.0]
Testing b values: [0.0, 0.25, 0.5, 0.75, 1.0]

Best parameters found:
  k1 = 1.2 (Mean Average Precision: 0.723)
  b  = 0.75 (Mean Average Precision: 0.723)

Parameter sensitivity analysis:
  k1: Low sensitivity (±0.02 MAP change per 0.5 k1 change)
  b:  High sensitivity (±0.08 MAP change per 0.25 b change)

The Core Question You’re Answering

“Why does Elasticsearch use BM25 instead of TF-IDF, and what do those k1 and b parameters actually control?”

BM25 fixes two TF-IDF problems: (1) TF grows linearly forever, but a word appearing 100 times isn’t 10x more relevant than appearing 10 times, and (2) longer documents naturally have higher TF, giving them an unfair advantage.


Concepts You Must Understand First

Stop and research these before coding:

  1. Term Frequency Saturation
    • What is the saturation curve for TF/(TF + k1)?
    • What happens as k1 → 0 or k1 → ∞?
    • How does k1 affect short vs long queries?
    • Book Reference: “Practical BM25” blog series by Elastic
  2. Length Normalization
    • Why is document length normalization important?
    • What does b=0 mean? What about b=1?
    • When would you want less length normalization?
    • Book Reference: “Search Engines: Information Retrieval in Practice” Ch. 7
  3. Probabilistic Retrieval Foundations
    • Where does BM25 come from theoretically?
    • What is the Robertson-Spärck Jones framework?
    • Book Reference: “Introduction to Information Retrieval” Ch. 11

Questions to Guide Your Design

Before implementing, think through these:

  1. Parameter Storage
    • Are k1 and b per-field or global?
    • How do you handle different fields with different characteristics?
    • Should parameters be tunable at query time?
  2. Corpus Statistics
    • Where do you store average document length?
    • What happens when you add/remove documents?
    • Do you recompute on every update?
  3. Evaluation
    • How do you know BM25 is “better” than TF-IDF?
    • What test queries and relevance judgments do you need?
    • How do you compute Mean Average Precision?

Thinking Exercise

Parameter Effect Analysis

Given these BM25 parameters and document:

  • k1 = 1.5
  • b = 0.75
  • avgdl = 500
  • Document length = 1000 (2x average)
  • TF = 10

Calculate the BM25 term score step by step.

Now recalculate with:

  • Same document but length = 250 (0.5x average)

Questions while calculating:

  • How much does length normalization change the score?
  • What if b = 0 (no length normalization)?
  • At what TF does saturation become significant?
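
A tiny calculator to check your arithmetic afterwards (a sketch; it computes only the TF part of the score, with IDF left out):

def bm25_tf_part(tf, doc_len, avgdl, k1=1.5, b=0.75):
    norm = 1 - b + b * (doc_len / avgdl)
    return (tf * (k1 + 1)) / (tf + k1 * norm)

print(round(bm25_tf_part(tf=10, doc_len=1000, avgdl=500), 2))          # 1.98 (2x average length)
print(round(bm25_tf_part(tf=10, doc_len=250,  avgdl=500), 2))          # 2.29 (0.5x average length)
print(round(bm25_tf_part(tf=10, doc_len=1000, avgdl=500, b=0.0), 2))   # 2.17 (no length normalization)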

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What are the k1 and b parameters in BM25 and what do they control?”
  2. “Why does BM25 use term frequency saturation?”
  3. “When would you increase or decrease the b parameter?”
  4. “How do you tune BM25 parameters for a specific corpus?”
  5. “What’s the relationship between BM25 and TF-IDF?”

Hints in Layers

Hint 1: Start with TF-IDF Build on your TF-IDF implementation. BM25 is a modification, not a rewrite.

Hint 2: Compute avgdl Once Store average document length when building the index. Update incrementally if possible.

Hint 3: Visualize Saturation Plot the TF saturation curve for different k1 values. It builds intuition.

Hint 4: Use TREC Data The TREC conference provides standard test collections with relevance judgments for evaluation.


Books That Will Help

Topic | Book | Chapter
BM25 derivation | “Search Engines: Information Retrieval in Practice” | Ch. 7
Probabilistic IR | “Introduction to Information Retrieval” | Ch. 11
Evaluation metrics | “Introduction to Information Retrieval” | Ch. 8

Learning milestones:

  1. BM25 formula implemented correctly → You understand the math
  2. You can explain what k1 and b do → You have intuition
  3. You can tune parameters on a test corpus → You’re ready for production

Project 5: Positional Index and Phrase Queries

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Data Structures / Information Retrieval
  • Software or Tool: Search engine library
  • Main Book: “Introduction to Information Retrieval” by Manning, Raghavan, Schütze

What you’ll build: An inverted index that stores word positions, enabling phrase queries (“machine learning”), proximity queries (machine NEAR learning), and highlighting of matching text.

Why it teaches full-text search: Basic inverted indexes only tell you which documents contain a word—not WHERE in the document. Positional indexes unlock phrase search, which is essential for precise queries. This is how Google knows “New York” is different from documents containing “new” and “York” separately.

Core challenges you’ll face:

  • Storing positions efficiently → maps to understanding variable-length encoding
  • Implementing phrase matching → maps to positional intersection algorithms
  • Handling near/proximity queries → maps to slop and distance calculations
  • Highlighting matched text → maps to returning position metadata to UI

Key Concepts:

  • Positional Indexes: “Introduction to Information Retrieval” Ch. 2.4
  • Phrase Queries: “Introduction to Information Retrieval” Ch. 2.4.2
  • Position Compression: “Introduction to Information Retrieval” Ch. 5.3

Difficulty: Advanced | Time estimate: 1-2 weeks | Prerequisites: Project 1 (Basic Inverted Index), Project 4 (BM25)


Real World Outcome

You’ll be able to search for exact phrases and see exactly where they appear:

Example Output:

$ ./phrase-search index ~/Documents/books/

Indexing with positions enabled...
1,247 files indexed
89,432 unique terms
Average positions per term: 4.7
Index size: 8.9 MB (with positions) vs 2.3 MB (without)

$ ./phrase-search query '"machine learning"'

=== PHRASE QUERY: "machine learning" ===

Found in 47 documents (exact phrase match):

#1. ml_intro.txt (Score: 12.4)
    Position: offset 234, line 15
    Context: "...the field of [machine learning] has grown..."

#2. deep_learning.txt (Score: 11.2)
    Position: offset 89, line 5
    Context: "...[machine learning] algorithms can..."

$ ./phrase-search query "machine NEAR/3 learning"

=== PROXIMITY QUERY: machine within 3 words of learning ===

Found in 89 documents:

#1. ai_basics.txt
    "machine-based learning" (distance: 1)
    "machine and deep learning" (distance: 2)

#2. tutorial.txt
    "learning about machine" (distance: 2, reversed)

The Core Question You’re Answering

“How does Google know that searching for ‘machine learning’ shouldn’t return every document that has ‘machine’ somewhere and ‘learning’ somewhere else?”

The answer is positional indexing. By storing WHERE each term appears in each document, you can verify that terms appear consecutively (for phrases) or within a specified distance (for proximity queries).


Concepts You Must Understand First

Stop and research these before coding:

  1. Positional Postings Structure
    • What does a positional posting look like?
    • How much larger is a positional index vs non-positional?
    • When are positions worth the space cost?
    • Book Reference: “Introduction to Information Retrieval” Ch. 2.4
  2. Phrase Matching Algorithm
    • How do you check if positions are consecutive?
    • What’s the time complexity of phrase matching?
    • How do you handle multi-word phrases efficiently?
    • Book Reference: “Introduction to Information Retrieval” Ch. 2.4.2
  3. Variable-Length Integer Encoding
    • Why use variable-length encoding for positions?
    • What is delta encoding for positions?
    • How much space does this save?
    • Book Reference: “Introduction to Information Retrieval” Ch. 5.3

Questions to Guide Your Design

Before implementing, think through these:

  1. Data Structure Design
    • How do you represent positions: linked list, array, or something else?
    • Do you store absolute positions or deltas?
    • What about positions across different fields?
  2. Query Processing
    • How do you parse phrase queries (with quotes)?
    • How do you handle NEAR/n queries?
    • What about mixed boolean and phrase queries?
  3. Memory and Disk
    • Positional indexes are much larger—how do you handle memory pressure?
    • Should you compress positions differently than doc IDs?

Thinking Exercise

Manual Phrase Matching

Given these positional postings:

"machine" in Doc1: positions [5, 23, 89]
"learning" in Doc1: positions [6, 45, 90]

Find all positions where “machine learning” appears as a phrase.

Questions while tracing:

  • Which position pairs satisfy pos_learning = pos_machine + 1?
  • What about “learning machine” (reversed)?
  • What about “machine … learning” with distance ≤ 3?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “How do positional indexes differ from regular inverted indexes?”
  2. “What’s the time complexity of phrase query matching?”
  3. “How much larger is a positional index, typically?”
  4. “How would you implement proximity/NEAR queries?”
  5. “How do you handle phrase queries that span field boundaries?”

Hints in Layers

Hint 1: Start Simple First implement phrases for single-document testing before scaling to a corpus.

Hint 2: Position Representation Store positions as: HashMap<Term, HashMap<DocId, Vec<Position>>>.

Hint 3: Phrase Algorithm For phrase “A B C”: find docs containing all three, then for each doc, find position triplets where pos_B = pos_A + 1 AND pos_C = pos_B + 1.

Hint 4: Delta Encoding Store position deltas: [5, 23, 89] → [5, 18, 66]. Smaller numbers compress better.


Books That Will Help

Topic Book Chapter
Positional indexes “Introduction to Information Retrieval” Ch. 2.4
Position compression “Introduction to Information Retrieval” Ch. 5.3
Phrase queries “Information Retrieval: Implementing and Evaluating” Ch. 6

Learning milestones:

  1. Phrase queries work correctly → You understand positional matching
  2. Proximity queries work → You understand distance calculations
  3. Highlighting shows exact positions → You understand full position propagation

Project 6: Index Compression with Variable-Byte and Delta Encoding

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: C
  • Alternative Programming Languages: Rust, Go, Zig
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Compression / Information Theory
  • Software or Tool: Compression library
  • Main Book: “Managing Gigabytes” by Witten, Moffat, Bell

What you’ll build: A compression layer for your inverted index that reduces size by 50-80% while maintaining fast decompression, using variable-byte encoding, delta encoding, and optionally roaring bitmaps.

Why it teaches full-text search: Real search engines don’t store raw integers—they’d run out of memory! Understanding compression is understanding how Lucene, Elasticsearch, and PostgreSQL GIN indexes achieve their performance. Compression isn’t just about space—smaller data means faster I/O.

Core challenges you’ll face:

  • Implementing variable-byte encoding → maps to bit manipulation and byte-level operations
  • Applying delta encoding to sorted lists → maps to understanding why sorted data compresses well
  • Benchmarking compression vs decompression speed → maps to understanding real-world trade-offs
  • Integrating with your index → maps to API design for transparent compression

Key Concepts:

  • Variable-Byte Encoding: “Introduction to Information Retrieval” Ch. 5.3.1
  • Delta/Gap Encoding: “Introduction to Information Retrieval” Ch. 5.3
  • Gamma and Golomb Codes: “Managing Gigabytes” Ch. 3

Difficulty: Expert Time estimate: 2 weeks Prerequisites: Project 1, understanding of binary representations


Real World Outcome

You’ll see dramatic space savings and understand the speed trade-offs:

Example Output:

$ ./index-compress analyze ./my_index

=== INDEX COMPRESSION ANALYSIS ===

Original index: 45.2 MB
Vocabulary: 125,000 terms
Total postings: 8.3 million

Compression Results:

Method 1: Variable-Byte Encoding
  Compressed size: 23.1 MB (48.9% reduction)
  Encode speed: 45 MB/s
  Decode speed: 380 MB/s

Method 2: Delta + Variable-Byte
  Compressed size: 15.8 MB (65.1% reduction)
  Encode speed: 42 MB/s
  Decode speed: 350 MB/s

Method 3: Delta + Simple-9
  Compressed size: 12.3 MB (72.8% reduction)
  Encode speed: 38 MB/s
  Decode speed: 520 MB/s  ← Faster decode!

Method 4: Roaring Bitmaps
  Compressed size: 11.9 MB (73.7% reduction)
  Encode speed: 52 MB/s
  Decode speed: 890 MB/s  ← Fastest!
  (Best for intersection operations)

Sample postings list compression:
  Term: "the" (appears in 95% of docs)
  Original: 450 KB
  Delta+VByte: 28 KB (93.8% reduction!)

  Term: "xylophone" (appears in 3 docs)
  Original: 12 bytes
  Delta+VByte: 8 bytes (33% reduction)

$ ./index-compress benchmark

Query: "machine learning"
  Uncompressed index: 2.3 ms (disk read: 1.8 ms)
  Compressed index:   1.1 ms (disk read: 0.4 ms, decompress: 0.3 ms)

Result: Compression makes queries FASTER due to less I/O!

The Core Question You’re Answering

“How can search engines store billions of document IDs without running out of memory?”

The answer is compression. By exploiting the sorted nature of postings lists (using delta encoding) and the fact that small numbers are more common (using variable-length encoding), you can reduce index size by 70-90%.
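
As a minimal sketch of those two ideas combined (Python, illustrative names): convert the sorted postings list to gaps, then encode each gap with 7 data bits per byte, using the high bit as a continuation flag. Real implementations differ in details (for example, which byte carries the flag); this is just one variant.

# Minimal sketch: delta (gap) encoding + variable-byte encoding of a postings list.

def vbyte_encode(n: int) -> bytes:
    """Encode one non-negative integer, 7 data bits per byte.
    High bit set = more bytes follow; high bit clear = last byte."""
    out = []
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # continuation
        else:
            out.append(byte)          # final byte
            return bytes(out)

def vbyte_decode(data: bytes):
    """Decode a stream of vbyte-encoded integers."""
    values, n, shift = [], 0, 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            values.append(n)
            n, shift = 0, 0
    return values

def compress_postings(doc_ids):
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return b"".join(vbyte_encode(g) for g in gaps)

def decompress_postings(data):
    doc_ids, total = [], 0
    for gap in vbyte_decode(data):
        total += gap
        doc_ids.append(total)
    return doc_ids

postings = [5, 23, 89, 1000, 1003, 100000]
blob = compress_postings(postings)
print(len(blob), "bytes instead of", len(postings) * 4)   # 9 bytes vs 24
assert decompress_postings(blob) == postings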


Concepts You Must Understand First

Stop and research these before coding:

  1. Variable-Byte (VByte) Encoding
    • How do you encode an integer using continuation bits?
    • What’s the maximum overhead for small vs large numbers?
    • Why use the high bit as a continuation flag?
    • Book Reference: “Introduction to Information Retrieval” Ch. 5.3.1
  2. Delta/Gap Encoding
    • Why does delta encoding help sorted lists?
    • What’s the distribution of gap sizes in real postings?
    • When does delta encoding hurt?
    • Book Reference: “Introduction to Information Retrieval” Ch. 5.3
  3. Block-Based Compression
    • What are Simple-9 and PForDelta?
    • Why compress in blocks rather than one-at-a-time?
    • How does SIMD acceleration help?
    • Book Reference: “Managing Gigabytes” Ch. 3 or Lemire’s papers

Questions to Guide Your Design

Before implementing, think through these:

  1. API Design
    • Should compression be transparent to the index user?
    • How do you handle random access to compressed data?
    • Do you compress in memory, on disk, or both?
  2. Block Size
    • What’s the trade-off for different block sizes?
    • How do you handle the last partial block?
    • What metadata do you need per block?
  3. Benchmarking
    • How do you measure compression ratio fairly?
    • What’s the right metric: space, encode speed, or decode speed?
    • How do you test with realistic data?

Thinking Exercise

Manual VByte Encoding

Encode these integers using variable-byte encoding (7 bits of data per byte, high bit = continuation):

5, 127, 128, 16384, 2097152

Questions while encoding:

  • How many bytes does each number require?
  • What’s the pattern for numbers crossing byte boundaries?
  • What’s the maximum number encodable in 4 bytes?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Explain variable-byte encoding and when you’d use it.”
  2. “Why does delta encoding help compress postings lists?”
  3. “What’s the trade-off between compression ratio and decompression speed?”
  4. “How do roaring bitmaps work and when are they useful?”
  5. “How would you handle compressed random access?”

Hints in Layers

Hint 1: Start with VByte Implement variable-byte encoding first. It’s simpler than block-based methods.

Hint 2: Test Incrementally Encode and decode single integers before moving to arrays.

Hint 3: Use Benchmarks Create benchmark data with known characteristics (uniform, skewed, etc.).

Hint 4: Consider Libraries After building your own, compare against streamvbyte or roaring libraries.


Books That Will Help

Topic Book Chapter
Variable-byte encoding “Introduction to Information Retrieval” Ch. 5.3
Advanced compression “Managing Gigabytes” Ch. 3
Roaring bitmaps Lemire paper “Roaring Bitmaps”

Learning milestones:

  1. VByte encoding/decoding works → You understand variable-length encoding
  2. Delta + VByte compresses significantly → You understand gap encoding
  3. Compressed queries are faster → You understand the I/O trade-off

Project 7: Fuzzy Search with Levenshtein Automata

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: C++, Go, Java
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Automata Theory / String Algorithms
  • Software or Tool: Fuzzy search library
  • Main Book: “Flexible Pattern Matching in Strings” by Navarro & Raffinot

What you’ll build: A fuzzy search engine that finds matches with spelling errors using Levenshtein distance, efficiently via Levenshtein automata that integrate with your FST term dictionary.

Why it teaches full-text search: Typo tolerance is essential for good search UX. The naive approach (computing edit distance against every term) scales linearly with vocabulary size, which is far too slow for large dictionaries. A Levenshtein automaton accepts every string within the edit-distance budget, so you can intersect it with the term dictionary in a single traversal and visit only the small fraction of terms that could possibly match.

Core challenges you’ll face:

  • Understanding Levenshtein distance → maps to dynamic programming foundations
  • Building a Levenshtein automaton → maps to finite automata construction
  • Integrating with FST → maps to automaton intersection algorithms
  • Ranking fuzzy matches → maps to combining edit distance with relevance

Key Concepts:

  • Levenshtein Distance: Classic dynamic programming algorithm
  • Levenshtein Automata: Schulz & Mihov paper (2002)
  • FST Intersection: Lucene’s FuzzyQuery implementation

Difficulty: Expert Time estimate: 2-3 weeks Prerequisites: Project 1, Project 6, automata theory basics


Real World Outcome

Your search engine will gracefully handle typos:

Example Output:

$ ./fuzzy-search query "elefant" --max-edits 2

=== FUZZY QUERY: "elefant" (max 2 edits) ===

Automaton constructed: 847 states, 2,341 transitions
FST traversal: visited 1,247 of 125,000 terms (1.0%)

Matches found (sorted by edit distance, then score):

Edit Distance 1:
  "elegant" → 23 documents (Score: 8.7)
    └── 1 substitution: f→g

Edit Distance 2:
  "elephant" → 45 documents (Score: 12.3)
    └── 1 substitution (f→p) + 1 insertion (h)

Query time: 0.8 ms (vs 340 ms for naive approach)

$ ./fuzzy-search query "progamming" --auto-correct

=== AUTO-CORRECT SUGGESTION ===

Did you mean: "programming"? (1 edit: insert 'r')

Proceeding with corrected query...
Found 1,247 documents for "programming"

The Core Question You’re Answering

“How can a search engine instantly find ‘elephant’ when I type ‘elefant’, without checking every word in its vocabulary?”

Levenshtein automata are the answer. Instead of computing edit distance to each of 100,000 terms, you build a finite automaton that accepts all strings within edit distance N of your query, then intersect it with your term dictionary in a single traversal.
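
The automaton construction itself is in the Schulz & Mihov paper; the prerequisite (see Hint 1 below) is the classic dynamic-programming edit distance. A minimal Python sketch:

# Minimal sketch: classic Levenshtein distance via dynamic programming.
# dp[i][j] = edits needed to turn the first i chars of a into the first j chars of b.

def levenshtein(a: str, b: str) -> int:
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i                      # delete all of a[:i]
    for j in range(len(b) + 1):
        dp[0][j] = j                      # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # match or substitution
    return dp[len(a)][len(b)]

print(levenshtein("elefant", "elephant"))   # 2
print(levenshtein("kitten", "sitting"))     # 3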


Concepts You Must Understand First

Stop and research these before coding:

  1. Levenshtein Distance
    • What operations does edit distance count?
    • What’s the dynamic programming algorithm?
    • What’s the time/space complexity?
    • Book Reference: “Introduction to Algorithms” or any algorithms textbook
  2. Finite Automata (NFAs and DFAs)
    • What’s the difference between NFA and DFA?
    • How do you convert NFA to DFA?
    • What is automaton intersection?
    • Book Reference: “Introduction to the Theory of Computation” by Sipser
  3. Levenshtein Automata Construction
    • How do you build an automaton that accepts all strings within distance N?
    • What’s the state structure?
    • Why is this more efficient than pairwise comparison?
    • Book Reference: Schulz & Mihov (2002) paper

Questions to Guide Your Design

Before implementing, think through these:

  1. Automaton Size
    • How does automaton size grow with query length and edit distance?
    • Should you limit max edit distance?
    • How do you handle very long queries?
  2. Integration Strategy
    • How do you traverse FST and automaton simultaneously?
    • What’s the intersection algorithm?
    • How do you collect matching terms?
  3. Ranking Fuzzy Matches
    • Should exact matches always rank first?
    • How do you combine edit distance with BM25 score?
    • What weight does each edit add?

Thinking Exercise

Manual Edit Distance Calculation

Calculate the Levenshtein distance between:

  • “kitten” and “sitting”
  • “algorithm” and “altruistic”

Fill in the DP table for the first pair.

Questions while calculating:

  • Which cells represent substitutions vs insertions vs deletions?
  • What’s the minimum sequence of operations?
  • How would an automaton represent “all strings within distance 2 of kitten”?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “Explain Levenshtein distance and the DP algorithm.”
  2. “Why are Levenshtein automata more efficient than pairwise comparison?”
  3. “How would you implement typo tolerance in a search engine?”
  4. “What’s the relationship between edit distance and search quality?”
  5. “How do you balance fuzzy matching with relevance ranking?”

Hints in Layers

Hint 1: Start with DP Implement the basic Levenshtein DP algorithm first. Understand it deeply.

Hint 2: Use Existing Libraries The fst crate in Rust has Levenshtein automata built in. Study its API.

Hint 3: Limit Edit Distance Edit distance 2 catches most typos. Beyond that, you get too many false positives.

Hint 4: Study Lucene Lucene’s FuzzyQuery is a masterclass implementation. Read the source.


Books That Will Help

Topic Book Chapter
Edit distance DP “Introduction to Algorithms” (CLRS) Ch. 15
Automata theory “Introduction to Theory of Computation” Ch. 1-2
Fuzzy matching Schulz & Mihov paper “Fast String Correction”

Learning milestones:

  1. DP edit distance works → You understand the basic algorithm
  2. Automaton traversal works → You understand NFA/DFA
  3. FST intersection finds matches → You’ve implemented production-quality fuzzy search

Project 8: Build a Trie and FST from Scratch

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: C, C++, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Data Structures / Automata
  • Software or Tool: Prefix data structure library
  • Main Book: “Algorithms, Fourth Edition” by Sedgewick & Wayne

What you’ll build: A trie data structure for prefix matching, then evolve it into a Finite State Transducer (FST) that shares both prefixes AND suffixes—the data structure Lucene uses for its term dictionary.

Why it teaches full-text search: The term dictionary is consulted on every single query. It must be FAST and SMALL. Understanding tries and FSTs teaches you how Lucene achieves sub-millisecond term lookups with million-term vocabularies while using minimal RAM.

Core challenges you’ll face:

  • Implementing a basic trie → maps to recursive tree structures and pointer manipulation
  • Adding value outputs (trie → transducer) → maps to understanding FST output semantics
  • Sharing suffixes (trie → FST) → maps to understanding automaton minimization
  • Persisting to disk efficiently → maps to compact binary representation

Key Concepts:

  • Tries: “Algorithms” Ch. 5.2 by Sedgewick
  • FST Construction: Mihov & Maurel paper (2001)
  • Lucene’s FST: Mike McCandless blog series

Difficulty: Expert Time estimate: 2-3 weeks Prerequisites: Basic data structures, recursion, pointers/references


Real World Outcome

You’ll have your own FST implementation that rivals Lucene’s:

Example Output:

$ ./fst-builder build --input words.txt --output words.fst

=== FST BUILD ===

Input: 235,886 words
Building trie...
  Nodes: 1,847,342
  Memory: 89.3 MB

Minimizing to FST (sharing suffixes)...
  States: 328,451 (82% reduction!)
  Arcs: 412,893
  Memory: 4.2 MB (95% reduction from naive trie!)

Writing to disk...
  File size: 3.8 MB

$ ./fst-builder lookup --fst words.fst "algorithm"
Found: "algorithm" → output: 42891 (postings offset)
Lookup time: 0.4 μs

$ ./fst-builder prefix --fst words.fst "algo"
Prefix matches (5 found):
  "algo" → 42890
  "algorithm" → 42891
  "algorithms" → 42892
  "algorithmic" → 42893
  "algorithmically" → 42894

$ ./fst-builder stats --fst words.fst

=== FST STATISTICS ===

Total states: 328,451
Accepting states: 235,886
Total arcs: 412,893
Avg arcs per state: 1.26

Compression analysis:
  Raw data (term + offset): 4.7 MB
  Trie representation: 89.3 MB (19x larger!)
  FST representation: 3.8 MB (19% smaller than raw!)

How is FST smaller than raw data?
  → Prefix sharing: "algorithm", "algorithms" share "algorithm"
  → Suffix sharing: "running", "jumping" share "ing"
  → Variable-length arc encoding

The Core Question You’re Answering

“How does Lucene look up any of its millions of terms in microseconds while using only a few MB of RAM?”

The Finite State Transducer is the answer. It’s like a trie (shares prefixes) but also shares suffixes, making it incredibly compact. And it outputs values (like file offsets) as you traverse, making it a complete term → offset map.
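
Suffix sharing and outputs are what make the FST special, but the starting point (see Hint 1 below) is a plain trie. A minimal Python sketch of insert, exact lookup, and prefix enumeration, with illustrative values standing in for postings offsets:

# Minimal trie sketch: insert, lookup, and prefix enumeration.
# An FST adds per-arc outputs and merges equivalent suffix states on top of this.

class TrieNode:
    def __init__(self):
        self.children = {}      # char -> TrieNode
        self.value = None       # payload for complete words (e.g. postings offset)
        self.is_word = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word, value):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word, node.value = True, value

    def lookup(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.value if node.is_word else None

    def with_prefix(self, prefix):
        """Yield (word, value) for every word starting with prefix."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return
        stack = [(prefix, node)]
        while stack:
            word, n = stack.pop()
            if n.is_word:
                yield word, n.value
            for ch, child in sorted(n.children.items()):
                stack.append((word + ch, child))

t = Trie()
for i, w in enumerate(["algo", "algorithm", "algorithms", "algorithmic"]):
    t.insert(w, 42890 + i)
print(t.lookup("algorithm"))          # 42891
print(list(t.with_prefix("algo")))    # all four entries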


Concepts You Must Understand First

Stop and research these before coding:

  1. Trie Basics
    • What is the structure of a trie node?
    • How do you insert, lookup, and delete?
    • What’s the space overhead of a naive trie?
    • Book Reference: “Algorithms” Ch. 5.2 by Sedgewick
  2. Automaton Minimization
    • What makes two automaton states equivalent?
    • What’s the algorithm for minimization?
    • How much does minimization reduce size?
    • Book Reference: “Introduction to Theory of Computation” by Sipser
  3. FST Output Semantics
    • How do outputs accumulate along paths?
    • What’s the difference between FST and DAWG (FSA)?
    • How do you output shared prefixes efficiently?
    • Book Reference: Mike McCandless blog: “Using FSTs in Lucene”

Questions to Guide Your Design

Before implementing, think through these:

  1. Node Representation
    • How do you represent children: array, hashmap, or sorted list?
    • What’s the trade-off between memory and speed?
    • How do you represent arc labels efficiently?
  2. Construction Algorithm
    • Can you build incrementally or must you have all inputs upfront?
    • Lucene requires sorted input—why?
    • How do you minimize on-the-fly vs post-construction?
  3. Persistence
    • What binary format should you use?
    • How do you memory-map the FST?
    • What metadata do you need in the file header?

Thinking Exercise

Build a Mini-FST by Hand

Given these words: cat, car, card, care, careful, carefully

  1. Draw the trie
  2. Identify shared suffixes
  3. Draw the minimized FST
  4. Assign outputs (values 1-6) and show output distribution

Questions while building:

  • Which states can be merged in the trie?
  • How do you split the output between shared arcs?
  • How many states/arcs in trie vs FST?

The Interview Questions They’ll Ask

Prepare to answer these:

  1. “What is a trie and what are its time/space characteristics?”
  2. “How does an FST differ from a trie?”
  3. “Why does Lucene require sorted input for FST construction?”
  4. “How do FST outputs work and how are they distributed across arcs?”
  5. “What’s the difference between an FSA and an FST?”

Hints in Layers

Hint 1: Start with Trie Build a working trie first. Don’t worry about minimization yet.

Hint 2: Study the fst Crate Rust’s fst crate by BurntSushi is an excellent reference implementation.

Hint 3: Sorted Input Build from sorted input. This enables single-pass construction with minimization.

Hint 4: Test with Small Vocabularies Debug with 5-10 words where you can verify by hand.


Books That Will Help

Topic Book Chapter
Tries “Algorithms” by Sedgewick Ch. 5.2
Automata minimization “Theory of Computation” by Sipser Ch. 1
FST implementation McCandless blog series “Using FSTs in Lucene”

Learning milestones:

  1. Trie works for insert/lookup → You understand prefix trees
  2. FST minimizes and shares suffixes → You understand automaton minimization
  3. FST persists and memory-maps → You’re ready to build Lucene-scale systems

Project 9: Segment-Based Index with Background Merging

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: Go, Java, C++
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 4: Expert
  • Knowledge Area: Systems Programming / Concurrency
  • Software or Tool: Search engine core
  • Main Book: “Designing Data-Intensive Applications” by Kleppmann

What you’ll build: A Lucene-style segmented index where new documents go to in-memory buffers, periodically flush to immutable segments, and background threads merge segments together.

Why it teaches full-text search: Real search engines never update documents in place—they use append-only segments. Understanding this teaches you how Elasticsearch achieves both fast indexing AND fast search, and why “near-real-time” search exists.

Core challenges you’ll face:

  • Managing concurrent reads and writes → maps to understanding lockless data structures
  • Implementing segment merging → maps to understanding merge policies
  • Handling deletes with tombstones → maps to understanding soft deletes and compaction
  • Making search see new documents → maps to understanding refresh and commit

Key Concepts:

  • LSM Trees: “Designing Data-Intensive Applications” Ch. 3
  • Lucene Segments: Lucene documentation
  • Merge Policies: “Information Retrieval: Implementing and Evaluating” Ch. 5

Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Projects 1-6, concurrency experience


Real World Outcome

Example Output:

$ ./segment-index start --data-dir ./index

=== SEGMENT-BASED INDEX ===

In-memory buffer: 0 documents
Segments on disk: 0
Background merger: running

$ ./segment-index add-batch ./docs/*.txt
Adding 10,000 documents...
Buffer filled, flushing to segment_001.seg
Buffer filled, flushing to segment_002.seg
10,000 documents indexed in 2.3 seconds

$ ./segment-index status

Current state:
  In-memory buffer: 423 documents (not yet searchable)
  Segments:
    segment_001.seg: 4,500 docs, 12.3 MB
    segment_002.seg: 4,500 docs, 11.9 MB
    segment_003.seg: 577 docs, 1.4 MB

  Background merge in progress:
    Merging segment_001 + segment_002 → segment_004
    Progress: 67%

$ ./segment-index refresh
Refreshing search view...
423 documents now searchable (flushed to segment_005.seg)

$ ./segment-index delete --query "status:obsolete"
Marked 234 documents as deleted (tombstoned)
Documents still in segments but filtered from results

The Core Question You’re Answering

“How does Elasticsearch add new documents instantly without rebuilding the entire index?”

The answer is segments. New documents go to an in-memory buffer, which periodically flushes to a new immutable segment on disk. Searches query ALL segments and merge results. Background merging consolidates small segments into larger ones.
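
A minimal Python sketch of that write path (structures are illustrative; real segments live on disk and are merged in the background): documents accumulate in a buffer, a refresh freezes the buffer into an immutable segment, deletes are tombstones, and a search fans out over every segment.

# Minimal sketch of segment-based indexing: in-memory buffer -> immutable segments.
# Here segments are frozen in-memory inverted indexes instead of on-disk files.

class Segment:
    def __init__(self, postings):
        self.postings = postings            # term -> set of doc_ids (immutable)

    def search(self, term):
        return self.postings.get(term, set())

class SegmentedIndex:
    def __init__(self, flush_every=3):
        self.flush_every = flush_every
        self.buffer = []                    # (doc_id, text), not yet searchable
        self.segments = []
        self.deleted = set()                # tombstoned doc_ids

    def add(self, doc_id, text):
        self.buffer.append((doc_id, text))
        if len(self.buffer) >= self.flush_every:
            self.refresh()

    def refresh(self):
        """Flush the buffer into a new immutable segment (makes docs searchable)."""
        if not self.buffer:
            return
        postings = {}
        for doc_id, text in self.buffer:
            for term in text.lower().split():
                postings.setdefault(term, set()).add(doc_id)
        self.segments.append(Segment(postings))
        self.buffer = []

    def delete(self, doc_id):
        self.deleted.add(doc_id)            # segments stay untouched (soft delete)

    def search(self, term):
        hits = set()
        for seg in self.segments:           # scatter across all segments
            hits |= seg.search(term)
        return hits - self.deleted          # filter tombstones at query time

idx = SegmentedIndex()
idx.add(1, "rust search engine")
idx.add(2, "segment merging in lucene")
idx.add(3, "near real time search")
idx.add(4, "search clusters")              # sits in the buffer until refresh
print(idx.search("search"))                # {1, 3} -- doc 4 not yet visible
idx.refresh()
idx.delete(1)
print(idx.search("search"))                # {3, 4}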


Concepts You Must Understand First

  1. Log-Structured Merge Trees (LSM)
    • Why are writes faster than updates?
    • What is write amplification?
    • Book Reference: “Designing Data-Intensive Applications” Ch. 3
  2. Segment Merging
    • Why merge segments?
    • What’s a merge policy?
    • Book Reference: Lucene documentation on TieredMergePolicy
  3. Concurrency Control
    • How do readers see consistent snapshots?
    • How do you delete without blocking readers?
    • Book Reference: “Designing Data-Intensive Applications” Ch. 7

The Interview Questions They’ll Ask

  1. “What is a segment in Lucene and why are they immutable?”
  2. “How does Elasticsearch achieve near-real-time search?”
  3. “What is a merge policy and why does it matter?”
  4. “How are deletes handled in a segment-based index?”
  5. “What’s the difference between refresh and commit?”

Learning milestones:

  1. Multi-segment search works → You understand segment architecture
  2. Background merging runs without blocking → You understand concurrency
  3. Deletes filter correctly → You understand tombstones

Project 10: PostgreSQL Full-Text Search Integration

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: SQL / Python
  • Alternative Programming Languages: Go, Rust (with postgres libs)
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Databases / SQL
  • Software or Tool: PostgreSQL
  • Main Book: PostgreSQL Documentation (Ch. 12: Full Text Search)

What you’ll build: A search application using PostgreSQL’s built-in full-text search with tsvector, tsquery, GIN indexes, and custom ranking—understanding when PostgreSQL is “good enough” vs when you need Elasticsearch.

Why it teaches full-text search: PostgreSQL’s FTS implementation is elegant and powerful. Understanding it teaches you that you don’t always need a separate search engine, and shows how GIN indexes implement inverted indexes inside a relational database.

Core challenges you’ll face:

  • Understanding tsvector and tsquery → maps to tokenization and query parsing
  • Creating GIN indexes → maps to understanding inverted index storage
  • Tuning text search configuration → maps to analyzer configuration
  • Combining FTS with SQL queries → maps to practical search integration

Key Concepts:

  • tsvector/tsquery: PostgreSQL Ch. 12.1-12.2
  • GIN Indexes: PostgreSQL Ch. 12.9
  • Ranking: PostgreSQL Ch. 12.3

Difficulty: Intermediate Time estimate: 1 week Prerequisites: Basic SQL, Project 1 understanding


Real World Outcome

Example Output:

-- Create a table with full-text search
CREATE TABLE articles (
    id SERIAL PRIMARY KEY,
    title TEXT,
    body TEXT,
    tsv tsvector GENERATED ALWAYS AS (
        setweight(to_tsvector('english', title), 'A') ||
        setweight(to_tsvector('english', body), 'B')
    ) STORED
);

CREATE INDEX articles_tsv_idx ON articles USING GIN (tsv);

-- Search with ranking
SELECT
    id,
    title,
    ts_rank(tsv, query) AS rank,
    ts_headline('english', body, query) AS snippet
FROM
    articles,
    to_tsquery('english', 'python & machine & learning') query
WHERE
    tsv @@ query
ORDER BY
    rank DESC
LIMIT 10;

-- Results:
 id |              title              | rank  |         snippet
----+---------------------------------+-------+---------------------------
 42 | Python ML Tutorial              | 0.875 | ...using <b>Python</b>...
 17 | Machine Learning Basics         | 0.654 | ...<b>machine learning</b>...

The Core Question You’re Answering

“When do I need Elasticsearch, and when is PostgreSQL full-text search good enough?”

PostgreSQL FTS is sufficient for many applications: blogs, documentation sites, e-commerce with <1M products. Understanding its capabilities helps you make informed decisions.


The Interview Questions They’ll Ask

  1. “What is a tsvector in PostgreSQL?”
  2. “When would you use PostgreSQL FTS vs Elasticsearch?”
  3. “How does a GIN index differ from a B-tree index?”
  4. “How do you handle multi-language search in PostgreSQL?”
  5. “What are the limitations of PostgreSQL full-text search?”

Learning milestones:

  1. Basic FTS queries work → You understand tsvector/tsquery
  2. GIN index speeds up queries → You understand indexing
  3. You can articulate PostgreSQL vs Elasticsearch trade-offs → You’re ready for architecture decisions

Project 11: Query Parser with Boolean Logic

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: Python, Go, Java
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Parsing / Language Theory
  • Software or Tool: Parser library
  • Main Book: “Writing a C Compiler” by Nora Sandler (parsing chapters)

What you’ll build: A query parser that handles boolean operators (AND, OR, NOT), parentheses for grouping, phrase queries, field-specific searches (title:python), and wildcards—like Lucene’s query syntax.

Why it teaches full-text search: The gap between user input and executable query is huge. Users type python AND (web OR api) title:"tutorial", and your system must parse this into a query plan that efficiently executes against the index.

Core challenges you’ll face:

  • Lexical analysis (tokenizing queries) → maps to understanding tokenizers
  • Parsing with operator precedence → maps to recursive descent or Pratt parsing
  • Building a query AST → maps to abstract syntax tree construction
  • Optimizing query execution → maps to query planning

Key Concepts:

  • Recursive Descent Parsing: “Writing a C Compiler” Ch. 2
  • Pratt Parsing: Pratt’s original paper or Crafting Interpreters Ch. 17
  • Query Optimization: “Introduction to Information Retrieval” Ch. 7

Difficulty: Advanced Time estimate: 2 weeks Prerequisites: Basic parsing knowledge, Project 1


Real World Outcome

Example Output:

$ ./query-parser parse 'python AND (web OR api) title:"machine learning"'

=== QUERY PARSE TREE ===

Input: python AND (web OR api) title:"machine learning"

Tokens:
  [TERM:python] [AND] [LPAREN] [TERM:web] [OR] [TERM:api] [RPAREN]
  [FIELD:title] [PHRASE:"machine learning"]

AST:
  AND
  ├── TERM("python", field=*)
  ├── OR
  │   ├── TERM("web", field=*)
  │   └── TERM("api", field=*)
  └── PHRASE("machine learning", field=title)

Optimized Plan:
  1. Evaluate PHRASE on title field (most selective)
  2. Intersect with "python" postings
  3. Union "web" and "api" postings
  4. Intersect result of step 2 with step 3

Estimated cost: 0.34 (fast query)

$ ./query-parser parse 'foo AND NOT bar OR baz'

=== OPERATOR PRECEDENCE ===

Without parentheses, parsed as:
  (foo AND (NOT bar)) OR baz

NOT binds tighter than AND, which binds tighter than OR.

If you meant: foo AND NOT (bar OR baz)
Use explicit parentheses!

The Core Question You’re Answering

“How does a search engine turn ‘python AND (web OR api)’ into an efficient query plan?”

Parsing. You must tokenize the query string, build an abstract syntax tree respecting operator precedence, and then optimize execution order based on selectivity estimates.
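
A minimal recursive-descent sketch in Python for the boolean core (terms, AND, OR, NOT, parentheses); fields, phrases, and wildcards would be layered on top. The grammar and names are illustrative:

# Minimal sketch: recursive-descent parser for boolean queries.
# Grammar (lowest to highest precedence):
#   or_expr  := and_expr ("OR" and_expr)*
#   and_expr := not_expr ("AND" not_expr)*
#   not_expr := "NOT" not_expr | "(" or_expr ")" | TERM
import re

def tokenize(query):
    return re.findall(r'\(|\)|\w+', query)

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected=None):
        tok = self.tokens[self.pos]
        if expected and tok != expected:
            raise SyntaxError(f"expected {expected}, got {tok}")
        self.pos += 1
        return tok

    def parse(self):
        node = self.or_expr()
        if self.peek() is not None:
            raise SyntaxError(f"unexpected token {self.peek()}")
        return node

    def or_expr(self):
        node = self.and_expr()
        while self.peek() == "OR":
            self.eat("OR")
            node = ("OR", node, self.and_expr())
        return node

    def and_expr(self):
        node = self.not_expr()
        while self.peek() == "AND":
            self.eat("AND")
            node = ("AND", node, self.not_expr())
        return node

    def not_expr(self):
        if self.peek() == "NOT":
            self.eat("NOT")
            return ("NOT", self.not_expr())
        if self.peek() == "(":
            self.eat("(")
            node = self.or_expr()
            self.eat(")")
            return node
        return ("TERM", self.eat())

print(Parser(tokenize("foo AND NOT bar OR baz")).parse())
# ('OR', ('AND', ('TERM', 'foo'), ('NOT', ('TERM', 'bar'))), ('TERM', 'baz'))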


The Interview Questions They’ll Ask

  1. “How would you implement a query parser for boolean search?”
  2. “What is operator precedence and how do you handle it?”
  3. “How would you extend the parser for new syntax (e.g., boosting)?”
  4. “What optimizations can you do on a query AST before execution?”
  5. “How do you handle user errors in queries gracefully?”

Learning milestones:

  1. Boolean queries parse correctly → You understand recursive descent
  2. Operator precedence works → You understand grammar design
  3. Query execution is optimized → You understand query planning

Project 12: Autocomplete with Prefix Matching

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: TypeScript
  • Alternative Programming Languages: Rust, Python, Go
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Data Structures / UX
  • Software or Tool: Autocomplete service
  • Main Book: “Algorithms” by Sedgewick (Tries chapter)

What you’ll build: A fast autocomplete system that suggests completions as users type, using tries or FSTs for prefix matching, with popularity-weighted ranking.

Why it teaches full-text search: Autocomplete is the first thing users interact with. It must be FAST (<50ms) and SMART (suggest popular/relevant completions). This teaches you prefix data structures and the importance of precomputation.

Core challenges you’ll face:

  • Fast prefix lookup → maps to trie/FST traversal
  • Ranking suggestions → maps to popularity and recency scoring
  • Handling typos in prefix → maps to fuzzy prefix matching
  • Scaling to millions of terms → maps to memory-efficient data structures

Key Concepts:

  • Prefix Trees: “Algorithms” Ch. 5.2
  • Completion Scoring: Meilisearch’s approach
  • Edge N-grams: Elasticsearch documentation

Difficulty: Intermediate Time estimate: 1 week Prerequisites: Project 8 (Tries/FST)


Real World Outcome

Example Output:

$ ./autocomplete serve --data products.json --port 8080
Loading 1.2M products...
Building prefix index...
Ready! Listening on :8080

$ curl "localhost:8080/suggest?q=iph"

{
  "query": "iph",
  "suggestions": [
    {"text": "iphone 15 pro", "score": 0.95, "category": "Electronics"},
    {"text": "iphone 15", "score": 0.92, "category": "Electronics"},
    {"text": "iphone case", "score": 0.78, "category": "Accessories"},
    {"text": "iphone charger", "score": 0.71, "category": "Accessories"}
  ],
  "time_ms": 2.3
}

$ curl "localhost:8080/suggest?q=iphon&fuzzy=true"

{
  "query": "iphon",
  "suggestions": [
    {"text": "iphone 15 pro", "score": 0.93, "fuzzy_match": true},
    {"text": "iphone 15", "score": 0.90, "fuzzy_match": true}
  ]
}

The Core Question You’re Answering

“How does Google show suggestions before you even finish typing?”

Prefix data structures (tries/FSTs) enable instant prefix lookup. Combined with popularity scoring and precomputation, you can suggest completions in under 10ms.
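
A minimal sketch of popularity-weighted prefix suggestion (shown in Python for brevity, though the project targets TypeScript); a production system would back this with the trie/FST from Project 8 rather than a sorted list:

# Minimal sketch: prefix autocomplete over a sorted term list with popularity scores.
# Because the list is sorted, all completions sharing a prefix form a contiguous run.
import bisect

class Autocomplete:
    def __init__(self, term_scores):
        # term_scores: {completion: popularity score}, illustrative data
        self.terms = sorted(term_scores)
        self.scores = term_scores

    def suggest(self, prefix, limit=4):
        prefix = prefix.lower()
        start = bisect.bisect_left(self.terms, prefix)
        matches = []
        for term in self.terms[start:]:
            if not term.startswith(prefix):
                break
            matches.append(term)
        matches.sort(key=lambda t: self.scores[t], reverse=True)   # popularity ranking
        return matches[:limit]

ac = Autocomplete({
    "iphone 15 pro": 0.95,
    "iphone 15": 0.92,
    "iphone case": 0.78,
    "iphone charger": 0.71,
    "ipad": 0.80,
})
print(ac.suggest("iph"))   # ['iphone 15 pro', 'iphone 15', 'iphone case', 'iphone charger']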


The Interview Questions They’ll Ask

  1. “How would you design an autocomplete system for 1 billion queries?”
  2. “What data structure would you use and why?”
  3. “How do you handle typos in autocomplete?”
  4. “How do you rank suggestions—just popularity, or something smarter?”
  5. “How do you update the suggestion index in real-time?”

Learning milestones:

  1. Prefix suggestions work fast → You understand tries
  2. Ranking feels natural → You understand scoring
  3. System handles typos → You understand fuzzy prefix matching

Project 13: Faceted Search and Aggregations

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Python, Java
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: E-commerce / Analytics
  • Software or Tool: Search with facets
  • Main Book: “Solr in Action” by Grainger & Potter

What you’ll build: A search engine that computes facet counts (e.g., “Color: Red (42), Blue (38)”) and aggregations (e.g., “Average price: $45.99”) alongside search results—like every e-commerce site.

Why it teaches full-text search: Facets transform search from “find things” to “explore a catalog.” Understanding how to efficiently compute facet counts teaches you about doc values, field data, and the trade-offs between indexing and query-time computation.

Core challenges you’ll face:

  • Computing facet counts efficiently → maps to understanding doc values vs inverted index
  • Hierarchical facets (Category > Subcategory) → maps to tree aggregation
  • Range facets (price ranges) → maps to numeric indexing
  • Combining facets with search filters → maps to filter caching

Difficulty: Advanced Time estimate: 2 weeks Prerequisites: Project 1, Project 4


Real World Outcome

Example Output:

$ ./facet-search query "laptop" --facets category,brand,price_range

=== SEARCH: "laptop" ===

Results: 1,247 laptops found

FACETS:

Category:
  ├── Electronics (1,247)
  │   ├── Computers (1,180)
  │   │   ├── Laptops (1,180)
  │   │   └── Accessories (67)

Brand:
  ├── Apple (234)
  ├── Dell (189)
  ├── Lenovo (156)
  ├── HP (142)
  └── [+12 more...]

Price Range:
  ├── $0-500 (312)
  ├── $500-1000 (456)
  ├── $1000-2000 (367)
  └── $2000+ (112)

AGGREGATIONS:
  ├── Avg Price: $1,234.56
  ├── Min Price: $299.00
  └── Max Price: $4,999.00

$ ./facet-search query "laptop" --filter "brand:Apple" --filter "price:[1000 TO 2000]"

=== FILTERED SEARCH ===

Results: 87 Apple laptops between $1000-$2000

Facet counts updated to reflect filters!

The Core Question You’re Answering

“How does Amazon show ‘1,234 results’ with facet breakdowns by category, brand, and price—all in milliseconds?”

The answer is doc values (columnar storage for field values) and careful pre-computation. Facet counts are computed by scanning a compact column of values, not by re-querying the inverted index.
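
A minimal Python sketch of query-time facet counting over a columnar layout (field names and values are illustrative): given the doc IDs that matched, plain-value and range facets are computed by scanning the value columns, not by re-querying the inverted index.

# Minimal sketch: facet counts from columnar per-document values ("doc values").
from collections import Counter

# Columnar storage: one array per field, indexed by doc_id (illustrative layout).
doc_values = {
    "brand": ["Apple", "Dell", "Apple", "Lenovo", "Apple", "Dell"],
    "price": [1299, 899, 1999, 650, 2399, 1100],
}

def facet_counts(matching_doc_ids, field):
    """Count field values over just the documents that matched the query."""
    return Counter(doc_values[field][doc_id] for doc_id in matching_doc_ids)

def range_facets(matching_doc_ids, field, edges):
    """Bucket numeric values into [edges[i], edges[i+1]) ranges."""
    buckets = Counter()
    for doc_id in matching_doc_ids:
        v = doc_values[field][doc_id]
        for lo, hi in zip(edges, edges[1:]):
            if lo <= v < hi:
                buckets[f"${lo}-{hi}"] += 1
                break
    return buckets

matches = [0, 1, 2, 4, 5]                      # doc_ids returned by the search
print(facet_counts(matches, "brand"))          # Counter({'Apple': 3, 'Dell': 2})
print(range_facets(matches, "price", [0, 1000, 2000, 5000]))
# Counter({'$1000-2000': 3, '$0-1000': 1, '$2000-5000': 1})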


The Interview Questions They’ll Ask

  1. “How would you implement faceted search efficiently?”
  2. “What’s the difference between doc values and the inverted index?”
  3. “How do you handle hierarchical facets?”
  4. “How do facet counts update when filters are applied?”
  5. “What’s the performance trade-off for real-time vs cached facets?”

Learning milestones:

  1. Facet counts are accurate → You understand doc values
  2. Filtering updates facets correctly → You understand filter interaction
  3. Performance is acceptable → You understand caching strategies

Project 14: Distributed Search Cluster

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Rust, Java
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Distributed Systems
  • Software or Tool: Distributed search engine
  • Main Book: “Designing Data-Intensive Applications” by Kleppmann

What you’ll build: A multi-node search cluster with sharding, replication, query routing, and result merging—a mini Elasticsearch.

Why it teaches full-text search: Single-node search is easy. Distributed search is HARD. This project teaches you the scatter-gather pattern, consistency trade-offs, and why Elasticsearch configuration is so complex.

Core challenges you’ll face:

  • Sharding data across nodes → maps to partition strategies
  • Replicating for fault tolerance → maps to consistency models
  • Routing queries to correct shards → maps to coordinator pattern
  • Merging ranked results from shards → maps to distributed top-K

Difficulty: Master Time estimate: 1 month+ Prerequisites: All previous projects, distributed systems experience


Real World Outcome

Example Output:

$ ./cluster-search start --nodes 3 --shards 6 --replicas 1

=== DISTRIBUTED SEARCH CLUSTER ===

Starting node-1 on :9200
Starting node-2 on :9201
Starting node-3 on :9202

Shard allocation:
  node-1: shard-0 (P), shard-1 (P), shard-3 (R), shard-5 (R)
  node-2: shard-2 (P), shard-3 (P), shard-0 (R), shard-4 (R)
  node-3: shard-4 (P), shard-5 (P), shard-1 (R), shard-2 (R)

Cluster health: GREEN (all shards allocated)

$ ./cluster-search index --bulk ./million_docs.json
Indexing 1,000,000 documents...
Routing to shards by hash(doc_id) % 6
  Shard 0: 166,823 docs
  Shard 1: 166,412 docs
  ...
Complete! 1M docs indexed in 45.3 seconds

$ ./cluster-search query "distributed systems" --explain

=== DISTRIBUTED QUERY ===

Coordinator: node-1
Query broadcast to: shard-0, shard-1, shard-2, shard-3, shard-4, shard-5

Shard responses:
  shard-0 (node-1): 234 hits, top score 12.4, took 23ms
  shard-1 (node-1): 189 hits, top score 11.2, took 21ms
  shard-2 (node-2): 201 hits, top score 13.1, took 25ms
  ...

Merging top-10 from all shards...
Final results: 1,247 total hits

$ ./cluster-search kill node-2
Node-2 killed!

$ ./cluster-search status
Cluster health: YELLOW
  shard-2: Promoting replica on node-3 to primary
  shard-3: Promoting replica on node-1 to primary

Rebalancing...
Cluster health: GREEN

The Core Question You’re Answering

“How does Elasticsearch search across billions of documents on hundreds of nodes in milliseconds?”

Sharding divides the data; replication provides fault tolerance; the scatter-gather pattern parallelizes search; and careful result merging produces globally correct rankings.
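
A minimal Python sketch of the gather step (shard contents and scores are illustrative): the coordinator collects each shard's local top-k and merges by score; the global top-k is correct because every shard returned at least k candidates.

# Minimal sketch: scatter-gather with distributed top-k merging.
import heapq

# Each "shard" returns its local top-k as (score, doc_id) pairs (illustrative data).
shard_results = {
    "shard-0": [(12.4, "doc-17"), (9.8, "doc-3"), (7.1, "doc-44")],
    "shard-1": [(11.2, "doc-90"), (10.5, "doc-12"), (6.0, "doc-8")],
    "shard-2": [(13.1, "doc-55"), (8.2, "doc-61"), (5.9, "doc-70")],
}

def scatter_gather(shard_results, k):
    """Merge each shard's local top-k into a global top-k by score."""
    all_hits = []
    for shard, hits in shard_results.items():        # the "scatter" already happened
        for score, doc_id in hits[:k]:               # each shard sends at most k hits
            all_hits.append((score, doc_id, shard))
    return heapq.nlargest(k, all_hits)               # "gather": global merge by score

for score, doc_id, shard in scatter_gather(shard_results, 3):
    print(f"{doc_id}  score={score}  (from {shard})")
# doc-55 (13.1), doc-17 (12.4), doc-90 (11.2)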


The Interview Questions They’ll Ask

  1. “How would you design a distributed search system?”
  2. “What happens when a node fails during a query?”
  3. “How do you ensure globally correct rankings across shards?”
  4. “What’s the trade-off between more shards and query latency?”
  5. “How does replication interact with consistency?”

Learning milestones:

  1. Multi-node queries work → You understand scatter-gather
  2. Node failure is handled → You understand replication
  3. Rankings are globally correct → You understand distributed top-K

Project 15: Hybrid Search (Keyword + Vector)

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 4: Expert
  • Knowledge Area: ML / Embeddings / Modern Search
  • Software or Tool: Hybrid search engine
  • Main Book: “AI Engineering” by Chip Huyen

What you’ll build: A search engine that combines traditional keyword search (BM25) with semantic vector search (embeddings), using reciprocal rank fusion to merge results—the future of search.

Why it teaches full-text search: Modern search isn’t just keywords OR vectors—it’s both. Understanding how to combine them teaches you the limits of each approach and prepares you for RAG (Retrieval-Augmented Generation) systems.

Core challenges you’ll face:

  • Generating embeddings → maps to understanding transformer models
  • Approximate nearest neighbor search → maps to HNSW, IVF indexes
  • Fusion of keyword and vector results → maps to RRF, score normalization
  • Balancing precision and recall → maps to hybrid tuning

Difficulty: Expert Time estimate: 2-3 weeks Prerequisites: Projects 1-4, basic ML understanding


Real World Outcome

Example Output:

$ ./hybrid-search index --docs ./knowledge_base.json --embed-model all-MiniLM-L6-v2

=== HYBRID INDEX BUILD ===

Indexing 50,000 documents...
Building BM25 inverted index...
Generating embeddings (384 dimensions)...
Building HNSW vector index...

Index complete!
  BM25 index: 12.3 MB
  Vector index: 78.4 MB (50K × 384 × 4 bytes + overhead)

$ ./hybrid-search query "how to deploy machine learning models to production"

=== HYBRID SEARCH ===

BM25 Results (keyword match):
  1. "Deploying ML Models with Docker" (BM25: 18.4)
  2. "Production Machine Learning" (BM25: 16.2)
  3. "Model Deployment Guide" (BM25: 15.1)

Vector Results (semantic similarity):
  1. "Getting Your Models Ready for Real Users" (sim: 0.87)
     └── Note: No keyword "deploy" but semantically relevant!
  2. "From Jupyter to Production" (sim: 0.84)
  3. "MLOps Best Practices" (sim: 0.82)

Reciprocal Rank Fusion (k=60):
  1. "Deploying ML Models with Docker" (RRF: 0.032)
  2. "Getting Your Models Ready for Real Users" (RRF: 0.029)
  3. "Production Machine Learning" (RRF: 0.028)

Vector search found documents that keyword search missed!
Keyword search provided precision for exact terms!

The Core Question You’re Answering

“Why do modern search engines combine keywords and embeddings, and how do you merge their results?”

Keywords are precise but miss synonyms and paraphrases. Vectors capture semantics but struggle with exact matches. Combining them gives you the best of both worlds—and is essential for RAG systems.
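
A minimal Python sketch of reciprocal rank fusion, using the k=60 convention from the example above (document names are illustrative): each result list contributes 1/(k + rank) per document, and documents are re-sorted by the summed score.

# Minimal sketch: reciprocal rank fusion (RRF) of keyword and vector rankings.
# score(d) = sum over result lists of 1 / (k + rank_of_d_in_that_list)

def rrf(rankings, k=60):
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

bm25_ranking = ["deploy-docker", "production-ml", "deployment-guide"]
vector_ranking = ["ready-for-users", "jupyter-to-prod", "deploy-docker"]

for doc, score in rrf([bm25_ranking, vector_ranking]):
    print(f"{doc}: {score:.4f}")
# deploy-docker tops the list: rank 1 in BM25 plus rank 3 in the vector list.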


The Interview Questions They’ll Ask

  1. “What are the pros and cons of keyword vs vector search?”
  2. “How does reciprocal rank fusion work?”
  3. “What embedding model would you use and why?”
  4. “How do you index vectors efficiently (HNSW, IVF)?”
  5. “How would you tune the balance between keyword and vector results?”

Learning milestones:

  1. Both search types work independently → You understand each approach
  2. Fusion produces better results than either alone → You understand complementarity
  3. You can tune the hybrid balance → You’re ready for production RAG

Project Comparison Table

# Project Difficulty Time Depth of Understanding Fun Factor
1 Basic Inverted Index Beginner Weekend ★★★★★ ★★★☆☆
2 Text Analyzer Intermediate 1 week ★★★★☆ ★★★☆☆
3 TF-IDF Ranking Intermediate 1 week ★★★★★ ★★★★☆
4 BM25 Scoring Advanced 1-2 weeks ★★★★★ ★★★★☆
5 Positional Index Advanced 1-2 weeks ★★★★☆ ★★★★☆
6 Index Compression Expert 2 weeks ★★★★★ ★★★☆☆
7 Fuzzy Search Expert 2-3 weeks ★★★★★ ★★★★★
8 Trie/FST Expert 2-3 weeks ★★★★★ ★★★★☆
9 Segment Merging Expert 3-4 weeks ★★★★★ ★★★☆☆
10 PostgreSQL FTS Intermediate 1 week ★★★☆☆ ★★★★☆
11 Query Parser Advanced 2 weeks ★★★★☆ ★★★★☆
12 Autocomplete Intermediate 1 week ★★★☆☆ ★★★★★
13 Faceted Search Advanced 2 weeks ★★★★☆ ★★★★☆
14 Distributed Cluster Master 1 month+ ★★★★★ ★★★★★
15 Hybrid Search Expert 2-3 weeks ★★★★★ ★★★★★

Recommendation

For beginners: Start with Project 1 (Basic Inverted Index), then Project 3 (TF-IDF), then Project 10 (PostgreSQL FTS). This gives you the core concepts plus practical database integration.

For intermediate developers: Do Projects 1-4 in order, then jump to Project 7 (Fuzzy Search) or Project 12 (Autocomplete) for visible, fun results.

For advanced/systems engineers: Focus on Projects 6, 8, 9, 14—the low-level data structures and distributed systems that power production search engines.

For ML/AI engineers: Start with Projects 1-4 for foundations, then jump to Project 15 (Hybrid Search) to understand how traditional search integrates with embeddings for RAG.


Final Overall Project: Build a Mini Meilisearch

  • File: LEARN_FULL_TEXT_SEARCH_DEEP_DIVE.md
  • Main Programming Language: Rust
  • Alternative Programming Languages: Go, C++
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Full-Stack Search Engine
  • Software or Tool: Complete search engine
  • Main Book: All books from this learning path

What you’ll build: A complete, production-quality search engine with: inverted index with compression, BM25 ranking, typo tolerance, faceted search, real-time indexing with segments, a query parser, and an HTTP API—inspired by Meilisearch.

Why this is the capstone project: This combines EVERYTHING from the previous projects into a single, cohesive system. You’ll make architectural decisions about trade-offs, experience the joy of seeing it all work together, and have a project that demonstrates deep search expertise.

Core challenges you’ll face:

  • Integrating all components → maps to system design and API boundaries
  • Achieving sub-50ms query latency → maps to performance optimization
  • Handling concurrent reads and writes → maps to production systems
  • Building a usable HTTP API → maps to developer experience

What the final product looks like:

$ ./minimeili start --data-dir ./index --port 7700
🔎 MiniMeili starting...
   Index loaded: 0 documents
   HTTP API listening on :7700

$ curl -X POST localhost:7700/indexes/products/documents \
    -H "Content-Type: application/json" \
    -d '[{"id":1,"name":"iPhone 15 Pro","price":999},...]'

{"taskId": 1, "status": "enqueued"}

$ curl "localhost:7700/indexes/products/search?q=iphne&limit=10"

{
  "hits": [
    {
      "id": 1,
      "name": "iPhone 15 Pro",
      "price": 999,
      "_matchesInfo": {
        "name": [{"start": 0, "length": 6}]
      }
    }
  ],
  "query": "iphne",
  "processingTimeMs": 12,
  "estimatedTotalHits": 47,
  "facets": {
    "brand": {"Apple": 23, "Samsung": 12},
    "priceRange": {"$0-500": 15, "$500-1000": 32}
  }
}

Time estimate: 2-3 months Prerequisites: Complete at least projects 1-8


Summary

This learning path covers Full-Text Search through 15 hands-on projects plus one capstone. Here’s the complete list:

# Project Name Main Language Difficulty Time Estimate
1 Basic Inverted Index Rust Beginner Weekend
2 Text Analyzer with Stemming Python Intermediate 1 week
3 TF-IDF Ranking Engine Python Intermediate 1 week
4 BM25 Scoring Implementation Rust Advanced 1-2 weeks
5 Positional Index and Phrase Queries C Advanced 1-2 weeks
6 Index Compression C Expert 2 weeks
7 Fuzzy Search with Levenshtein Automata Rust Expert 2-3 weeks
8 Build a Trie and FST from Scratch Rust Expert 2-3 weeks
9 Segment-Based Index with Merging Rust Expert 3-4 weeks
10 PostgreSQL Full-Text Search SQL/Python Intermediate 1 week
11 Query Parser with Boolean Logic Rust Advanced 2 weeks
12 Autocomplete with Prefix Matching TypeScript Intermediate 1 week
13 Faceted Search and Aggregations Go Advanced 2 weeks
14 Distributed Search Cluster Go Master 1 month+
15 Hybrid Search (Keyword + Vector) Python Expert 2-3 weeks
Mini Meilisearch (Capstone) Rust Master 2-3 months

For beginners: Start with projects #1, #2, #3, #10
For intermediate: Focus on projects #1, #3, #4, #7, #12
For advanced: Deep dive into projects #6, #8, #9, #14
For ML/AI focus: Complete #1-4, then #15

Expected Outcomes

After completing these projects, you will:

  1. Understand inverted indexes deeply—the data structure at the heart of every search engine
  2. Know exactly how ranking algorithms work—TF-IDF, BM25, and why they matter
  3. Be able to read Lucene/Elasticsearch source code—because you’ve built the components yourself
  4. Make informed architecture decisions—PostgreSQL FTS vs Elasticsearch vs custom solution
  5. Implement production-ready search features—fuzzy search, autocomplete, facets, highlighting
  6. Understand distributed search—sharding, replication, consistency trade-offs
  7. Bridge keyword and semantic search—prepare for RAG and modern AI applications

You’ll have built 16 working projects that demonstrate deep understanding of full-text search from first principles to distributed production systems.


Resources

Essential Books

  • “Introduction to Information Retrieval” by Manning, Raghavan, Schütze — Free online
  • “Search Engines: Information Retrieval in Practice” by Croft, Metzler, Strohman — Free PDF
  • “Information Retrieval: Implementing and Evaluating Search Engines” by Büttcher, Clarke, Cormack
  • “Managing Gigabytes” by Witten, Moffat, Bell
  • “Designing Data-Intensive Applications” by Martin Kleppmann

Key Papers

  • Porter, M.F. (1980): “An algorithm for suffix stripping”
  • Robertson, S.E. & Walker, S. (1994): “Some simple effective approximations to the 2-Poisson model” (BM25)
  • Schulz, K.U. & Mihov, S. (2002): “Fast string correction with Levenshtein automata”
  • Lemire, D. et al. (2016): “Roaring bitmaps”

Blogs & Online Resources

  • Mike McCandless: “Using FSTs in Lucene” blog series
  • Lucene and Elasticsearch reference documentation (segments, merge policies, edge n-grams)
  • Meilisearch documentation and engine blog

Happy searching! 🔎