
Learn Information Retrieval: From Zero to Search Engine Master

Goal: Deeply understand the mathematical and algorithmic foundations of Information Retrieval (IR). You will move from treating search as a “grep” operation to understanding it as a probabilistic and vector-space challenge. By the end, you’ll be able to build high-performance search engines that handle millions of documents with sub-second relevance scoring using TF-IDF, BM25, and advanced indexing techniques.


Why Information Retrieval Matters

In the early 1990s, the internet was a chaotic library without a card catalog. Information Retrieval turned that chaos into the modern web. It is the science of finding “needles in haystacks” by transforming raw, messy human language into structured mathematical models.

Understanding IR matters because:

  • Scale: Linear scanning (grep) fails at the scale of the web. IR provides the sub-linear lookup structures (Inverted Indexes) that make Google-speed possible.
  • Relevance: In a world of infinite data, “finding” isn’t enough; “ranking” is everything. IR teaches you the math of why one document is “better” than another for a specific user.
  • Foundation of AI: Modern Large Language Models (LLMs) often use “Retrieval Augmented Generation” (RAG). You cannot master modern AI without mastering the “R” in RAG.

Core Concept Analysis

1. The Inverted Index

The inverted index is the fundamental data structure of IR. Instead of mapping Documents → Words, we map Words → Documents.

DOCUMENTS:
Doc 1: "The cat sat"
Doc 2: "The dog sat"

INVERTED INDEX:
"cat"  -> [1]
"dog"  -> [2]
"sat"  -> [1, 2]
"the"  -> [1, 2]

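A minimal sketch of how such an index might be built in Python (the dictionary layout and the `build_index` helper are illustrative, not a prescribed design):

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {1: "The cat sat", 2: "The dog sat"}
print(build_index(docs))
# {'the': [1, 2], 'cat': [1], 'sat': [1, 2], 'dog': [2]}
```
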
2. The Retrieval Pipeline

Transforming a user’s query into a ranked list of results.

[ User Query ] 
      │
      ▼
[ Tokenization ] (Break into words)
      │
      ▼
[ Normalization ] (Lowercasing, Stemming: "running" -> "run")
      │
      ▼
[ Index Lookup ] (Find candidate documents)
      │
      ▼
[ Scoring/Ranking ] (TF-IDF or BM25 math)
      │
      ▼
[ Sorted Results ]
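
The same pipeline, compressed into a few lines of Python. The `score` callback and the toy `normalize` step are placeholders for the stemming rules and ranking math covered below:

```python
def tokenize(text):
    return text.lower().split()                     # Tokenization + lowercasing

def normalize(token):
    # Placeholder: a real engine applies Porter/Snowball stemming here ("running" -> "run").
    return token

def search(query, index, score):
    terms = [normalize(t) for t in tokenize(query)]
    candidates = set().union(*(index.get(t, []) for t in terms))        # Index lookup
    return sorted(candidates, key=lambda doc: score(doc, terms), reverse=True)  # Scoring/Ranking
```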

3. Mathematical Relevance Scoring

Why does a document rank high?

  • TF (Term Frequency): How often a word appears in a document. (More = better).
  • IDF (Inverse Document Frequency): How rare a word is across the whole collection. (Rarer = more informative).
  • BM25 (Best Matching 25): The industry standard. It improves on TF-IDF by adding “saturation” (the 100th mention of ‘cat’ isn’t 100x better than the 1st) and “length normalization” (short documents shouldn’t be penalized).
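
A tiny numeric illustration of the first two quantities (the collection size and document frequencies are made up for the example):

```python
import math

N = 10_000                       # documents in the collection

# IDF: rare terms are informative, ubiquitous ones are not.
print(math.log(N / 5))           # a term in only 5 docs   -> ~7.6
print(math.log(N / 9_000))       # a term in 9,000 docs    -> ~0.1

# Log-scaled TF: the 50th occurrence adds far less than the 2nd.
for tf in (1, 2, 10, 50):
    print(tf, round(1 + math.log(tf), 2))
```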

Concept Summary Table

| Concept Cluster | What You Need to Internalize |
|---|---|
| Inverted Indexing | How to build and compress a mapping of terms to document IDs for sub-millisecond retrieval. |
| Linguistic Processing | Tokenization, stopword removal, and stemming (Porter/Snowball) to normalize language. |
| Vector Space Model | Treating documents and queries as vectors in a multi-dimensional space to calculate similarity. |
| Probabilistic Ranking | Using BM25 math to handle term saturation and document length normalization. |
| Evaluation Metrics | How to measure success using Precision, Recall, MAP, and NDCG. |

Deep Dive Reading by Concept

Foundational Theory

| Concept | Book & Chapter |
|---|---|
| Inverted Indexes | Introduction to Information Retrieval by Manning et al. — Ch. 1: “Boolean retrieval” (introduces the inverted index) |
| Scoring & TF-IDF | Introduction to Information Retrieval by Manning et al. — Ch. 6: “Scoring, term weighting and the vector space model” |
| Probabilistic IR (BM25) | Information Retrieval: Implementing and Evaluating Search Engines by Büttcher — Ch. 3: “Statistical characteristics of text” |

Engineering & Implementation

| Concept | Book & Chapter |
|---|---|
| Index Compression | Introduction to Information Retrieval by Manning et al. — Ch. 5: “Index compression” |
| Evaluation | Introduction to Information Retrieval by Manning et al. — Ch. 8: “Evaluation in information retrieval” |

Essential Reading Order

  1. The Core Mechanism (Week 1):
    • Manning Ch. 1 (Inverted Index)
    • Manning Ch. 2 (The Term Vocabulary)
  2. The Math of Ranking (Week 2):

Project 4: The TF-IDF Scorer (Weighting Relevance)

  • File: LEARN_INFORMATION_RETRIEVAL_DEEP_DIVE.md
  • Main Programming Language: Python (with NumPy) or C++
  • Alternative Programming Languages: Rust, Scala
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Mathematics / Statistics
  • Software or Tool: Logarithms, Vector Math
  • Main Book: “Introduction to Information Retrieval” by Manning et al.

What you’ll build: A scoring engine that calculates the “Term Frequency - Inverse Document Frequency” for every word in every document. This replaces “binary” matching (word is there or not) with a floating-point “importance” score.

Why it teaches IR: This is the gateway to Ranking. You’ll understand why common words like “the” carry zero weight, while rare words like “photosynthesis” carry high weight.

Core challenges you’ll face:

  • Logarithmic Scaling → maps to Why we use log(TF) instead of raw TF (diminishing returns).
  • IDF Calculation → maps to Keeping track of ‘Document Frequency’ (how many docs have this word) globally.
  • Division by Zero → maps to Handling words that appear in zero documents (smoothing).

Key Concepts:

  • TF-IDF Weighting: “Introduction to Information Retrieval” Ch. 6.2
  • Inverse Document Frequency: “Introduction to Information Retrieval” Ch. 6.2.1

Difficulty: Intermediate. Time estimate: 1 week. Prerequisites: Basic algebra (logarithms); Project 2 (Inverted Index) completed.


Real World Outcome

A search tool that ranks “Doc A” higher than “Doc B” because the query term is more central to “Doc A”.

Example Output:

$ ./search "climate change"
1. doc_12.txt (Score: 0.85) - "The impact of climate change on..."
2. doc_05.txt (Score: 0.42) - "Weather is different from climate..."
3. doc_88.txt (Score: 0.05) - "I decided to change my clothes..."

The Core Question You’re Answering

“How do we mathematically define what a document is ‘about’?”

If a document mentions “apple” 50 times, and the whole collection has 10,000 documents but only 5 mention “apple,” that word is a powerful signal. If 9,000 docs mention “apple,” it’s just noise.


Hints in Layers

Hint 1 (Storage): Your Inverted Index from Project 2 needs to change. Instead of just storing [DocID1, DocID2], it must now store [(DocID1, count), (DocID2, count)].

Hint 2 (IDF Global State): You need to know N (total number of documents) and df (number of documents containing the term) to calculate log(N/df).

Hint 3 (The Formula): Score = (1 + log(tf)) * log(N/df).
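
A minimal scorer that puts the three hints together, assuming postings of the form term -> [(doc_id, count), ...] (the function name and data layout are illustrative):

```python
import math
from collections import defaultdict

def score_query(query_terms, postings, N):
    """Accumulate (1 + log tf) * log(N / df) per document for every query term."""
    scores = defaultdict(float)
    for term in query_terms:
        plist = postings.get(term, [])
        if not plist:
            continue                      # df = 0: skip the term instead of dividing by zero
        idf = math.log(N / len(plist))
        for doc_id, tf in plist:
            scores[doc_id] += (1 + math.log(tf)) * idf
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

postings = {"climate": [(12, 4), (5, 1)], "change": [(12, 2), (88, 1)]}
print(score_query(["climate", "change"], postings, N=100))
```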


Project 5: The Vector Space Model & Cosine Similarity

  • File: LEARN_INFORMATION_RETRIEVAL_DEEP_DIVE.md
  • Main Programming Language: Python (NumPy) or C++
  • Alternative Programming Languages: Rust, Julia
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Linear Algebra
  • Software or Tool: Dot Products, Vector Normalization
  • Main Book: “Introduction to Information Retrieval” by Manning et al.

What you’ll build: A system that treats every document and every query as a vector in a massive multi-dimensional space (one dimension per unique word). You will calculate the “angle” between the query vector and document vectors to find matches.

Why it teaches IR: It introduces Geometric Interpretation. You’ll understand why long documents shouldn’t rank higher just because they have more words (Length Normalization).

Core challenges you’ll face:

  • High Dimensionality → maps to Handling vectors with 50,000+ dimensions efficiently (Sparse Vectors).
  • Cosine Similarity Math → maps to Implementing the dot product and magnitude calculations.
  • Normalization → maps to Ensuring document length doesn’t bias the score.

Key Concepts:

  • The Vector Space Model: “Introduction to Information Retrieval” Ch. 6.3
  • Cosine Similarity: “Introduction to Information Retrieval” Ch. 6.3.1

Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: Understanding of vectors and dot products.


Thinking Exercise

The Long Document Bias

Imagine Doc A is 10 words long and mentions “cats” twice. Doc B is 10,000 words long and mentions “cats” 20 times.

  1. Which document is “more” about cats?
  2. If you only use raw Term Frequency, Doc B wins.
  3. How does dividing by the vector’s “magnitude” (L2 norm) fix this?
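
One way to convince yourself, sketched with sparse dictionary vectors. The weights below are toy numbers in the spirit of the exercise, not real TF-IDF values:

```python
import math

def cosine(query_vec, doc_vec):
    """Dot product divided by the product of L2 norms (length normalization)."""
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    norm = (math.sqrt(sum(w * w for w in query_vec.values()))
            * math.sqrt(sum(w * w for w in doc_vec.values())))
    return dot / norm if norm else 0.0

query = {"cats": 1.0}
doc_a = {"cats": 2.0, "sat": 1.0}              # short document, mostly about cats
doc_b = {"cats": 20.0, "other": 150.0}         # long document, cats are a small fraction
print(cosine(query, doc_a))                    # ~0.89
print(cosine(query, doc_b))                    # ~0.13, despite 10x more raw mentions
```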

Project 6: The BM25 Ranker (Industry Standard)

  • File: LEARN_INFORMATION_RETRIEVAL_DEEP_DIVE.md
  • Main Programming Language: Python or C++
  • Alternative Programming Languages: Rust, Go
  • Coolness Level: Level 5: Pure Magic
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Probabilistic Modeling
  • Software or Tool: Hyperparameter Tuning (k1, b)
  • Main Book: “Information Retrieval: Implementing and Evaluating Search Engines” by Büttcher

What you’ll build: An implementation of Okapi BM25, the ranking function used by Lucene, Elasticsearch, and Solr. It’s a more sophisticated version of TF-IDF that handles “term saturation.”

Why it teaches IR: It moves you from simple heuristics to a Probabilistic Framework. You’ll learn why the 100th occurrence of a word adds less value than the 2nd (Saturation) and how to tune the engine for different types of data.
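
A sketch of the per-document scoring loop under the usual Okapi BM25 formulation (k1 = 1.2 and b = 0.75 are conventional defaults; the data layout is an assumption for illustration):

```python
import math

def bm25_score(query_terms, doc_id, postings, doc_len, avgdl, N, k1=1.2, b=0.75):
    """postings: term -> {doc_id: tf};  doc_len: doc_id -> length in tokens."""
    score = 0.0
    for term in query_terms:
        plist = postings.get(term, {})
        tf = plist.get(doc_id, 0)
        if tf == 0:
            continue
        df = len(plist)
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)     # non-negative IDF variant
        norm = k1 * (1 - b + b * doc_len[doc_id] / avgdl)   # length normalization
        score += idf * tf * (k1 + 1) / (tf + norm)          # saturating TF component
    return score
```

Raising k1 lets repeated occurrences keep adding value for longer; setting b to 0 switches off length normalization entirely.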

Core challenges you’ll face:

  • Calculating Average Document Length → maps to Required for the ‘b’ parameter in length normalization.
  • Parameter Tuning → maps to Understanding what ‘k1’ (saturation) and ‘b’ (length penalty) actually do.
  • Efficiency → maps to Pre-calculating components of the formula to speed up query time.

Key Concepts:

  • BM25 Formula: “Information Retrieval: Implementing and Evaluating Search Engines” Ch. 3.4
  • Term Saturation: Understanding why TF doesn’t grow linearly.

Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: Project 4 (TF-IDF) completed.


Real World Outcome

You will have a ranking engine that produces results nearly identical to professional-grade search engines like Elasticsearch.

Example Output:

# TF-IDF vs BM25 comparison
$ ./compare --query "space exploration"
[TF-IDF Result]: Doc #42 (Score: 12.5)
[BM25 Result]:   Doc #10 (Score: 18.2) # Better relevance due to length normalization!

The Interview Questions They’ll Ask

  1. “Why is BM25 generally preferred over basic TF-IDF?”
  2. “What does the ‘b’ parameter in BM25 control, and when would you set it to 0?”
  3. “How does BM25 handle the ‘stopword’ problem without an explicit list?”

Project 7: The Positional Index (Phrase & Proximity Queries)

  • File: LEARN_INFORMATION_RETRIEVAL_DEEP_DIVE.md
  • Main Programming Language: C++ or Rust
  • Alternative Programming Languages: Java, Python
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Algorithm Design
  • Software or Tool: Proximity Algorithms
  • Main Book: “Introduction to Information Retrieval” by Manning et al.

What you’ll build: An upgraded Inverted Index that stores the exact position of every word in every document. This allows you to support “Phrase Queries” (e.g., searching for “to be or not to be” exactly) and proximity searches (e.g., “cats” within 5 words of “dogs”).
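
A minimal sketch of a two-term phrase check over positional postings, assuming a term -> {doc_id: [positions]} layout (real engines merge sorted position lists instead of materializing sets):

```python
def phrase_match(index, first, second):
    """Return doc IDs where `second` occurs at the position right after `first`."""
    hits = []
    for doc_id, positions in index.get(first, {}).items():
        following = set(index.get(second, {}).get(doc_id, []))
        if any(pos + 1 in following for pos in positions):
            hits.append(doc_id)
    return hits

index = {"to": {1: [0, 3]}, "be": {1: [1, 4]}}     # term -> {doc_id: [positions]}
print(phrase_match(index, "to", "be"))             # [1]
```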

Why it teaches IR: It shows the Complexity of Proximity. You’ll realize that ranking isn’t just about “what” words are there, but “where” they are relative to each other.

Core challenges you’ll face:

  • Exploding Index Size → maps to Storing positions can double or triple your index size.
  • Linear Merge with Positions → maps to Efficiently checking if Term B appears at Position X+1 when Term A is at Position X.
  • Performance → maps to How to skip documents that have both words but not in the right order.

Key Concepts:

  • Positional Indexes: “Introduction to Information Retrieval” Ch. 2.4.2
  • Biword Indexes: An alternative approach (Ch. 2.4.1).

Difficulty: Advanced. Time estimate: 1-2 weeks. Prerequisites: Project 2 (Inverted Index) completed.


Real World Outcome

A search engine that handles quotes and proximity constraints.

Example Output:

$ ./search '"to be or not to be"'
1. hamlet.txt (Line 54)
$ ./search 'cats NEAR/5 dogs'
1. pets_manual.txt ("...when cats are introduced to dogs...")

Questions to Guide Your Design

  1. Storage Format
    • In your postings list, will you store positions as [DocID, [pos1, pos2, ...]]?
    • How does this affect the memory layout for cache efficiency?
  2. The “Next” Algorithm
    • Can you implement an algorithm that finds the next occurrence of a phrase without decoding the entire positional list?

Project 8: Index Compression (Variable Byte Encoding)

  • File: LEARN_INFORMATION_RETRIEVAL_DEEP_DIVE.md
  • Main Programming Language: C or Rust
  • Alternative Programming Languages: C++, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Systems Programming / Bit Manipulation
  • Software or Tool: Bitwise Ops, Gap Encoding
  • Main Book: “Introduction to Information Retrieval” by Manning et al.

What you’ll build: A compression layer for your index. Instead of storing 4-byte integers for Document IDs, you’ll store “Gaps” between DocIDs and use Variable Byte (VB) encoding or Elias Gamma coding to minimize space.
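
A sketch of gap encoding followed by Variable Byte encoding in the style of IIR Ch. 5.3.1 (7 payload bits per byte, with the high bit marking the final byte of each number):

```python
def vbyte_encode(numbers):
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk.reverse()
        chunk[-1] |= 0x80                 # high bit marks the last byte of a number
        out.extend(chunk)
    return bytes(out)

def vbyte_decode(data):
    numbers, n = [], 0
    for byte in data:
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:
            numbers.append(n)
            n = 0
    return numbers

doc_ids = [100, 105, 110]
gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]   # [100, 5, 5]
encoded = vbyte_encode(gaps)
print(len(encoded), vbyte_decode(encoded))   # 3 bytes, decodes back to [100, 5, 5]
```

Here the three gaps take 3 bytes instead of the 12 bytes needed for three fixed-width 32-bit DocIDs.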

Why it teaches IR: It teaches Resource Constraints. Real search engines store billions of documents; if you can’t compress the index, you can’t afford to run the search engine.

Core challenges you’ll face:

  • Gap Encoding → maps to Converting [100, 105, 110] into [100, 5, 5].
  • Bit-Level Manipulations → maps to Writing code that operates on variable bit lengths.
  • Decompression Speed → maps to Ensuring that the time saved on Disk I/O isn’t lost to CPU overhead during decompression.

Key Concepts:

  • Index Compression: “Introduction to Information Retrieval” Ch. 5
  • Variable Byte Codes: “Introduction to Information Retrieval” Ch. 5.3.1
  • Elias Gamma Coding: “Introduction to Information Retrieval” Ch. 5.3.2

Difficulty: Expert. Time estimate: 2 weeks. Prerequisites: Strong bitwise manipulation skills; Project 2 completed.


Thinking Exercise

The Gap Advantage

If you have DocIDs [1,000,000, 1,000,005, 1,000,010], they all require 32 bits (4 bytes) to store. If you store the Gaps: [1,000,000, 5, 5]:

  • The first gap (1,000,000) still needs several bytes to encode.
  • The next two gaps (5) carry only about 3 bits of information each and fit in a single byte with Variable Byte encoding.

Question: How much space do you save on a postings list of 1 million consecutive documents?

Project 9: The Evaluation Workbench (IR Metrics)

  • File: LEARN_INFORMATION_RETRIEVAL_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: R, Julia
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Data Science / Analytics
  • Software or Tool: Precision/Recall Curves
  • Main Book: “Introduction to Information Retrieval” by Manning et al.

What you’ll build: A tool that takes a “Ground Truth” (a list of queries and which documents should be returned) and evaluates your search engine’s performance using Precision, Recall, F-measure, and NDCG (Normalized Discounted Cumulative Gain).
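
A sketch of two of those metrics, Precision@k and NDCG@k, against graded judgments (the dictionary format for the ground truth is an assumption):

```python
import math

def precision_at_k(ranked, relevant, k):
    return sum(1 for d in ranked[:k] if d in relevant) / k

def ndcg_at_k(ranked, relevance, k):
    """relevance: doc_id -> graded judgment (e.g. 0-5); unjudged docs count as 0."""
    dcg = sum(relevance.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked[:k]))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

ranked = ["doc_12", "doc_05", "doc_88"]
relevance = {"doc_12": 5, "doc_88": 3}                 # graded ground truth for one query
print(precision_at_k(ranked, set(relevance), k=3))     # 0.67: 2 of the top 3 are relevant
print(ndcg_at_k(ranked, relevance, k=3))               # < 1.0: the grade-3 doc is ranked too low
```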

Why it teaches IR: You’ll learn that You can’t improve what you don’t measure. You’ll see how changing a stemming rule or a BM25 parameter actually affects the user experience.

Core challenges you’ll face:

  • Calculating NDCG → maps to Handling ‘graded’ relevance (e.g., Doc A is 3/5 relevant, Doc B is 5/5).
  • Recall at K → maps to Measuring how many relevant docs are found in the top 10 results.
  • Automation → maps to Running hundreds of queries and aggregating the stats.

Key Concepts:

  • Evaluation in IR: “Introduction to Information Retrieval” Ch. 8
  • NDCG: “Introduction to Information Retrieval” Ch. 8.4

Difficulty: Intermediate. Time estimate: 1 week. Prerequisites: A working search engine (from Projects 1-6).


The Interview Questions They’ll Ask

  1. “What is the difference between Precision and Recall? Can you have 100% of both?”
  2. “Why is NDCG better than simple Precision for ranked results?”
  3. “Explain the ‘Cranfield Paradigm’ in IR evaluation.”

Project 10: Query Expansion (The Synonym Engine)

  • File: LEARN_INFORMATION_RETRIEVAL_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Go, Java
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Knowledge Graphs / Semantics
  • Software or Tool: WordNet / Word2Vec
  • Main Book: “Introduction to Information Retrieval” by Manning et al.

What you’ll build: A module that expands a user’s query with related terms. If a user searches for “automobile,” the engine automatically includes “car” and “vehicle” in the search.
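
A minimal sketch of thesaurus-based expansion where expanded terms are deliberately downweighted relative to the original query (the synonym table and the 0.4 factor are illustrative choices):

```python
SYNONYMS = {"automobile": ["car", "vehicle"]}     # hand-written thesaurus for illustration

def expand(query_terms, weight=0.4):
    """Return (term, weight) pairs: original terms at 1.0, expansions downweighted."""
    weighted = [(term, 1.0) for term in query_terms]
    for term in query_terms:
        weighted += [(syn, weight) for syn in SYNONYMS.get(term, [])]
    return weighted

print(expand(["automobile"]))
# [('automobile', 1.0), ('car', 0.4), ('vehicle', 0.4)]
```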

Why it teaches IR: It addresses the Vocabulary Mismatch Problem. You’ll understand that users often don’t know the exact words used in the documents they want to find.

Core challenges you’ll face:

  • Drift → maps to Ensuring that expanding ‘Apple’ doesn’t accidentally bring in results for ‘Fruit’ when the user meant ‘iPhone’.
  • Relevance Weighting → maps to Weighting the expanded terms lower than the original query terms.
  • Performance → maps to How to expand the query without making the search 10x slower.

Key Concepts:

  • Query Expansion: “Introduction to Information Retrieval” Ch. 9.2.2
  • Thesaurus-based Expansion: Ch. 9.2.1

Difficulty: Intermediate. Time estimate: 1 week. Prerequisites: Project 4 (TF-IDF) completed.


Real World Outcome

A search engine that “understands” intent better than simple keyword matching.

Example Output:

$ ./search "heart attack" --expand
Expanded query: (heart AND attack) OR (myocardial AND infarction)
1. health_report_2024.pdf (Score: 0.92) - "Analysis of myocardial infarction rates..."

The Core Question You’re Answering

“How do we find documents that are relevant but don’t contain the user’s exact keywords?”

This is the bridge between “Lexical Search” (words) and “Semantic Search” (meanings).


Project 11: Fielded Search (Multi-Field Indexing)

  • File: LEARN_INFORMATION_RETRIEVAL_DEEP_DIVE.md
  • Main Programming Language: C++ or Python
  • Alternative Programming Languages: Rust, Java
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Software Architecture
  • Software or Tool: Schema Definition
  • Main Book: “Information Retrieval: Implementing and Evaluating Search Engines” by Büttcher

What you’ll build: A search engine that understands “Fields” (e.g., Title, Body, Author, Date). It allows users to search specifically in one field or weight matches in the Title higher than matches in the Body.

Why it teaches IR: It teaches Structured Data in Unstructured Text. You’ll learn how to manage multiple inverted indexes (or one complex one) and how to aggregate scores across different fields (e.g., BM25F).
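
A simplified sketch of the BM25F idea: combine per-field term frequencies with field weights first, then apply a single saturation step (the weights and data layout are illustrative; full BM25F also normalizes each field's length):

```python
FIELD_WEIGHTS = {"title": 5.0, "body": 1.0}

def weighted_tf(term, doc_fields):
    """doc_fields: field name -> list of tokens in that field."""
    return sum(FIELD_WEIGHTS.get(field, 1.0) * tokens.count(term)
               for field, tokens in doc_fields.items())

def bm25f_like(term, doc_fields, idf, k1=1.2):
    tf = weighted_tf(term, doc_fields)        # fields are merged before saturation
    return idf * tf / (tf + k1)

doc = {"title": ["climate", "report"], "body": ["carbon", "and", "climate", "policy"]}
print(bm25f_like("climate", doc, idf=2.0))    # the title hit dominates the combined tf
```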

Core challenges you’ll face:

  • Schema Management → maps to How to define which fields are searchable and which are only ‘stored’.
  • Cross-Field Scoring → maps to Should a match in the Title be worth 5x a match in the Body?
  • Index Layout → maps to Storing field markers in the postings list.

Key Concepts:

  • Fielded Retrieval: “Introduction to Information Retrieval” Ch. 6.1
  • BM25F: An extension of BM25 for multiple fields.

Difficulty: Intermediate. Time estimate: 1 week. Prerequisites: Project 6 (BM25) completed.


Questions to Guide Your Design

  1. Scoring Strategy
    • If a word appears in both the Title and the Body, do you sum the scores?
    • What if the Title is only 5 words but the Body is 5000? How does length normalization work across fields?
  2. User Interface
    • How will the query language look? (e.g., title:climate body:carbon)

Project 12: Distributed Indexing (Sharding)

  • File: LEARN_INFORMATION_RETRIEVAL_DEEP_DIVE.md
  • Main Programming Language: Go or Rust
  • Alternative Programming Languages: C++, Java
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Distributed Systems
  • Software or Tool: RPC (gRPC), Sharding
  • Main Book: “Introduction to Information Retrieval” by Manning et al.

What you’ll build: A system that splits a massive index across multiple machines (Shards). A “Coordinator” node receives a query, sends it to all Shards, and merges the results into a single ranked list.

Why it teaches IR: It teaches Horizontal Scalability. You’ll understand the challenges of calculating global statistics (like IDF) when the data is spread across a cluster.
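
A local sketch of the coordinator's scatter-gather step; in a real system each shard call would be an RPC, and the merged scores are only comparable if every shard uses the same global statistics (e.g. IDF):

```python
import heapq

def search_shard(shard, query):
    # Stand-in for an RPC call: each shard returns its local top hits as (score, doc_id).
    return shard.get(query, [])

def coordinate(shards, query, k=10):
    merged = []
    for shard in shards:
        merged.extend(search_shard(shard, query))    # scatter
    return heapq.nlargest(k, merged)                 # gather: merge and re-rank by score

shards = [
    {"distributed systems": [(0.95, "doc_123")]},
    {"distributed systems": [(0.98, "doc_999"), (0.60, "doc_777")]},
]
print(coordinate(shards, "distributed systems", k=2))
# [(0.98, 'doc_999'), (0.95, 'doc_123')]
```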

Core challenges you’ll face:

  • Global IDF Consistency → maps to How does Shard A know the global frequency of a word if it only sees 1/10th of the data?
  • Network Latency → maps to Handling slow Shards (the ‘Tail Latency’ problem).
  • Result Merging → maps to Ensuring the top 10 results from Shard A are comparable to Shard B.

Key Concepts:

  • Distributed Indexing: “Introduction to Information Retrieval” Ch. 4.4
  • Distributed Query Processing: Ch. 20.2 (Web Search context)

Difficulty: Expert. Time estimate: 3-4 weeks. Prerequisites: Project 4 (TF-IDF) completed; basic knowledge of network programming (TCP/RPC).


Real World Outcome

A search cluster that can handle 10x more data than a single machine.

Example Output:

$ ./coordinator --query "distributed systems"
- Sending sub-queries to Node_1, Node_2, Node_3...
- Shard 1 returned 5 results
- Shard 2 returned 12 results
- Shard 3 returned 8 results
- Merging and Re-ranking...
1. doc_999 (Score: 0.98) - [From Node_2]
2. doc_123 (Score: 0.95) - [From Node_1]

Project Comparison Table

| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Linguistic Pipeline | Level 1 | Weekend | High (Foundational) | ★★★☆☆ |
| 2. Inverted Index | Level 2 | 1 Week | Critical (Core) | ★★★★☆ |
| 3. Porter Stemmer | Level 3 | 1-2 Weeks | Deep (Logic) | ★★☆☆☆ |
| 4. TF-IDF Scorer | Level 2 | 1 Week | High (Math) | ★★★★☆ |
| 5. Vector Space Model | Level 3 | 2 Weeks | Deep (Geometry) | ★★★★☆ |
| 6. BM25 Ranker | Level 3 | 1-2 Weeks | Master (Probabilistic) | ★★★★★ |
| 7. Positional Index | Level 3 | 2 Weeks | High (Algorithms) | ★★★★☆ |
| 8. Index Compression | Level 4 | 2 Weeks | Deep (Systems) | ★★★☆☆ |
| 9. Evaluation Workbench | Level 2 | 1 Week | Essential (Stats) | ★★★☆☆ |
| 10. Query Expansion | Level 2 | 1 Week | Medium (Semantics) | ★★★★☆ |
| 11. Fielded Search | Level 2 | 1 Week | Medium (Architecture) | ★★★☆☆ |
| 12. Distributed Indexing | Level 4 | 3-4 Weeks | Expert (Scaling) | ★★★★★ |

Recommendation

Where to start?

  • If you are a Math enthusiast: Start with Project 4 (TF-IDF) and Project 5 (Vector Space). Understanding the geometry of text will hook you.
  • If you are a Systems/C developer: Start with Project 2 (Inverted Index) and Project 8 (Compression). You’ll love the bit-packing and performance challenges.
  • For a standard path: Follow the numbers 1 through 6. This builds the fundamental “Relevance” engine.

Final Overall Project: “Arxiv-Search: A Specialized Academic Engine”

  • File: LEARN_INFORMATION_RETRIEVAL_DEEP_DIVE.md
  • Main Programming Language: Rust or Go
  • Goal: Apply everything you’ve learned to build a production-ready search engine for a specific dataset (e.g., all CS papers on ArXiv).

Key Features to Implement:

  1. Hybrid Indexing: Support both keyword search (BM25) and basic semantic search (via query expansion).
  2. Fielded Precision: Weight matches in the “Abstract” and “Title” differently.
  3. Optimized Storage: Use Gap-encoding and Variable Byte compression for the index.
  4. Interactive Evaluation: Build a small CLI that allows you to mark results as relevant/irrelevant and calculates your NDCG on the fly.
  5. Snippet Highlighting: Generate dynamic “summaries” of results showing where the query terms appeared in context.

Success Criteria: You can search 100,000 papers and get a ranked list of relevant results in under 50ms on your laptop.


Summary

This learning path covers Information Retrieval through 12 hands-on projects. Here’s the complete list:

| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | Linguistic Pipeline | Python | Level 1 | Weekend |
| 2 | Inverted Index | C++/Python | Level 2 | 1 Week |
| 3 | Porter Stemmer | Any | Level 3 | 1-2 Weeks |
| 4 | TF-IDF Scorer | Python/C++ | Level 2 | 1 Week |
| 5 | Vector Space Model | Python | Level 3 | 2 Weeks |
| 6 | BM25 Ranker | Python/C++ | Level 3 | 1-2 Weeks |
| 7 | Positional Index | C++/Rust | Level 3 | 2 Weeks |
| 8 | Index Compression | C/Rust | Level 4 | 2 Weeks |
| 9 | Evaluation Workbench | Python | Level 2 | 1 Week |
| 10 | Query Expansion | Python | Level 2 | 1 Week |
| 11 | Fielded Search | C++/Python | Level 2 | 1 Week |
| 12 | Distributed Indexing | Go/Rust | Level 4 | 3-4 Weeks |

For beginners: Start with projects #1, #2, #4. For intermediate: Focus on #5, #6, #9. For advanced: Tackle #7, #8, #12.

Expected Outcomes

After completing these projects, you will:

  • Understand the exact binary structure of high-performance search indexes.
  • Be able to implement industry-standard ranking formulas (BM25) from scratch.
  • Master the linguistic nuances of tokenization and stemming.
  • Navigate the trade-offs between index size, search speed, and relevance.
  • Be capable of evaluating and debugging search quality using data-driven metrics.

You’ll have built 12 working projects that demonstrate deep understanding of Information Retrieval from first principles.

The Interview Questions They’ll Ask (Project 12: Distributed Indexing)

  1. “How do you handle a shard failure in a production search environment?”
  2. “What is the difference between Document Partitioning and Term Partitioning in a distributed index?”