NLP with Python and spaCy: From Text to Intelligence
Goal: Master Natural Language Processing from the ground up using Python and spaCy. You will progress from understanding how computers break text into meaningful units to building production-ready NLP pipelines that extract entities, classify documents, answer questions, and power intelligent chatbots. By the end, you’ll deeply understand linguistic concepts, statistical models, and the engineering required to process human language at scale.
Why NLP Matters
In 1950, Alan Turing proposed the “Imitation Game”—could a machine converse so naturally that a human couldn’t distinguish it from another human? Today, NLP has moved from science fiction to science fact. Every time you ask Siri a question, get an email auto-completed, or have spam filtered from your inbox, you’re using NLP.
Understanding NLP matters because:
- Ubiquity: The global NLP market was valued at $29.1 billion in 2024 and is projected to reach $158.6 billion by 2032. Every industry—healthcare, finance, legal, customer service—is being transformed by language AI.
- Data Explosion: 80% of enterprise data is unstructured text—emails, documents, social media, chat logs. NLP is the key to unlocking insights from this massive, untapped resource.
- Foundation for Modern AI: Large Language Models (LLMs) like GPT-4 and Claude are built on NLP foundations. Understanding tokenization, embeddings, and linguistic structure is essential for working with any modern AI system.
- Career Leverage: NLP engineers command premium salaries. According to Glassdoor, the average NLP engineer salary in the US is $156,000, with senior roles exceeding $250,000.
The NLP Landscape
┌─────────────────────────────────────────────────────────────────┐
│ │
│ Raw Text ──► Preprocessing ──► Analysis ──► Understanding │
│ │
│ "The quick Tokenization POS Tags Meaning & │
│ brown fox" Lemmatization Entities Intent │
│ Cleaning Relations │
│ │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Traditional NLP vs Modern NLP │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Hand-crafted │ │ Learned from │ │
│ │ rules │ │ data │ │
│ │ Regex patterns │ │ Neural networks │ │
│ │ Grammar parsers │ │ Transformers │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
│ spaCy bridges both worlds: industrial-strength, efficient, │
│ with pre-trained models and transformer integration │
│ │
└─────────────────────────────────────────────────────────────────┘
Why spaCy?
spaCy has become the industry standard for production NLP because:
- Speed: Written in Cython, spaCy is one of the fastest NLP libraries—10-20x faster than NLTK for many tasks.
- Batteries Included: Tokenization support for 75+ languages, trained pipelines for 25+ of them, and sensible defaults.
- Production-Ready: Designed for real applications, not just research. Clear APIs, consistent interfaces, and excellent documentation.
- Modern Architecture: Native support for transformers (BERT, RoBERTa), custom training pipelines, and extensible components.
- Active Development: Backed by Explosion AI, with regular updates, new features, and strong community support.
spaCy's Processing Pipeline
┌─────────────────────────────────────────────────────────────────┐
│ │
│ Text ──►┌─────────┐──►┌─────────┐──►┌─────────┐──►┌─────────┐ │
│ │Tokenizer│ │ Tagger │ │ Parser │ │ NER │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ [Tokens] [POS Tags] [Dependencies] [Entities] │
│ │
│ All components are modular and can be: │
│ • Enabled/disabled │
│ • Replaced with custom implementations │
│ • Extended with new functionality │
│ │
└─────────────────────────────────────────────────────────────────┘
Prerequisites & Background Knowledge
Essential Prerequisites (Must Have)
- Python Fundamentals: Functions, classes, list comprehensions, decorators
- Basic Data Structures: Lists, dictionaries, sets, tuples
- File I/O: Reading/writing text files, CSV handling
- Command Line Basics: Running Python scripts, pip installation
Helpful But Not Required
- Basic Statistics: Mean, variance, probability distributions
- Regular Expressions: Pattern matching (helpful for text preprocessing)
- Machine Learning Basics: Training/test splits, evaluation metrics
- Linguistics Concepts: Grammar, parts of speech (you’ll learn these)
Self-Assessment Questions
Before starting, you should be able to answer:
- “How would you read a large text file line by line in Python without loading it all into memory?”
- “What’s the difference between a list and a generator in Python?”
- “How would you count word frequencies in a string using a dictionary?”
- “What does pip install do, and what’s a virtual environment?”
If you struggle with these, spend a few days on Python basics first.
Development Environment Setup
# Create a virtual environment
python -m venv nlp-env
source nlp-env/bin/activate # On Windows: nlp-env\Scripts\activate
# Install spaCy
pip install spacy
# Download English language model (medium size, good balance)
python -m spacy download en_core_web_md
# For transformer models (optional, larger)
python -m spacy download en_core_web_trf
# Additional useful libraries
pip install pandas numpy scikit-learn matplotlib seaborn
# For Jupyter notebooks (recommended)
pip install jupyter
Time Investment
| Experience Level | Expected Time |
|---|---|
| Python beginner | 12-16 weeks |
| Python intermediate | 8-10 weeks |
| Some ML/NLP experience | 4-6 weeks |
Reality Check
NLP is challenging because human language is inherently ambiguous. “Time flies like an arrow; fruit flies like a banana” contains the same words in similar patterns but wildly different meanings. You’ll encounter edge cases constantly. Embrace the ambiguity—it’s what makes NLP fascinating.
Core Concept Analysis
1. Tokenization: Breaking Text into Units
Tokenization is the first and most fundamental step in any NLP pipeline. It converts raw text into discrete units (tokens) that can be processed.
Input: "Dr. Smith doesn't believe U.S.A. won't win."
Naive Split (by spaces):
["Dr.", "Smith", "doesn't", "believe", "U.S.A.", "won't", "win."]
spaCy Tokenization:
["Dr.", "Smith", "does", "n't", "believe", "U.S.A.", "wo", "n't", "win", "."]
Why the difference?
- Contractions are split: "doesn't" → "does" + "n't"
- Punctuation is separate: "win." → "win" + "."
- Abbreviations preserved: "Dr.", "U.S.A."
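You can reproduce this comparison in a few lines. As a sketch: spacy.blank("en") loads only the English tokenizer rules (no trained model download needed), which is enough to see contractions split and abbreviations preserved:

```python
import spacy

# spacy.blank("en") carries the English tokenizer's rules and
# exception table (contractions, abbreviations, punctuation)
# without requiring any downloaded model.
nlp = spacy.blank("en")

text = "Dr. Smith doesn't believe U.S.A. won't win."
naive = text.split()                  # naive whitespace split
tokens = [t.text for t in nlp(text)]  # spaCy tokenization

print("Naive:", naive)
print("spaCy:", tokens)
```

Note how "doesn't" splits into "does" + "n't" and the final period separates from "win", while "Dr." survives intact thanks to the tokenizer's exception table.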
2. Part-of-Speech Tagging: Understanding Grammar
POS tagging assigns grammatical labels to each token—noun, verb, adjective, etc.
"The quick brown fox jumps over the lazy dog"
Token │ POS Tag │ Meaning
──────────┼─────────┼─────────────────
The │ DET │ Determiner
quick │ ADJ │ Adjective
brown │ ADJ │ Adjective
fox │ NOUN │ Noun
jumps │ VERB │ Verb
over │ ADP │ Adposition (preposition)
the │ DET │ Determiner
lazy │ ADJ │ Adjective
dog │ NOUN │ Noun
Fine-Grained Tags (Penn Treebank):
jumps → VBZ (Verb, 3rd person singular present)
fox → NN (Noun, singular)
3. Named Entity Recognition (NER): Finding Real-World Entities
NER identifies and classifies named entities in text—people, organizations, locations, dates, etc.
"Apple CEO Tim Cook announced a new product at WWDC in San Francisco on June 5th."
Entity │ Label │ Description
──────────────┼───────┼─────────────────────
Apple │ ORG │ Organization
Tim Cook │ PERSON│ Person name
WWDC │ EVENT │ Event name
San Francisco │ GPE │ Geopolitical entity (city)
June 5th │ DATE │ Date expression
spaCy's Built-in Entity Types:
┌────────┬──────────────────────────────────────┐
│ PERSON │ People, including fictional │
│ ORG │ Companies, agencies, institutions │
│ GPE │ Countries, cities, states │
│ LOC │ Non-GPE locations (mountains, rivers)│
│ DATE │ Absolute or relative dates │
│ TIME │ Times smaller than a day │
│ MONEY │ Monetary values │
│ PERCENT│ Percentages │
│ PRODUCT│ Objects, vehicles, foods │
│ EVENT │ Named events │
└────────┴──────────────────────────────────────┘
4. Dependency Parsing: Understanding Sentence Structure
Dependency parsing reveals the grammatical relationships between words.
"The cat sat on the mat"
sat (ROOT)
/ \
cat on
/ \
The mat
/
the
Relations:
- "cat" is the nominal subject (nsubj) of "sat"
- "on" is a prepositional modifier (prep) of "sat"
- "mat" is the object of preposition (pobj)
- "The/the" are determiners (det)
5. Word Vectors and Similarity
Words can be represented as dense vectors in a high-dimensional space, where similar words are close together.
Vector Space Visualization (simplified to 2D):
▲
│ ★ king
│ ★ queen
royalty │
│
│ ★ man
│ ★ woman
common │
│
└────────────────────────────────►
male female
The famous relationship:
king - man + woman ≈ queen
Cosine Similarity:
sim(cat, dog) = 0.80 (high - both animals)
sim(cat, car) = 0.15 (low - unrelated)
6. The spaCy Doc Object Model
spaCy processes text into a rich object model:
Doc
│
┌───────────┼───────────┐
│ │ │
Tokens Spans Ents
│ │ │
┌────┴────┐ Sent Named
│ │ │ Spans Entities
text lemma pos
idx dep ent_type
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
doc.text → "Apple is looking at buying U.K. startup for $1 billion"
doc[0].text → "Apple"
doc[0].pos_ → "PROPN"
doc[0].dep_ → "nsubj"
doc[0].ent_type_ → "ORG"
doc.ents → (Apple, U.K., $1 billion)
list(doc.sents) → [Apple is looking at buying U.K. startup for $1 billion]
Concept Summary Table
| Concept | What You Must Internalize |
|---|---|
| Tokenization | How text is split into meaningful units, handling contractions, punctuation, and special cases |
| POS Tagging | Grammatical categories of words and why context matters for ambiguous words |
| Named Entity Recognition | Identifying and classifying real-world entities in unstructured text |
| Dependency Parsing | The tree structure of sentences showing grammatical relationships |
| Lemmatization | Reducing words to their dictionary form while preserving meaning |
| Word Vectors | Dense numerical representations that capture semantic meaning |
| Pipeline Architecture | How spaCy processes text through a series of modular components |
| Training & Fine-tuning | Adapting models to domain-specific data and custom entity types |
Deep Dive Reading by Concept
Primary Resources
| Concept | Resource & Chapter |
|---|---|
| spaCy Fundamentals | “Natural Language Processing with Python and spaCy” by Yuli Vasiliev (No Starch Press) — Ch. 1-3 |
| Tokenization & Preprocessing | Vasiliev Ch. 2: “Working with spaCy” |
| POS Tagging & Morphology | Vasiliev Ch. 4: “Extracting and Using Linguistic Features” |
| NER & Entity Recognition | Vasiliev Ch. 5: “Working with Word Vectors and Semantic Similarity” |
| Dependency Parsing | Vasiliev Ch. 6: “Finding Patterns and Walking the Dependency Tree” |
| Custom Training | spaCy Documentation: “Training Pipelines” |
| Transformers Integration | spaCy Documentation: “spacy-transformers” |
Supplementary Resources
| Topic | Resource |
|---|---|
| NLP Theory | “Speech and Language Processing” by Jurafsky & Martin (free online) — Chapters on POS, NER, Parsing |
| Practical NLP | freeCodeCamp NLP Course by Dr. William Mattingly (YouTube/freeCodeCamp) |
| Deep Learning NLP | “Natural Language Processing with Transformers” by Tunstall, von Werra, Wolf (O’Reilly) |
| spaCy Internals | Explosion AI Blog (explosion.ai/blog) |
Essential Reading Order
Week 1: Foundations
- Vasiliev Ch. 1-2 (spaCy basics)
- spaCy 101 documentation
Week 2: Linguistic Features
- Vasiliev Ch. 4 (POS, morphology)
- Jurafsky Ch. 8 (POS tagging theory)
Week 3: Entities & Structure
- Vasiliev Ch. 5-6 (NER, dependencies)
- freeCodeCamp NER tutorials
Week 4-6: Advanced Topics
- spaCy training documentation
- Transformer integration guides
Quick Start Guide (First 48 Hours)
If you’re feeling overwhelmed, here’s your minimal path:
Day 1 (4 hours):
- Install spaCy and download en_core_web_sm
- Complete Project 1 (Text Tokenizer)
- Read Vasiliev Ch. 1
Day 2 (4 hours):
- Complete Project 2 (POS Tagger)
- Complete Project 3 (Entity Extractor)
- Experiment with your own text samples
After 48 hours, you’ll have working code for tokenization, POS tagging, and NER—the core building blocks for everything else.
Recommended Learning Paths
Path A: Complete Beginner (12 weeks)
Weeks 1-2: Projects 1-3 (Fundamentals)
Weeks 3-4: Projects 4-6 (Analysis)
Weeks 5-6: Projects 7-9 (Applications)
Weeks 7-8: Projects 10-12 (Advanced)
Weeks 9-10: Projects 13-15 (Integration)
Weeks 11-12: Projects 16-18 (Production)
Path B: Python Developer (6 weeks)
Week 1: Projects 1-4 (fast-track fundamentals)
Week 2: Projects 5-7 (core applications)
Week 3: Projects 8-10 (custom training)
Week 4: Projects 11-13 (advanced NLP)
Week 5: Projects 14-16 (transformers)
Week 6: Projects 17-18 (production)
Path C: Data Scientist (4 weeks)
Focus on Projects 1, 3, 5, 7, 10, 11, 14, 16, and 18. Skip detailed linguistics; emphasize ML integration and production.
Project 1: The Smart Text Tokenizer
- Main Programming Language: Python
- Libraries: spaCy
- Difficulty: Beginner
- Time Estimate: 3-4 hours
What you’ll build: A command-line tool that tokenizes text files and outputs detailed token information including text, lemma, POS tag, and entity type. It will handle edge cases like contractions, URLs, emails, and special characters.
Why it teaches the concept: Tokenization is the foundation of all NLP. You’ll understand how spaCy breaks text into meaningful units and why naive splitting by spaces fails for real text.
Core challenges you’ll face:
- Contractions → Understanding why “don’t” becomes [“do”, “n’t”]
- Special tokens → Handling URLs, emails, hashtags, @mentions
- Sentence boundaries → Distinguishing “Dr.” from end-of-sentence periods
- Unicode handling → Processing emojis and non-ASCII characters
Key Concepts:
- Token attributes: text, lemma_, pos_, dep_, ent_type_
- Tokenizer exceptions: spaCy’s special case handling
- Custom tokenization: Adding your own rules
Prerequisites: Python basics, command-line familiarity
Real World Outcome
A CLI tool that processes any text file and outputs structured token analysis:
$ python tokenizer.py --input document.txt --output tokens.json
Processing: document.txt
Total tokens: 1,247
Unique tokens: 489
Sentences: 52
$ python tokenizer.py --text "Dr. Smith's email is john.smith@company.com"
Token Analysis:
┌──────────────────────┬────────┬─────────┬──────────┐
│ Text │ Lemma │ POS │ Entity │
├──────────────────────┼────────┼─────────┼──────────┤
│ Dr. │ Dr. │ PROPN │ │
│ Smith │ Smith │ PROPN │ PERSON │
│ 's │ 's │ PART │ │
│ email │ email │ NOUN │ │
│ is │ be │ AUX │ │
│ john.smith@company.com│ john.smith@company.com│ X │ │
└──────────────────────┴────────┴─────────┴──────────┘
Special Tokens Detected:
- 1 email address
- 1 abbreviation (Dr.)
The Core Question You’re Answering
“How does a computer break human language into discrete, processable units while preserving meaning?”
Humans read “don’t” as a single word with a specific meaning. But for computation, we need to recognize it contains “do” + “not”—two separate semantic units. Tokenization is the bridge between human text and machine understanding.
Concepts You Must Understand First
Self-assessment questions:
- “What’s the difference between a string and a list of strings in Python?”
- “How would you iterate through a file line by line?”
- “What is Unicode and why might ‘café’ have different byte lengths?”
If you can’t answer these, review:
- Vasiliev Ch. 1: “Introducing spaCy”
- Python documentation on strings and file I/O
Questions to Guide Your Design
Input Handling:
- How will you handle files vs. direct text input?
- What encoding should you assume (UTF-8)?
- How large might the input be? Do you need streaming?
Output Format:
- What information should each token include?
- How will you represent the output (JSON, CSV, table)?
- Should you include sentence boundaries?
Edge Cases:
- What happens with empty input?
- How do you handle binary files accidentally passed as input?
- What about extremely long “words” (like URLs)?
Thinking Exercise
Before writing code, manually tokenize this text:
"The U.S. hasn't seen inflation like this since the '70s.
Email me at test@example.com for the full report (PDF, 2.5MB)."
Write down:
- How many tokens do you expect?
- Which tokens are “tricky” and why?
- Where are the sentence boundaries?
Now run it through spaCy and compare. What surprised you?
The Interview Questions They’ll Ask
- “Why doesn’t spaCy just split on whitespace and punctuation?”
- “How does spaCy handle contractions differently than NLTK?”
- “What’s the difference between a token and a word?”
- “How would you customize the tokenizer for a specific domain (e.g., medical abbreviations)?”
- “What’s the computational complexity of tokenization in spaCy?”
Hints in Layers
Hint 1: Getting Started
Start with nlp = spacy.load("en_core_web_sm") and doc = nlp(text). Iterate through doc to access tokens.
Hint 2: Token Attributes
Each token has attributes: token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_. The underscore versions give human-readable strings.
Hint 3: Sentence Segmentation
Use doc.sents to iterate through sentences. Each sentence is a Span object containing tokens.
Hint 4: Output Formatting
Use the tabulate library for nice CLI tables, or json.dumps() with indent=2 for structured output.
Books That Will Help
| Topic | Book & Chapter |
|---|---|
| spaCy Tokenization | Vasiliev Ch. 2: “Getting Started” |
| Tokenization Theory | Jurafsky Ch. 2: “Regular Expressions, Text Normalization” |
| Unicode Handling | “Fluent Python” by Ramalho — Ch. 4: “Text vs. Bytes” |
Common Pitfalls & Debugging
| Problem | Root Cause | Fix |
|---|---|---|
| OSError: Can't find model 'en_core_web_sm' | Model not downloaded | Run python -m spacy download en_core_web_sm |
| Contractions not split | Using wrong model or custom tokenizer | Ensure using standard spaCy model |
| Memory error on large file | Loading entire file at once | Process in chunks using nlp.pipe() |
| Weird characters in output | Encoding mismatch | Ensure UTF-8 encoding when reading files |
Learning Milestones
Milestone 1: Basic Tokenization You can tokenize a sentence and print each token’s text and POS tag.
Milestone 2: File Processing You can process entire documents and output structured JSON with token counts and statistics.
Milestone 3: Edge Case Handling You understand why spaCy makes specific tokenization decisions and can explain them.
Project 2: The Grammar Detective (POS Tagger)
- Main Programming Language: Python
- Libraries: spaCy, matplotlib
- Difficulty: Beginner
- Time Estimate: 4-5 hours
What you’ll build: A tool that analyzes text and visualizes the distribution of parts of speech. It will identify patterns like noun-heavy technical documents vs. verb-heavy narratives, and highlight grammatically interesting constructions.
Why it teaches the concept: POS tagging is the gateway to understanding sentence structure. You’ll see how the same word (“bank”) can be a noun or verb depending on context, and how statistical models resolve this ambiguity.
Core challenges you’ll face:
- Ambiguity resolution → How “flies” is tagged differently in “time flies” vs “fruit flies”
- Fine-grained vs coarse tags → Understanding NN vs NNS vs NNP
- Visualization → Creating meaningful charts of POS distributions
- Comparative analysis → Comparing writing styles across documents
Key Concepts:
- Universal POS tags: NOUN, VERB, ADJ, etc.
- Penn Treebank tags: NN, VB, JJ, etc. (fine-grained)
- Morphological features: Number, tense, case
Prerequisites: Project 1 completed
Real World Outcome
A CLI tool that produces POS analysis and visualizations:
$ python pos_analyzer.py --input hemingway.txt
POS Distribution for: hemingway.txt
┌──────────┬───────┬────────────┐
│ POS Tag │ Count │ Percentage │
├──────────┼───────┼────────────┤
│ NOUN │ 2,341 │ 24.8% │
│ VERB │ 1,876 │ 19.9% │
│ DET │ 1,234 │ 13.1% │
│ ADJ │ 892 │ 9.5% │
│ PRON │ 756 │ 8.0% │
│ ADP │ 721 │ 7.6% │
│ ... │ ... │ ... │
└──────────┴───────┴────────────┘
Writing Style Analysis:
- Noun-to-verb ratio: 1.25 (narrative style)
- Adjective density: 9.5% (sparse description)
- Average sentence length: 12.3 words
Saved visualization: hemingway_pos_distribution.png
$ python pos_analyzer.py --compare hemingway.txt faulkner.txt
Comparative Analysis:
┌────────────┬───────────┬───────────┐
│ Feature │ Hemingway │ Faulkner │
├────────────┼───────────┼───────────┤
│ Adj/Noun │ 0.38 │ 0.67 │
│ Avg Sent │ 12.3 │ 34.7 │
│ Passive % │ 8.2% │ 18.4% │
└────────────┴───────────┴───────────┘
The Core Question You’re Answering
“How does context determine the grammatical role of a word?”
The word “book” is a noun in “I read a book” but a verb in “Please book the flight.” Statistical POS taggers learn these patterns from massive amounts of annotated text, achieving 97%+ accuracy on standard English.
Thinking Exercise
Consider the sentence: “The old man the boats.”
- On first reading, what POS tags would you assign?
- Re-read it. What’s the actual meaning? (Hint: it’s a complete sentence)
- How do the POS tags change with the correct interpretation?
This is a “garden path sentence”—it tricks readers into the wrong parse. Understanding such cases reveals how sophisticated language processing must be.
The Interview Questions They’ll Ask
- “What’s the difference between POS tagging and parsing?”
- “How does spaCy’s tagger handle out-of-vocabulary words?”
- “Why might a POS tagger fail on social media text?”
- “Explain the difference between fine-grained and universal POS tags.”
- “How would you evaluate POS tagger accuracy?”
Hints in Layers
Hint 1: Basic Access
Use token.pos_ for universal tags (NOUN, VERB) and token.tag_ for fine-grained Penn Treebank tags (NN, VBZ).
Hint 2: Counting
Use collections.Counter to count POS frequencies: Counter(token.pos_ for token in doc)
Hint 3: Visualization
Use matplotlib’s bar() function to create distribution charts. Sort by frequency for readability.
Hint 4: Morphology
Access token.morph for detailed features like Number=Plur|Person=3.
Common Pitfalls & Debugging
| Problem | Root Cause | Fix |
|---|---|---|
| Wrong tags for proper nouns | Case sensitivity | spaCy handles this automatically; check your input |
| “X” tags everywhere | Non-standard text | Common for URLs, emojis; filter or handle separately |
| Inconsistent tags across runs | Different model or spaCy versions | Pin model and spaCy versions; inference itself is deterministic |
Project 3: The Entity Extractor (NER System)
- Main Programming Language: Python
- Libraries: spaCy, displacy
- Difficulty: Beginner-Intermediate
- Time Estimate: 5-6 hours
What you’ll build: A named entity extraction system that identifies people, organizations, locations, dates, and monetary values in text. It will output structured data suitable for knowledge graphs and visualize entities in HTML.
Why it teaches the concept: NER is one of the most practically valuable NLP tasks. You’ll understand how statistical models identify entity boundaries and classify them, and why context is crucial (“Apple” the company vs. “apple” the fruit).
Core challenges you’ll face:
- Entity boundary detection → Where does “New York City” start and end?
- Entity classification → Is “Amazon” a company, river, or mythological group?
- Nested entities → “Bank of America” contains both ORG and GPE
- Visualization → Creating readable entity highlighting
Key Concepts:
- Entity types: PERSON, ORG, GPE, DATE, MONEY, etc.
- BIO tagging: Begin-Inside-Outside scheme for entity boundaries
- Entity linking: Connecting mentions to knowledge bases
Prerequisites: Projects 1-2 completed
Real World Outcome
$ python entity_extractor.py --input news_article.txt
Named Entities Found:
┌──────────────────────┬─────────┬───────────────────────────────┐
│ Entity │ Type │ Context │
├──────────────────────┼─────────┼───────────────────────────────┤
│ Tim Cook │ PERSON │ "...CEO Tim Cook announced..."│
│ Apple │ ORG │ "Apple is launching..." │
│ Cupertino │ GPE │ "...headquarters in Cupertino"│
│ September 15th │ DATE │ "...available September 15th" │
│ $999 │ MONEY │ "...starting at $999..." │
└──────────────────────┴─────────┴───────────────────────────────┘
Entity Statistics:
- PERSON: 3
- ORG: 7
- GPE: 4
- DATE: 2
- MONEY: 3
Output saved: entities.json
Visualization: entities.html (open in browser)
$ cat entities.json
{
"entities": [
{"text": "Tim Cook", "type": "PERSON", "start": 45, "end": 53},
{"text": "Apple", "type": "ORG", "start": 0, "end": 5},
...
]
}
The Core Question You’re Answering
“How does a machine identify and classify named entities in unstructured text?”
Humans effortlessly recognize “Barack Obama” as a person and “Paris” as a place. But teaching machines this requires understanding that entity recognition is fundamentally a sequence labeling problem—determining both boundaries and categories.
Thinking Exercise
Consider this text:
"The Washington Post reported that Washington denied the claims.
George Washington would have been surprised."
Identify:
- Which “Washington” is a newspaper (ORG)?
- Which is a city/government (GPE)?
- Which is a person (PERSON)?
What clues help you distinguish them? How might a machine learn these patterns?
The Interview Questions They’ll Ask
- “What’s the difference between NER and entity linking?”
- “How does spaCy handle overlapping entities?”
- “What is the BIO tagging scheme and why is it used?”
- “How would you handle NER for a language without spaces between words (like Chinese)?”
- “What’s precision vs. recall in the context of NER evaluation?”
Hints in Layers
Hint 1: Accessing Entities
Use doc.ents to get all entities. Each entity has .text, .label_, .start_char, .end_char.
Hint 2: Visualization
Use displacy.render(doc, style="ent") to generate HTML visualization. Add jupyter=True in notebooks.
Hint 3: Structured Output
Convert entities to dictionaries: [{"text": e.text, "type": e.label_, "start": e.start_char, "end": e.end_char} for e in doc.ents]
Hint 4: Entity Spans
Entities are Span objects. Access tokens within: for token in entity: print(token)
Project 4: The Dependency Visualizer
- Main Programming Language: Python
- Libraries: spaCy, displacy, networkx
- Difficulty: Intermediate
- Time Estimate: 6-7 hours
What you’ll build: A tool that parses sentences and visualizes their grammatical structure as dependency trees. It will identify subjects, objects, modifiers, and help users understand how words relate to each other.
Why it teaches the concept: Dependency parsing reveals the deep structure of language. You’ll understand why “The dog bit the man” and “The man bit the dog” have identical words but opposite meanings, and how machines understand these relationships.
Core challenges you’ll face:
- Tree visualization → Rendering complex trees readably
- Relation extraction → Finding subject-verb-object triples
- Complex sentences → Handling coordination, subordination
- Question answering → Using dependencies to answer “who did what to whom”
Key Concepts:
- Dependency relations: nsubj, dobj, prep, pobj, etc.
- Head-dependent relationships: Each word has one head
- Projective parsing: Why crossing dependencies are problematic
Prerequisites: Projects 1-3 completed
Real World Outcome
$ python dep_visualizer.py --sentence "The quick brown fox jumps over the lazy dog"
Dependency Parse:
┌────────┬─────────┬────────┬────────────┐
│ Token │ Dep Rel │ Head │ Children │
├────────┼─────────┼────────┼────────────┤
│ The │ det │ fox │ [] │
│ quick │ amod │ fox │ [] │
│ brown │ amod │ fox │ [] │
│ fox │ nsubj │ jumps │ [The,quick,brown] │
│ jumps │ ROOT │ jumps │ [fox,over] │
│ over │ prep │ jumps │ [dog] │
│ the │ det │ dog │ [] │
│ lazy │ amod │ dog │ [] │
│ dog │ pobj │ over │ [the,lazy] │
└────────┴─────────┴────────┴────────────┘
Subject-Verb-Object Triples:
- fox → jumps → (over dog)
Visualization saved: dependency_tree.svg
HTML visualization: dependency_parse.html
$ python dep_visualizer.py --question "Who jumps over what?"
Answer: "fox jumps over dog"
The Core Question You’re Answering
“How do we represent the grammatical structure of a sentence in a way machines can reason about?”
Unlike humans who process language holistically, machines need explicit structure. Dependency parsing provides a graph representation where we can traverse from any word to understand its role and relationships.
Thinking Exercise
Parse this sentence manually:
"The cat that I saw yesterday caught a mouse"
- What is the root verb?
- What is the subject of “caught”?
- What is the subject of “saw”?
- Draw the dependency tree
Now verify with spaCy. Pay attention to how relative clauses are handled.
The Interview Questions They’ll Ask
- “What’s the difference between constituency parsing and dependency parsing?”
- “How would you extract all subject-verb-object triples from a document?”
- “What is a projective dependency tree?”
- “How does spaCy’s parser handle ambiguity?”
- “What’s the time complexity of transition-based dependency parsing?”
Hints in Layers
Hint 1: Basic Navigation
Each token has token.head (its parent) and token.children (dependents). The root has token.head == token.
Hint 2: Relation Labels
token.dep_ gives the dependency relation. Key relations: nsubj (subject), dobj (direct object), prep (preposition).
Hint 3: Tree Traversal
Write a recursive function: def traverse(token, depth=0) that prints token and calls itself on children.
Hint 4: Extracting Triples
Find verbs (token.pos_ == "VERB"), then look for nsubj and dobj children to form (subject, verb, object) triples.
Project 5: The Lemmatizer vs Stemmer Showdown
- Main Programming Language: Python
- Libraries: spaCy, NLTK (for comparison)
- Difficulty: Intermediate
- Time Estimate: 4-5 hours
What you’ll build: A comparative tool that applies both lemmatization (spaCy) and stemming (Porter/Snowball) to text, showing the differences and when each is appropriate.
Why it teaches the concept: Both reduce words to base forms, but lemmatization uses vocabulary and morphology while stemming uses rules. Understanding the tradeoffs is crucial for information retrieval and text normalization.
Core challenges you’ll face:
- Lemma vs stem differences → “better” → “good” (lemma) vs “better” (stem)
- Context-dependent lemmatization → “meeting” as noun vs verb
- Performance comparison → Stemming is faster but less accurate
- Use case analysis → When to use which approach
Key Concepts:
- Lemmatization: Dictionary-based reduction to base form
- Stemming: Rule-based suffix stripping
- Morphological analysis: Understanding word forms
Prerequisites: Projects 1-2 completed
Real World Outcome
$ python lemma_vs_stem.py --input text.txt
Lemmatization vs Stemming Comparison
┌────────────┬────────────┬───────────┬──────────────┐
│ Original │ Lemma │ Stem │ Difference? │
├────────────┼────────────┼───────────┼──────────────┤
│ running │ run │ run │ No │
│ better │ good │ better │ Yes │
│ mice │ mouse │ mice │ Yes │
│ studies │ study │ studi │ Yes │
│ meeting │ meeting/meet│ meet │ Context-dep │
│ happily │ happily │ happili │ Yes │
└────────────┴────────────┴───────────┴──────────────┘
Statistics:
- Words where lemma ≠ stem: 34.2%
- Lemmatization time: 0.023s
- Stemming time: 0.008s
Recommendation for your text:
→ Use lemmatization: High accuracy needs, linguistic applications
→ Use stemming: Speed-critical search indexing
The Core Question You’re Answering
“How do we normalize word variations while preserving linguistic meaning?”
“Run”, “runs”, “running”, “ran” are all forms of the same concept. But how do we know “better” relates to “good”? Lemmatization uses linguistic knowledge while stemming uses patterns. Understanding both reveals fundamental NLP tradeoffs.
Thinking Exercise
For each word, predict the lemma and stem:
- “geese” → lemma: ? stem: ?
- “unable” → lemma: ? stem: ?
- “saw” (past tense of “see”) → lemma: ?
- “saw” (cutting tool) → lemma: ?
What does this tell you about the limitations of stemming?
The Interview Questions They’ll Ask
- “When would you use stemming over lemmatization?”
- “How does spaCy determine the correct lemma for ambiguous words?”
- “What’s the Porter Stemmer algorithm?”
- “Can you lemmatize without POS information? What are the tradeoffs?”
- “How would you handle lemmatization for a low-resource language?”
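A toy sketch makes the tradeoff concrete. The lookup table and suffix rules below are illustrative stand-ins for spaCy's dictionary-based lemmatizer and a Porter-style stemmer, not real linguistic resources:

```python
# Dictionary-based lemmatization vs. rule-based stemming (toy versions).
LEMMA_TABLE = {  # lookup needs linguistic knowledge: "better" -> "good"
    "better": "good",
    "mice": "mouse",
    "studies": "study",
    "running": "run",
}

SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ly", "li"), ("s", "")]

def lemmatize(word: str) -> str:
    """Dictionary-based: falls back to the word itself when unknown."""
    return LEMMA_TABLE.get(word, word)

def stem(word: str) -> str:
    """Rule-based: strips the first matching suffix, no vocabulary needed."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)] + replacement
            if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiouls":
                word = word[:-1]  # undouble: "runn" -> "run"
            return word
    return word

for w in ["running", "better", "studies", "mice", "happily"]:
    print(f"{w:10} lemma={lemmatize(w):10} stem={stem(w)}")
```

Note that the stemmer has no idea "better" relates to "good", and happily produces non-words like "studi": exactly the tradeoff in the table above.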
Project 6: The Sentiment Analyzer
- Main Programming Language: Python
- Libraries: spaCy, spacy-textblob or custom classifier
- Difficulty: Intermediate
- Time Estimate: 6-8 hours
What you’ll build: A sentiment analysis system that classifies text as positive, negative, or neutral, with confidence scores and explanations of which words/phrases drive the sentiment.
Why it teaches the concept: Sentiment analysis is one of the most commercially valuable NLP applications. You’ll understand both lexicon-based approaches and how to train custom classifiers for domain-specific sentiment.
Core challenges you’ll face:
- Negation handling → “not bad” is positive, not negative
- Sarcasm and irony → “Oh great, another meeting” is negative
- Domain adaptation → “sick” is negative in healthcare, positive in slang
- Aspect-based sentiment → “Great food but terrible service”
Key Concepts:
- Lexicon-based sentiment: Using word lists with polarity scores
- Machine learning approaches: Training classifiers on labeled data
- Sentiment intensity: Beyond binary positive/negative
Prerequisites: Projects 1-4 completed
Real World Outcome
$ python sentiment_analyzer.py --text "The movie was absolutely fantastic! Great acting and storyline."
Sentiment Analysis Results:
┌───────────────────────────────────────────────────────────┐
│ Overall Sentiment: POSITIVE (Confidence: 0.92)            │
├───────────────────────────────────────────────────────────┤
│ Polarity Score: +0.85 (range: -1.0 to +1.0)               │
│ Subjectivity: 0.78 (range: 0.0 to 1.0)                    │
└───────────────────────────────────────────────────────────┘
Word-Level Sentiment:
┌───────────────┬───────────┬───────────┐
│ Word          │ Polarity  │ Intensity │
├───────────────┼───────────┼───────────┤
│ fantastic     │ +0.9      │ strong    │
│ great         │ +0.8      │ moderate  │
└───────────────┴───────────┴───────────┘
$ python sentiment_analyzer.py --file reviews.csv --aspect-based
Aspect-Based Sentiment (100 reviews):
┌──────────────┬───────────┬───────────┬───────────┐
│ Aspect       │ Positive  │ Negative  │ Neutral   │
├──────────────┼───────────┼───────────┼───────────┤
│ food         │ 78        │ 12        │ 10        │
│ service      │ 45        │ 42        │ 13        │
│ price        │ 23        │ 55        │ 22        │
│ ambiance     │ 67        │ 18        │ 15        │
└──────────────┴───────────┴───────────┴───────────┘
The Core Question You’re Answering
“How do we automatically determine the emotional tone of text?”
Human language is nuanced—“I don’t hate it” technically has positive sentiment but weak enthusiasm. Understanding how machines approximate human judgment of tone reveals both the power and limitations of NLP.
Thinking Exercise
Rate these sentences from -1 (very negative) to +1 (very positive):
- “This is the best pizza I’ve ever had!”
- “The pizza was okay.”
- “I’ve had better pizza from a frozen box.”
- “Not the worst pizza I’ve eaten.”
Now imagine writing rules to handle all these cases. What patterns emerge? What makes it hard?
The Interview Questions They’ll Ask
- “How would you handle negation in sentiment analysis?”
- “What’s the difference between document-level and aspect-based sentiment?”
- “How would you build a sentiment classifier for a new domain with limited labeled data?”
- “What are the limitations of lexicon-based sentiment analysis?”
- “How do you handle sarcasm and irony in sentiment analysis?”
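A minimal lexicon-based scorer with a one-token negation window shows why "not bad" comes out positive. The polarity word list here is a tiny illustrative sample, nothing like a real lexicon such as VADER or SentiWordNet:

```python
# Lexicon-based sentiment with naive negation handling (toy version).
LEXICON = {
    "fantastic": 0.9, "great": 0.8, "good": 0.6,
    "bad": -0.7, "terrible": -0.9, "awful": -0.8,
}
NEGATORS = {"not", "never", "no", "n't"}

def sentiment(text: str) -> float:
    """Average polarity of lexicon words; a preceding negator flips
    (and dampens) the polarity, so "not bad" scores mildly positive."""
    tokens = [t.strip(".,!?") for t in text.lower().replace("n't", " n't").split()]
    score, hits = 0.0, 0
    for i, tok in enumerate(tokens):
        polarity = LEXICON.get(tok)
        if polarity is None:
            continue
        if i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity * 0.5
        score += polarity
        hits += 1
    return score / hits if hits else 0.0

print(sentiment("The movie was fantastic and the acting was great"))  # positive
print(sentiment("not bad at all"))  # positive despite "bad"
```

The one-token window is exactly where this approach breaks down: "not very good" and sarcasm both defeat it, which is why trained classifiers exist.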
Project 7: The Text Classifier
- Main Programming Language: Python
- Libraries: spaCy, scikit-learn
- Difficulty: Intermediate
- Time Estimate: 8-10 hours
What you’ll build: A multi-class text classification system that categorizes documents into predefined categories (e.g., news topics, support tickets, email types). You’ll train custom models and evaluate their performance.
Why it teaches the concept: Text classification is fundamental to applications like spam filtering, content moderation, and routing. You’ll understand feature extraction, model training, and evaluation metrics in NLP.
Core challenges you’ll face:
- Feature engineering → Converting text to numerical features
- Class imbalance → Handling categories with few examples
- Evaluation → Precision, recall, F1 for multi-class problems
- Model selection → Comparing approaches (Naive Bayes, SVM, neural)
Key Concepts:
- Text vectorization: TF-IDF, word embeddings
- spaCy TextCategorizer: Built-in classification component
- Cross-validation: Reliable performance estimation
Prerequisites: Projects 1-6 completed, basic ML knowledge helpful
Real World Outcome
$ python text_classifier.py train --data training_data.json --model news_classifier
Training text classifier...
Categories: ['politics', 'sports', 'technology', 'entertainment', 'business']
Training samples: 10,000
Validation samples: 2,000
Training Progress:
Epoch 1: Loss=2.34, Val Accuracy=0.72
Epoch 5: Loss=0.89, Val Accuracy=0.87
Epoch 10: Loss=0.45, Val Accuracy=0.91
Model saved: news_classifier/
$ python text_classifier.py predict --model news_classifier --text "Apple announces new iPhone with revolutionary camera"
Prediction Results:
┌─────────────────┬────────────┐
│ Category        │ Confidence │
├─────────────────┼────────────┤
│ technology      │ 0.89       │
│ business        │ 0.08       │
│ entertainment   │ 0.02       │
│ politics        │ 0.01       │
│ sports          │ 0.00       │
└─────────────────┴────────────┘
$ python text_classifier.py evaluate --model news_classifier --data test_data.json
Classification Report:
               precision    recall  f1-score   support
     politics       0.92      0.89      0.90       412
       sports       0.95      0.97      0.96       389
   technology       0.89      0.91      0.90       445
entertainment       0.87      0.84      0.85       367
     business       0.88      0.90      0.89       387
     accuracy                           0.90      2000
The Core Question You’re Answering
“How do we teach a machine to recognize what a document is ‘about’?”
Unlike keyword matching, classification must understand that “Apple’s stock price surged” is about business, not fruit, even though “apple” appears. The model must learn semantic patterns from examples.
Thinking Exercise
You’re building a support ticket classifier with these categories:
- Billing
- Technical Issue
- Feature Request
- Account Access
Consider this ticket: “I can’t log in to pay my bill”
- Which category does it belong to?
- What features would help classify it?
- How would you handle tickets that span multiple categories?
The Interview Questions They’ll Ask
- “What’s the difference between text classification and named entity recognition?”
- “How do you handle class imbalance in text classification?”
- “Explain TF-IDF and why it’s useful for text classification.”
- “What’s the cold start problem in classification and how would you address it?”
- “How would you deploy a text classifier to handle 10,000 requests per second?”
Hints in Layers
Hint 1: Data Format
spaCy expects training data as: [("text", {"cats": {"category1": 1.0, "category2": 0.0}})]
Hint 2: TextCategorizer Component
Add to pipeline: textcat = nlp.add_pipe("textcat") then textcat.add_label("category_name")
Hint 3: Training Loop
Use nlp.update() with batches of examples. Track loss to monitor convergence.
Hint 4: Evaluation
Use scikit-learn’s classification_report() for detailed precision/recall/F1.
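The four hints combine into a short training loop. This is a sketch on two toy examples, assuming spaCy v3's `textcat` component, `Example.from_dict`, and `nlp.initialize()`; a real run would use thousands of examples and held-out validation:

```python
import spacy
from spacy.training import Example

# Hint 2: add the TextCategorizer and its labels to a blank pipeline.
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
for label in ("technology", "sports"):
    textcat.add_label(label)

# Hint 1: training data as (text, {"cats": {...}}) pairs.
TRAIN_DATA = [
    ("Apple unveils a new laptop chip",
     {"cats": {"technology": 1.0, "sports": 0.0}}),
    ("The team won the championship game",
     {"cats": {"technology": 0.0, "sports": 1.0}}),
]

# Hint 3: update loop; track losses to monitor convergence.
optimizer = nlp.initialize()
for epoch in range(20):
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)

doc = nlp("A new smartphone processor was announced")
print(doc.cats)  # a confidence score per category
```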
Project 8: Custom NER Model Training
- Main Programming Language: Python
- Libraries: spaCy
- Difficulty: Intermediate-Advanced
- Time Estimate: 10-12 hours
What you’ll build: A custom named entity recognition model trained on domain-specific data. You’ll annotate training data, train the model, and evaluate its performance on your specific entity types.
Why it teaches the concept: Real-world NER often requires custom entities not in pre-trained models (product names, medical terms, legal entities). You’ll understand the full training pipeline: annotation, training, evaluation, and iteration.
Core challenges you’ll face:
- Data annotation → Creating high-quality training examples
- Entity boundary consistency → Ensuring consistent spans
- Training configuration → Optimizing hyperparameters
- Model evaluation → Measuring precision, recall, F1
Key Concepts:
- spaCy training config: config.cfg structure
- Training data format: DocBin and JSON formats
- Transfer learning: Starting from pre-trained models
Prerequisites: Project 3 completed, understanding of ML training basics
Real World Outcome
$ python custom_ner.py prepare --input raw_documents/ --output training_data/
Annotation Guidelines:
- PRODUCT: Product names (iPhone, MacBook Pro)
- FEATURE: Feature names (Face ID, Retina Display)
- SPEC: Technical specifications (256GB, A15 chip)
Starting annotation tool on http://localhost:8080
$ python custom_ner.py train --config config.cfg --data training_data/
Training Custom NER Model
Entity types: ['PRODUCT', 'FEATURE', 'SPEC']
Training examples: 500
Validation examples: 100
Training Progress:
Step 0: Loss=25.4, NER P=0.12, R=0.08, F1=0.09
Step 1000: Loss=3.2, NER P=0.78, R=0.72, F1=0.75
Step 2000: Loss=1.1, NER P=0.89, R=0.85, F1=0.87
Best model saved: models/custom_ner_best/
$ python custom_ner.py predict --model models/custom_ner_best --text "The new MacBook Pro with M2 chip and 512GB SSD"
Custom Entities Found:
┌──────────────┬─────────┬───────────┬───────────┐
│ Entity       │ Type    │ Start     │ End       │
├──────────────┼─────────┼───────────┼───────────┤
│ MacBook Pro  │ PRODUCT │ 8         │ 19        │
│ M2 chip      │ SPEC    │ 25        │ 32        │
│ 512GB SSD    │ SPEC    │ 37        │ 47        │
└──────────────┴─────────┴───────────┴───────────┘
The Core Question You’re Answering
“How do we teach a machine to recognize entity types it’s never seen before?”
Pre-trained models know about generic entities, but your domain has specific needs. Training custom NER requires understanding what the model learns (patterns, context) and how to provide sufficient examples for generalization.
Thinking Exercise
You need to train a NER model for legal documents with these entities:
- CASE_NUMBER (e.g., “2024-CV-1234”)
- PARTY_NAME (e.g., “Smith v. Jones”)
- LEGAL_TERM (e.g., “habeas corpus”, “prima facie”)
For each entity type:
- How many training examples would you need?
- What patterns would the model learn?
- What edge cases might be difficult?
The Interview Questions They’ll Ask
- “How much training data do you need for custom NER?”
- “What’s the difference between fine-tuning and training from scratch?”
- “How do you handle entities that overlap or nest?”
- “What strategies help when you have very few labeled examples?”
- “How do you evaluate NER model performance in production?”
Hints in Layers
Hint 1: Data Format
Use spaCy’s DocBin format for training: each example needs text and entity spans (start, end, label).
Hint 2: Config File
Start with python -m spacy init config config.cfg --lang en --pipeline ner and customize.
Hint 3: Training Command
Use python -m spacy train config.cfg --output ./models --paths.train ./train.spacy --paths.dev ./dev.spacy
Hint 4: Annotation Tools
Consider using Prodigy (commercial) or doccano (open source) for efficient annotation.
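Hint 1 in code: building one annotated example in DocBin form. The text and the PRODUCT label match the demo output above; `char_span` returning `None` is the usual symptom of entity offsets that don't align with token boundaries:

```python
import spacy
from spacy.tokens import DocBin

# One training example for a custom NER model, stored as a DocBin.
nlp = spacy.blank("en")
doc = nlp.make_doc("The new MacBook Pro with M2 chip")

# Character offsets (start, end, label); char_span validates alignment.
span = doc.char_span(8, 19, label="PRODUCT")  # "MacBook Pro"
assert span is not None, "span does not align with token boundaries"
doc.ents = [span]

db = DocBin()
db.add(doc)
data = db.to_bytes()  # or db.to_disk("train.spacy") for the training CLI
```

Checking the `char_span` result on every annotation catches off-by-one offsets before they silently disappear from the training set.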
Project 9: The Similarity Engine
- Main Programming Language: Python
- Libraries: spaCy, numpy
- Difficulty: Intermediate
- Time Estimate: 6-8 hours
What you’ll build: A document similarity system that finds semantically related documents using word vectors. It will support queries like “find documents similar to this one” and power applications like related article recommendations.
Why it teaches the concept: Understanding word vectors (embeddings) is fundamental to modern NLP. You’ll learn how meaning is captured in high-dimensional space and how similarity computations work.
Core challenges you’ll face:
- Vector aggregation → Combining word vectors into document vectors
- Similarity metrics → Cosine similarity vs. Euclidean distance
- Efficient search → Handling large document collections
- Out-of-vocabulary words → What happens when a word has no vector?
Key Concepts:
- Word embeddings: Dense vector representations
- Document vectors: Aggregating word vectors
- Cosine similarity: Measuring vector angles
Prerequisites: Projects 1-3 completed, basic linear algebra helpful
Real World Outcome
$ python similarity_engine.py index --documents articles/
Indexing 10,000 documents...
Building document vectors...
Index saved: document_index.pkl
$ python similarity_engine.py search --query "machine learning applications in healthcare"
Most Similar Documents:
┌─────┬───────────────────────────────────────────┬────────────┐
│ Rank│ Document                                  │ Similarity │
├─────┼───────────────────────────────────────────┼────────────┤
│ 1   │ AI_Healthcare_Diagnosis.txt               │ 0.89       │
│ 2   │ ML_Medical_Imaging.txt                    │ 0.85       │
│ 3   │ Deep_Learning_Drug_Discovery.txt          │ 0.82       │
│ 4   │ NLP_Clinical_Notes.txt                    │ 0.79       │
│ 5   │ Healthcare_Analytics_Overview.txt         │ 0.76       │
└─────┴───────────────────────────────────────────┴────────────┘
$ python similarity_engine.py similar --document articles/AI_Healthcare_Diagnosis.txt --n 5
Documents similar to: AI_Healthcare_Diagnosis.txt
1. ML_Medical_Imaging.txt (0.91)
2. Radiology_AI_Applications.txt (0.87)
...
The Core Question You’re Answering
“How do we measure the semantic similarity between pieces of text?”
Traditional keyword matching fails when documents use different words for the same concept. Word vectors capture meaning: “doctor” and “physician” are close in vector space even though they share no characters. This enables semantic search.
Thinking Exercise
Consider these document pairs:
- “The cat sat on the mat” vs “A feline rested on the rug”
- “Apple announces new iPhone” vs “Orange introduces new smartphone”
- “Bank robbery downtown” vs “Financial institution branch opening”
For each pair:
- What would keyword overlap score?
- What would vector similarity score (estimate)?
- Why do they differ?
The Interview Questions They’ll Ask
- “How are word vectors trained (Word2Vec, GloVe)?”
- “What’s the difference between static embeddings and contextual embeddings?”
- “How would you compute document similarity for millions of documents efficiently?”
- “What are the limitations of averaging word vectors?”
- “How would you handle out-of-vocabulary words in a similarity system?”
Hints in Layers
Hint 1: Accessing Vectors
Use token.vector for word vectors and doc.vector for the averaged document vector. Requires en_core_web_md or larger.
Hint 2: Similarity Computation
Use doc1.similarity(doc2) or compute cosine similarity manually: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
Hint 3: Efficient Search
For large collections, consider approximate nearest neighbor libraries like FAISS or Annoy.
Hint 4: Better Document Vectors
Weight by TF-IDF or use sentence transformers for better document representations.
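Hints 1 and 2 reduce to a few lines of numpy. The embedding table below is hand-made for illustration (a real pipeline would read `token.vector` from `en_core_web_md`), but the averaging and cosine math are identical:

```python
import numpy as np

# Toy 3-dimensional word embeddings; real vectors have 300+ dimensions.
EMBEDDINGS = {
    "doctor":    np.array([0.9, 0.1, 0.0]),
    "physician": np.array([0.85, 0.15, 0.05]),  # deliberately near "doctor"
    "banana":    np.array([0.0, 0.2, 0.9]),
    "visited":   np.array([0.3, 0.7, 0.1]),
    "ate":       np.array([0.2, 0.6, 0.3]),
}

def doc_vector(tokens):
    """Average the vectors of in-vocabulary tokens; OOV words are skipped."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

def cosine(a, b):
    """Cosine similarity: the angle between vectors, not their length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

d1 = doc_vector("i visited the doctor".split())
d2 = doc_vector("she ate a banana".split())
d3 = doc_vector("the physician visited".split())
print(cosine(d1, d3))  # high: doctor/physician are close in vector space
print(cosine(d1, d2))  # lower: different topic
```

Note how "doctor" and "physician" score as similar despite sharing no characters, which is what keyword matching can never do.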
Project 10: The Keyword Extractor
- Main Programming Language: Python
- Libraries: spaCy, pytextrank or custom
- Difficulty: Intermediate
- Time Estimate: 5-7 hours
What you’ll build: An automatic keyword and keyphrase extraction system using multiple algorithms (TextRank, TF-IDF, RAKE). It will identify the most important terms in documents for tagging, summarization, and SEO.
Why it teaches the concept: Keyword extraction combines multiple NLP techniques: POS tagging (to identify candidates), graph algorithms (TextRank), and statistical measures (TF-IDF). It’s a practical application with many use cases.
Core challenges you’ll face:
- Candidate selection → Which phrases could be keywords?
- Ranking algorithms → TextRank vs. TF-IDF vs. RAKE
- Phrase extraction → Multi-word keywords vs. single words
- Domain adaptation → Different domains have different important terms
Key Concepts:
- TextRank algorithm: Graph-based ranking
- TF-IDF for keywords: Term importance in corpus
- RAKE: Rapid Automatic Keyword Extraction
Prerequisites: Projects 1-4 completed
Real World Outcome
$ python keyword_extractor.py --input research_paper.txt --method all
Keyword Extraction Results for: research_paper.txt
TextRank Keywords:
┌─────┬──────────────────────────┬───────────┐
│ Rank│ Keyword/Phrase           │ Score     │
├─────┼──────────────────────────┼───────────┤
│ 1   │ neural network           │ 0.089     │
│ 2   │ deep learning            │ 0.076     │
│ 3   │ attention mechanism      │ 0.068     │
│ 4   │ transformer architecture │ 0.054     │
│ 5   │ language model           │ 0.051     │
└─────┴──────────────────────────┴───────────┘
TF-IDF Keywords:
1. transformer (0.234)
2. attention (0.198)
3. embedding (0.167)
...
RAKE Keywords:
1. self-attention mechanism (score: 12.5)
2. pre-trained language model (score: 11.2)
...
Agreement Analysis:
- Keywords appearing in all methods: neural network, deep learning, transformer
- Method correlation: TextRank-TFIDF=0.78, TextRank-RAKE=0.65
The Core Question You’re Answering
“How do we automatically identify the most important concepts in a document?”
Humans can quickly identify what a document is “about.” Teaching machines this requires understanding both statistical measures (term frequency) and structural patterns (word co-occurrence graphs).
Thinking Exercise
Given this abstract:
"Deep learning has revolutionized natural language processing. Transformer models
like BERT and GPT have achieved state-of-the-art results on many benchmarks.
The attention mechanism allows these models to capture long-range dependencies."
- What keywords would you extract?
- Which words are too common to be keywords (stop words)?
- Which multi-word phrases are more informative than single words?
The Interview Questions They’ll Ask
- “Explain the TextRank algorithm for keyword extraction.”
- “How would you handle keyword extraction for very short documents?”
- “What’s the difference between extractive and abstractive keyword generation?”
- “How would you evaluate keyword extraction quality?”
- “How do you handle synonyms and related terms in keyword extraction?”
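Of the three methods, TF-IDF is the quickest to sketch. This toy version over a three-document corpus uses deliberately naive tokenization, a minimal stop list, and a smoothed idf; a real extractor would use proper tokenization and candidate phrases, not just single words:

```python
import math
from collections import Counter

STOP = {"the", "a", "of", "and", "to", "on", "in", "is", "has"}

def tokenize(text):
    return [w.strip(".,") for w in text.lower().split()
            if w.strip(".,") not in STOP]

corpus = [
    "deep learning has revolutionized language processing",
    "the transformer architecture relies on attention",
    "attention allows models to capture long-range dependencies",
]

def tfidf_keywords(doc_index, corpus, top_n=3):
    """Score terms of one document: frequent here, rare elsewhere wins."""
    docs = [tokenize(d) for d in corpus]
    tf = Counter(docs[doc_index])
    n_docs = len(docs)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in docs if term in d)       # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf
        scores[term] = (count / len(docs[doc_index])) * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(tfidf_keywords(1, corpus))
```

"attention" appears in two of the three documents, so its idf (and rank) drops: TF-IDF rewards terms that are distinctive for a document, not merely frequent.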
Project 11: The Text Summarizer
- Main Programming Language: Python
- Libraries: spaCy, sumy or transformers
- Difficulty: Advanced
- Time Estimate: 10-12 hours
What you’ll build: A text summarization system implementing both extractive (selecting important sentences) and abstractive (generating new text) approaches. It will create concise summaries of long documents while preserving key information.
Why it teaches the concept: Summarization is one of the hardest NLP tasks because it requires understanding both content and structure. You’ll learn how to identify important information and either select or generate summary text.
Core challenges you’ll face:
- Importance scoring → Which sentences matter most?
- Coherence → Making summaries read naturally
- Compression ratio → How much to shrink the original?
- Evaluation → ROUGE scores and human evaluation
Key Concepts:
- Extractive summarization: Selecting existing sentences
- Abstractive summarization: Generating new sentences
- ROUGE metrics: Evaluating summary quality
Prerequisites: Projects 1-10 completed
Real World Outcome
$ python summarizer.py --input long_article.txt --method extractive --ratio 0.2
Original: 2,500 words
Summary: 500 words (20% compression)
Extractive Summary:
───────────────────────────────────────────────────────────────
The rise of artificial intelligence has transformed multiple industries
in the past decade. Healthcare providers now use AI for diagnosis, while
financial institutions deploy it for fraud detection. Despite these
advances, concerns about job displacement and algorithmic bias remain...
───────────────────────────────────────────────────────────────
Sentence Importance Scores:
1. "The rise of artificial intelligence..." (0.92)
2. "Healthcare providers now use AI..." (0.87)
3. "Despite these advances, concerns..." (0.84)
...
$ python summarizer.py --input long_article.txt --method abstractive --max-length 100
Abstractive Summary:
───────────────────────────────────────────────────────────────
AI has revolutionized healthcare and finance over the past decade,
enabling better diagnosis and fraud detection. However, concerns
about job loss and bias persist.
───────────────────────────────────────────────────────────────
The Core Question You’re Answering
“How do we condense information while preserving meaning and readability?”
Humans naturally summarize—we can tell a friend about a movie in one sentence or ten. Teaching machines this requires understanding what makes information “important” and how to express it concisely.
Thinking Exercise
Consider this paragraph:
"The quick brown fox jumps over the lazy dog. The fox was very fast and agile.
The dog, being lazy, didn't move at all. This sentence is about foxes and dogs."
- Which sentence is most important? Why?
- Which sentence is redundant?
- Write a one-sentence extractive summary.
- Write a one-sentence abstractive summary.
The Interview Questions They’ll Ask
- “What’s the difference between extractive and abstractive summarization?”
- “How does ROUGE evaluation work?”
- “What are the challenges of multi-document summarization?”
- “How would you handle summarization for very long documents (100+ pages)?”
- “What are the ethical considerations in automatic summarization?”
Hints in Layers
Hint 1: Extractive Approach
Score sentences by: position (first sentences often important), TF-IDF of terms, presence of named entities, length.
Hint 2: TextRank for Sentences
Build a graph where sentences are nodes, edges are similarity scores. PageRank identifies important sentences.
Hint 3: Abstractive with Transformers
Use transformers library with pipeline("summarization") for quick abstractive summaries.
Hint 4: Evaluation
Use rouge-score library: scorer.score(reference, hypothesis) returns ROUGE-1, ROUGE-2, ROUGE-L.
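Hint 1 as code: a bare-bones extractive scorer that combines average word frequency with a lead-sentence bonus. TF-IDF weighting and entity features, which a real system would add, are omitted for brevity:

```python
from collections import Counter

def summarize(sentences, ratio=0.5):
    """Keep the top `ratio` of sentences by score, in original order."""
    norm = lambda w: w.lower().strip(".,!?")
    freq = Counter(norm(w) for s in sentences for w in s.split())
    scores = []
    for i, sent in enumerate(sentences):
        toks = [norm(w) for w in sent.split()]
        freq_score = sum(freq[t] for t in toks) / max(len(toks), 1)
        position_bonus = 1.0 if i == 0 else 0.0  # lead sentence matters
        scores.append((freq_score + position_bonus, i))
    keep = max(1, int(len(sentences) * ratio))
    # pick the highest-scoring sentences, then restore document order
    chosen = sorted(sorted(scores, reverse=True)[:keep], key=lambda x: x[1])
    return [sentences[i] for _, i in chosen]

sents = [
    "AI has transformed healthcare and finance.",
    "Healthcare providers use AI for diagnosis.",
    "My lunch today was a sandwich.",
]
print(summarize(sents, ratio=0.34))
```

Re-sorting the chosen sentences back into document order is what keeps extractive summaries readable; selection order alone would jumble the narrative.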
Project 12: The Question Answering System
- Main Programming Language: Python
- Libraries: spaCy, transformers
- Difficulty: Advanced
- Time Estimate: 12-15 hours
What you’ll build: An extractive question answering system that finds answers to questions within a given context document. It will identify the exact span of text that answers the question.
Why it teaches the concept: QA is the culmination of many NLP skills—understanding questions, finding relevant passages, and extracting precise answers. It’s the foundation for chatbots, search engines, and virtual assistants.
Core challenges you’ll face:
- Question understanding → What is the question actually asking?
- Passage retrieval → Finding relevant text for the question
- Answer extraction → Identifying the exact answer span
- Confidence scoring → Knowing when the answer isn’t in the text
Key Concepts:
- Extractive QA: Finding answers in existing text
- Generative QA: Creating answers from context
- Reading comprehension: Understanding text to answer questions
Prerequisites: Projects 1-11 completed, transformer basics helpful
Real World Outcome
$ python qa_system.py --context document.txt
Question Answering System Ready
Context loaded: 2,500 words
> What year was the company founded?
Answer: "1998" (Confidence: 0.94)
Context: "...the company was founded in 1998 by two Stanford students..."
> Who is the current CEO?
Answer: "Sundar Pichai" (Confidence: 0.89)
Context: "...Sundar Pichai has served as CEO since 2015..."
> What is the company's market cap?
Answer: "Not found in context" (Confidence: 0.12)
Reason: The document doesn't contain market capitalization information.
> When did the company go public?
Answer: "August 2004" (Confidence: 0.87)
Context: "...Google's IPO in August 2004 raised $1.67 billion..."
Type 'quit' to exit.
The Core Question You’re Answering
“How can a machine find specific answers to natural language questions within text?”
Unlike search (which finds relevant documents), QA finds exact answers. This requires understanding both the question’s intent and the passage’s meaning well enough to identify where the answer lies.
Thinking Exercise
Given this passage:
"Albert Einstein was born in Ulm, Germany on March 14, 1879.
He developed the theory of relativity while working at the Swiss Patent Office.
Einstein received the Nobel Prize in Physics in 1921 for his explanation of
the photoelectric effect, not for relativity as commonly believed."
For each question, identify:
- What type of answer is expected (date, person, location, etc.)?
- Which sentence contains the answer?
- What is the exact answer span?
Questions:
- “Where was Einstein born?”
- “When did Einstein win the Nobel Prize?”
- “Why did Einstein win the Nobel Prize?”
The Interview Questions They’ll Ask
- “What’s the difference between extractive and generative QA?”
- “How do transformer models like BERT approach extractive QA?”
- “What is the SQuAD dataset and why is it important?”
- “How would you handle questions that can’t be answered from the context?”
- “How would you scale QA to millions of documents?”
Hints in Layers
Hint 1: Transformer Pipeline
Use pipeline("question-answering") from transformers for quick implementation.
Hint 2: spaCy Integration
Use spaCy for preprocessing: sentence segmentation, entity recognition to narrow answer candidates.
Hint 3: Confidence Threshold
Set a minimum confidence threshold (e.g., 0.3) below which return “answer not found.”
Hint 4: Multi-passage
For long documents, chunk into passages, run QA on each, take highest confidence answer.
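Hints 3 and 4 as code: chunk the document into overlapping windows, run the QA model on each, and keep the highest-confidence answer. `answer_fn` is a hypothetical stand-in for a real QA callable such as transformers' question-answering pipeline:

```python
def chunk(text, size=100, overlap=20):
    """Split text into overlapping word windows so answers that straddle
    a chunk boundary are not lost."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def best_answer(question, document, answer_fn, threshold=0.3):
    """Run answer_fn(question, passage) per chunk; apply the confidence
    threshold from Hint 3 so unanswerable questions return None."""
    candidates = [answer_fn(question, passage) for passage in chunk(document)]
    top = max(candidates, key=lambda a: a["score"])
    if top["score"] < threshold:
        return {"answer": None, "score": top["score"]}
    return top
```

The overlap size is a tradeoff: larger overlap means more redundant model calls but less risk of splitting an answer span across two chunks.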
Project 13: The Intent Recognition Chatbot
- Main Programming Language: Python
- Libraries: spaCy, custom classifier
- Difficulty: Advanced
- Time Estimate: 15-20 hours
What you’ll build: A task-oriented chatbot that recognizes user intents, extracts entities from queries, and responds appropriately. It will handle multiple intents like booking, querying, and canceling for a specific domain.
Why it teaches the concept: Chatbots combine intent classification, entity extraction, and dialogue management. You’ll learn how production virtual assistants work at a fundamental level.
Core challenges you’ll face:
- Intent classification → Recognizing what the user wants
- Entity extraction → Pulling out relevant parameters
- Slot filling → Getting all required information
- Dialogue management → Maintaining conversation state
Key Concepts:
- Intent recognition: Classifying user goals
- Entity slot filling: Extracting parameters
- Dialogue state tracking: Maintaining conversation context
Prerequisites: Projects 7-8 completed (text classification, custom NER)
Real World Outcome
$ python chatbot.py --domain restaurant_booking
Restaurant Booking Assistant
Type 'quit' to exit
User: I want to book a table for 4 people tomorrow evening
Bot: Great! I can help you book a table.
Intent: book_table (confidence: 0.94)
Extracted:
- party_size: 4
- date: tomorrow
- time: evening
What restaurant would you like to book at?
User: How about Italian place on Main Street
Bot: Looking for Italian restaurants on Main Street...
I found "Luigi's Trattoria" at 123 Main Street.
Confirming: Table for 4 at Luigi's Trattoria
Date: December 15, 2024 (tomorrow)
Time: Evening (7:00 PM default)
Would you like me to complete this reservation?
User: Yes please
Bot: Your reservation is confirmed!
Confirmation #: RT-2024-1215-0042
Is there anything else I can help you with?
User: Can I change it to 6 people?
Bot: Intent: modify_booking (confidence: 0.91)
I'll update your reservation from 4 to 6 people.
Updated successfully!
Session Summary:
- Intents recognized: book_table, confirm, modify_booking
- Entities extracted: party_size (2), date (1), time (1), cuisine (1), location (1)
- Successful completion: Yes
The Core Question You’re Answering
“How do we build a system that understands user requests in natural language and takes appropriate actions?”
Chatbots must bridge human language and system actions. This requires classifying what the user wants (intent), extracting relevant parameters (entities), and managing multi-turn conversations (state).
Thinking Exercise
For a flight booking chatbot, define:
- Intents (5+): What actions can users take?
- book_flight, search_flights, cancel_booking, check_status, …
- Entities (5+): What parameters are needed?
- origin_city, destination_city, departure_date, …
- Required slots for booking: Which entities must be filled?
- Clarification questions: What do you ask when slots are missing?
The Interview Questions They’ll Ask
- “How do you handle out-of-scope queries in a chatbot?”
- “What’s the difference between intent classification and entity extraction?”
- “How do you manage dialogue state across multiple turns?”
- “How would you handle ambiguous user inputs?”
- “What metrics would you use to evaluate chatbot performance?”
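A toy version of the three moving parts (intent classification, slot extraction, dialogue management) shows how they fit together. The keyword sets and regexes are illustrative placeholders; Projects 7 and 8 supply the trained classifier and custom NER that would replace them:

```python
import re

INTENT_KEYWORDS = {
    "book_table": {"book", "reserve", "table"},
    "cancel_booking": {"cancel"},
    "modify_booking": {"change", "update", "modify"},
}

def classify_intent(text):
    """Score intents by keyword overlap; no match means out-of-scope."""
    tokens = set(text.lower().split())
    scores = {intent: len(tokens & kws) for intent, kws in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "out_of_scope"

def extract_slots(text):
    """Pull parameters out of the utterance with simple patterns."""
    slots, lowered = {}, text.lower()
    if m := re.search(r"for (\d+) people", lowered):
        slots["party_size"] = int(m.group(1))
    if m := re.search(r"\b(today|tomorrow|tonight)\b", lowered):
        slots["date"] = m.group(1)
    return slots

REQUIRED = ["party_size", "date"]

def next_question(slots):
    """Dialogue management: ask for the first missing required slot."""
    missing = [s for s in REQUIRED if s not in slots]
    return f"What {missing[0].replace('_', ' ')} would you like?" if missing else None

utterance = "I want to book a table for 4 people tomorrow"
intent = classify_intent(utterance)
slots = extract_slots(utterance)
print(intent, slots, next_question(slots))
```

Slot filling drives the conversation: as long as `next_question` returns a prompt, the bot keeps asking; once all required slots are filled it can act.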
Project 14: The Language Detector
- Main Programming Language: Python
- Libraries: spaCy, langdetect or custom
- Difficulty: Intermediate
- Time Estimate: 5-6 hours
What you’ll build: A language detection system that identifies the language of text, handles mixed-language documents, and provides confidence scores for multiple candidate languages.
Why it teaches the concept: Language detection is often the first step in multilingual NLP pipelines. You’ll understand how statistical patterns (character n-grams) can identify language even from short text snippets.
Core challenges you’ll face:
- Short text detection → Identifying language from just a few words
- Similar languages → Distinguishing Spanish from Portuguese
- Mixed-language text → Handling code-switching
- Unknown languages → Graceful handling of unsupported languages
Key Concepts:
- Character n-grams: Statistical language signatures
- Language models: Probability distributions over text
- Multilingual processing: Handling multiple languages
Prerequisites: Projects 1-3 completed
Real World Outcome
$ python lang_detector.py --text "Bonjour, comment allez-vous?"
Language Detection Results:
┌──────────────┬────────────┬────────────┐
│ Language     │ Code       │ Confidence │
├──────────────┼────────────┼────────────┤
│ French       │ fr         │ 0.98       │
│ Italian      │ it         │ 0.01       │
│ Spanish      │ es         │ 0.01       │
└──────────────┴────────────┴────────────┘
$ python lang_detector.py --file multilingual_doc.txt --per-sentence
Per-Sentence Language Detection:
┌─────┬─────────────────────────────────────────┬──────────┬────────────┐
│ Line│ Text                                    │ Language │ Confidence │
├─────┼─────────────────────────────────────────┼──────────┼────────────┤
│ 1   │ Hello, how are you?                     │ en       │ 0.99       │
│ 2   │ Je vais très bien, merci.               │ fr       │ 0.97       │
│ 3   │ Danke schön!                            │ de       │ 0.95       │
│ 4   │ Muchas gracias por tu ayuda.            │ es       │ 0.98       │
└─────┴─────────────────────────────────────────┴──────────┴────────────┘
Language Distribution:
- English: 25%
- French: 25%
- German: 25%
- Spanish: 25%
The Core Question You’re Answering
“How can we automatically determine what language a piece of text is written in?”
Languages have distinctive statistical fingerprints—French has many “eux” and “tion” endings, German has compound words and umlauts. Character-level patterns are surprisingly effective for identification.
Thinking Exercise
Rank these text samples by detection difficulty (1=easiest, 5=hardest):
- “The quick brown fox jumps over the lazy dog” (English)
- “Hello” (English)
- “Das ist gut” (German)
- “Olá” vs “Hola” (Portuguese vs Spanish)
- “I am going al mercado to buy comida” (Code-switching)
What makes some cases harder than others?
The Interview Questions They’ll Ask
- “How does character n-gram language detection work?”
- “What minimum text length is needed for reliable detection?”
- “How would you handle code-switching (multiple languages in one text)?”
- “What’s the difference between language detection and script detection?”
- “How would you train a language detector for low-resource languages?”
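Character-trigram profiles are enough to see the idea. The per-language samples below are tiny illustrative snippets; a real detector builds its profiles from large corpora per language:

```python
from collections import Counter

# Training text per language (toy samples; real profiles use corpora).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "fr": "le renard brun saute par dessus le chien paresseux et le chat",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und die katze",
}

def trigrams(text):
    """Character trigrams, padded so word starts/ends count too."""
    text = f"  {text.lower()}  "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

PROFILES = {lang: trigrams(s) for lang, s in SAMPLES.items()}

def detect(text):
    """Score each language by trigram overlap with its profile,
    normalized to a rough confidence; best guess first."""
    grams = trigrams(text)
    scores = {
        lang: sum(min(count, profile[g]) for g, count in grams.items())
        for lang, profile in PROFILES.items()
    }
    total = sum(scores.values()) or 1
    return sorted(((s / total, lang) for lang, s in scores.items()), reverse=True)

print(detect("the dog and the cat")[0])  # best guess first
```

Even with these miniature profiles, frequent function-word trigrams like "the" or " le" dominate the score, which is why short texts full of content words are the hard case.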
Project 15: Custom Pipeline Components
- Main Programming Language: Python
- Libraries: spaCy
- Difficulty: Advanced
- Time Estimate: 8-10 hours
What you’ll build: Custom spaCy pipeline components that extend the library’s functionality—such as custom entity rulers, sentiment components, or domain-specific analyzers that integrate seamlessly into the processing pipeline.
Why it teaches the concept: Understanding spaCy’s pipeline architecture is essential for production NLP. You’ll learn how to add, modify, and optimize pipeline components for specific use cases.
Core challenges you’ll face:
- Component architecture → Understanding the spaCy component interface
- Extension attributes → Adding custom data to Doc, Span, Token
- Pipeline order → Understanding component dependencies
- Serialization → Saving and loading custom components
Key Concepts:
- Factory pattern: Creating reusable components
- Extension attributes: Doc.set_extension(), Span.set_extension()
- Pipeline configuration: config.cfg for component settings
Prerequisites: Projects 1-8 completed, solid spaCy understanding
Real World Outcome
$ python custom_pipeline.py --demo
Custom Pipeline Components Demo
Loading pipeline with custom components:
- 'sentiment': Adds sentiment score to each sentence
- 'profanity_filter': Flags inappropriate content
- 'acronym_resolver': Expands common acronyms
- 'readability': Calculates Flesch-Kincaid score
Processing: "The CEO announced that ROI exceeded expectations. This is great news!"
Pipeline Output:
┌────────────────────────────────────────────────────────────────────┐
│ Standard spaCy Output: │
│ Entities: CEO (PERSON implied), ROI (not recognized by default) │
│ POS Tags: DET NOUN VERB SCONJ NOUN VERB NOUN PUNCT ... │
├────────────────────────────────────────────────────────────────────┤
│ Custom Extensions: │
│ │
│ doc._.sentiment_score: 0.75 (positive) │
│ doc._.readability_score: 58.2 (fairly easy to read) │
│ doc._.profanity_found: False │
│ doc._.acronyms_found: [('ROI', 'Return on Investment')] │
│ │
│ Per-sentence sentiment: │
│ - Sent 1: "The CEO announced..." (neutral: 0.12) │
│ - Sent 2: "This is great news!" (positive: 0.89) │
└────────────────────────────────────────────────────────────────────┘
$ python -c "import spacy; nlp = spacy.load('en_core_web_sm'); nlp.add_pipe('sentiment'); print(nlp.pipe_names)"
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'sentiment']
The Core Question You’re Answering
“How do we extend spaCy’s capabilities for domain-specific needs?”
spaCy provides a powerful foundation, but real applications need customization. Understanding the component system lets you add any functionality while maintaining spaCy’s speed and integration.
Thinking Exercise
Design a custom pipeline component for analyzing legal documents:
- What custom extension attributes would you add?
  - doc._.legal_terms: List of legal terminology found
  - doc._.contract_type: Classification of document type
  - …
- What existing components does your component depend on?
- Where in the pipeline should it run (before/after what)?
- How would you handle serialization for model deployment?
The Interview Questions They’ll Ask
- “What’s the difference between a pipeline component and an extension attribute?”
- “How does spaCy’s @Language.component decorator work?”
- “What is the config.cfg system and why is it important?”
- “How would you add a custom component that needs a machine learning model?”
- “What are the performance implications of adding custom components?”
Hints in Layers
Hint 1: Basic Component
from spacy.language import Language

@Language.component("my_component")
def my_component(doc):
    # Inspect or annotate the doc, then return it for the next component
    return doc

nlp.add_pipe("my_component")
Hint 2: Extension Attributes
Doc.set_extension("my_attr", default=None)
doc._.my_attr = computed_value
Hint 3: Factory Pattern
@Language.factory("configurable_component", default_config={"threshold": 0.5})
def create_component(nlp, name, threshold: float):
    return ConfigurableComponent(threshold)
Hint 4: Serialization
Implement to_disk() and from_disk() methods for saving component state.
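Putting the hints together, here is a minimal end-to-end sketch. The component name `char_counter` and attribute `char_count` are illustrative, and a blank pipeline is used so no model download is required:

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Register the custom attribute once, at import time
if not Doc.has_extension("char_count"):
    Doc.set_extension("char_count", default=0)

@Language.component("char_counter")  # illustrative component name
def char_counter(doc):
    # Stateless component: compute a value and stash it on the Doc
    doc._.char_count = len(doc.text)
    return doc

nlp = spacy.blank("en")      # blank pipeline: tokenizer only
nlp.add_pipe("char_counter")

doc = nlp("Hello world")
print(nlp.pipe_names)        # ['char_counter']
print(doc._.char_count)      # 11
```

The same component added to a full `en_core_web_sm` pipeline would run after the built-in components unless you pass `first=True` or `before=`/`after=` to `add_pipe`.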
Project 16: Training spaCy Models from Scratch
- Main Programming Language: Python
- Libraries: spaCy
- Difficulty: Advanced
- Time Estimate: 15-20 hours
What you’ll build: Train spaCy models from scratch for a specific domain or language, including tokenization rules, POS tagging, NER, and dependency parsing. You’ll understand the full model training pipeline.
Why it teaches the concept: While pre-trained models work well for general text, many applications need domain-specific models. Understanding training from scratch reveals how spaCy’s models actually work.
Core challenges you’ll face:
- Training data preparation → Converting annotations to spaCy format
- Config optimization → Tuning hyperparameters for your task
- Evaluation → Measuring model quality and diagnosing issues
- Iteration → Improving models based on error analysis
Key Concepts:
- Training config system: config.cfg structure and settings
- Corpus preparation: Creating training and dev sets
- Model architectures: Understanding network options
Prerequisites: Projects 8, 15 completed, ML training experience
Real World Outcome
$ python train_model.py init-config --lang en --pipeline ner,textcat
Generated: config.cfg
$ python train_model.py prepare-data --input annotated_data.json --output corpus/
Preparing training corpus...
Training examples: 8,000
Development examples: 2,000
Saved: corpus/train.spacy, corpus/dev.spacy
$ python -m spacy train config.cfg --output models/my_model --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy
Training Pipeline: ['ner', 'textcat']
================================================================================
E # LOSS NER LOSS TEXTC ENTS_F CATS_SCORE SCORE
--- ------ -------- ---------- ------ ---------- ------
0 0 25.12 32.45 0.12 0.34 0.23
1 200 12.45 18.23 0.45 0.56 0.50
2 400 8.32 12.11 0.62 0.68 0.65
3 600 5.21 8.45 0.74 0.76 0.75
4 800 3.45 6.23 0.81 0.82 0.81
5 1000 2.12 4.56 0.86 0.87 0.86
================================================================================
Best model saved: models/my_model/model-best
$ python -m spacy evaluate models/my_model/model-best corpus/test.spacy
================================== Results ==================================
TOK 100.00
NER P 86.23
NER R 84.56
NER F 85.38
TEXTCAT (macro F1) 87.12
SPEED 12,543 words/sec
The Core Question You’re Answering
“How do we create NLP models tailored to specific domains and tasks?”
Pre-trained models are trained on general web text. Legal documents, medical records, and scientific papers have different language patterns. Training custom models captures these domain-specific patterns.
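To make the data-preparation step concrete, here is a sketch of converting annotated examples into spaCy's binary training format with `DocBin`. The texts, character offsets, and output file name are illustrative:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # tokenizer only; no trained model required

# Hypothetical annotations: (text, [(start_char, end_char, label), ...])
examples = [
    ("Apple is looking at buying a startup", [(0, 5, "ORG")]),
    ("Tim Cook visited Berlin", [(0, 8, "PERSON"), (17, 23, "GPE")]),
]

db = DocBin()
for text, annotations in examples:
    doc = nlp.make_doc(text)
    spans = [doc.char_span(start, end, label=label)
             for start, end, label in annotations]
    # char_span returns None if offsets don't align with token boundaries
    doc.ents = [s for s in spans if s is not None]
    db.add(doc)

db.to_disk("train.spacy")
print(len(db))  # 2
```

The resulting `train.spacy` file is what `--paths.train` points at in the `spacy train` command shown above.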
Thinking Exercise
You need to build an NLP model for analyzing social media posts about products:
- Data requirements: How many examples do you need for each component?
  - NER: ~500+ examples per entity type
  - Classification: ~1000+ examples per category
- Annotation strategy: How will you create training data efficiently?
- Evaluation plan: How will you know the model is good enough?
- Iteration strategy: How will you improve weak areas?
The Interview Questions They’ll Ask
- “What’s the minimum training data needed for a custom NER model?”
- “How do you choose hyperparameters for spaCy training?”
- “What’s transfer learning and how does spaCy use it?”
- “How would you handle a model that has good precision but poor recall?”
- “What’s the difference between training from scratch vs fine-tuning?”
Project 17: spaCy with Transformers
- Main Programming Language: Python
- Libraries: spaCy, spacy-transformers
- Difficulty: Advanced
- Time Estimate: 12-15 hours
What you’ll build: Integrate transformer models (BERT, RoBERTa) into spaCy pipelines for state-of-the-art accuracy on NER, classification, and other tasks. You’ll understand when transformers are worth the computational cost.
Why it teaches the concept: Transformers have revolutionized NLP accuracy but are computationally expensive. Understanding how to integrate them with spaCy lets you build high-accuracy production systems.
Core challenges you’ll face:
- Model selection → Choosing the right transformer for your task
- Performance tradeoffs → Accuracy vs. speed vs. memory
- Fine-tuning → Adapting pre-trained transformers to your domain
- Deployment → Running transformers in production
Key Concepts:
- Transformer architecture basics: Attention, encoders
- spacy-transformers: Integration layer
- Model distillation: Making transformers production-ready
Prerequisites: Projects 1-16 completed, understanding of deep learning helpful
Real World Outcome
$ python transformer_nlp.py compare --input test_data.txt
Model Comparison: Standard vs Transformer
=========================================
Task: Named Entity Recognition
┌─────────────────┬────────────┬────────────┬────────────┐
│ Model │ F1 Score │ Speed │ Memory │
├─────────────────┼────────────┼────────────┼────────────┤
│ en_core_web_sm │ 0.84 │ 10,500/sec │ 50MB │
│ en_core_web_trf │ 0.92 │ 150/sec │ 1.2GB │
└─────────────────┴────────────┴────────────┴────────────┘
Task: Text Classification
┌─────────────────┬────────────┬────────────┬────────────┐
│ Model │ Accuracy │ Speed │ Memory │
├─────────────────┼────────────┼────────────┼────────────┤
│ en_core_web_sm │ 0.87 │ 8,200/sec │ 50MB │
│ en_core_web_trf │ 0.95 │ 120/sec │ 1.2GB │
└─────────────────┴────────────┴────────────┴────────────┘
Recommendation:
- Use transformer if: Accuracy is critical, batch processing OK, have GPU
- Use standard if: Real-time requirements, limited memory, high volume
$ python transformer_nlp.py fine-tune --base roberta-base --data domain_data/ --output custom_trf/
Fine-tuning RoBERTa for domain-specific NER...
Base model: roberta-base
Training examples: 5,000
Epoch 1: Loss=2.34, F1=0.75
Epoch 2: Loss=1.23, F1=0.85
Epoch 3: Loss=0.67, F1=0.91
Model saved: custom_trf/
Domain-specific F1: 0.91 (vs 0.78 with generic model)
The Core Question You’re Answering
“When and how should we use transformers for NLP tasks?”
Transformers achieve state-of-the-art accuracy but are 50-100x slower than traditional models. Understanding this tradeoff—and how to mitigate it—is essential for production NLP systems.
Thinking Exercise
You’re building an NLP system for processing customer support tickets:
- Volume: 100,000 tickets/day
- Latency requirement: < 200ms response time
- Accuracy need: High (affects customer routing)
Questions:
- Can you use transformers? What constraints?
- What architecture would you design?
- How would you handle the accuracy vs. speed tradeoff?
The Interview Questions They’ll Ask
- “What is attention and why is it important for NLP?”
- “When would you choose a transformer model over a traditional approach?”
- “How do you deploy transformer models in production?”
- “What is model distillation and when would you use it?”
- “Explain the difference between BERT, RoBERTa, and DistilBERT.”
Hints in Layers
Hint 1: Installation
pip install spacy-transformers
python -m spacy download en_core_web_trf
Hint 2: Basic Usage
nlp = spacy.load("en_core_web_trf")
doc = nlp("Text to process")
# Works just like regular spaCy
Hint 3: Custom Transformer Config
[components.transformer]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
Hint 4: GPU Acceleration
Enable GPU with spacy.prefer_gpu() before loading the model for 5-10x speedup.
Project 18: Production Deployment Pipeline
- Main Programming Language: Python
- Libraries: spaCy, FastAPI, Docker
- Difficulty: Advanced
- Time Estimate: 15-20 hours
What you’ll build: A production-ready NLP API service that handles high-volume text processing with proper error handling, monitoring, scaling, and deployment. This is the capstone that ties everything together.
Why it teaches the concept: Building NLP models is only half the battle—deploying them reliably at scale is equally important. You’ll learn the engineering required for production NLP systems.
Core challenges you’ll face:
- API design → Creating efficient, well-documented endpoints
- Scalability → Handling thousands of concurrent requests
- Monitoring → Tracking performance and errors
- Model updates → Deploying new models without downtime
Key Concepts:
- REST API design: Efficient endpoints for NLP
- Containerization: Docker for reproducible deployments
- Load balancing: Handling high request volumes
Prerequisites: All previous projects, basic DevOps knowledge
Real World Outcome
$ docker build -t nlp-api .
$ docker run -p 8000:8000 nlp-api
NLP API Service Starting...
Loading models: en_core_web_md
Health check: OK
Listening on: http://0.0.0.0:8000
$ curl -X POST "http://localhost:8000/api/v1/analyze" \
-H "Content-Type: application/json" \
-d '{"text": "Apple CEO Tim Cook announced new products.", "tasks": ["ner", "pos"]}'
{
"success": true,
"request_id": "req_abc123",
"processing_time_ms": 45,
"results": {
"entities": [
{"text": "Apple", "label": "ORG", "start": 0, "end": 5},
{"text": "Tim Cook", "label": "PERSON", "start": 10, "end": 18}
],
"tokens": [
{"text": "Apple", "pos": "PROPN", "lemma": "Apple"},
{"text": "CEO", "pos": "NOUN", "lemma": "ceo"},
...
]
}
}
$ curl "http://localhost:8000/api/v1/health"
{
"status": "healthy",
"model_loaded": true,
"requests_processed": 1247,
"average_latency_ms": 38,
"memory_usage_mb": 512
}
$ curl "http://localhost:8000/api/v1/batch" \
-H "Content-Type: application/json" \
-d '{"texts": ["Text 1...", "Text 2...", "Text 3..."], "tasks": ["ner"]}'
{
"success": true,
"batch_size": 3,
"total_time_ms": 89,
"results": [...]
}
The Core Question You’re Answering
“How do we serve NLP models reliably at scale in production?”
A model that works in a Jupyter notebook is far from production-ready. Production requires handling failures, managing resources, monitoring performance, and updating without downtime.
Thinking Exercise
Design a production NLP service for a company processing 1 million documents per day:
- Architecture: How many instances? Load balancing?
- Latency: What P99 latency can you achieve?
- Reliability: How do you handle model failures?
- Updates: How do you deploy new models?
- Monitoring: What metrics do you track?
The Interview Questions They’ll Ask
- “How would you scale an NLP service to handle 10,000 requests per second?”
- “What’s the difference between horizontal and vertical scaling for NLP?”
- “How do you handle model versioning in production?”
- “What monitoring would you set up for an NLP API?”
- “How do you ensure high availability for an NLP service?”
Hints in Layers
Hint 1: FastAPI Setup
from fastapi import FastAPI
from pydantic import BaseModel
import spacy

app = FastAPI()
nlp = spacy.load("en_core_web_md")

class AnalyzeRequest(BaseModel):
    text: str  # request body schema; a bare str param would become a query param

@app.post("/analyze")
async def analyze(req: AnalyzeRequest):
    doc = nlp(req.text)
    return {"entities": [(e.text, e.label_) for e in doc.ents]}
Hint 2: Batch Processing
Use nlp.pipe(texts) for efficient batch processing instead of processing one at a time.
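A quick sketch of the two approaches (a blank pipeline is used so no model download is needed; timings are omitted):

```python
import spacy

nlp = spacy.blank("en")
texts = [f"Document number {i}" for i in range(1000)]

# Slow pattern: one call per text, paying per-call overhead each time
docs_slow = [nlp(t) for t in texts]

# Fast pattern: nlp.pipe streams texts through the pipeline in batches
docs_fast = list(nlp.pipe(texts, batch_size=256))

print(len(docs_fast))  # 1000
```

With a full pipeline (tagger, parser, NER) the batching gain is much larger than with this blank pipeline, because the statistical components can process batches efficiently.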
Hint 3: Async Handling
Use concurrent.futures.ProcessPoolExecutor for CPU-bound NLP tasks so that heavy processing doesn't block the event loop.
Hint 4: Docker Configuration
FROM python:3.10-slim
RUN pip install spacy fastapi uvicorn
RUN python -m spacy download en_core_web_md
COPY . /app
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Project Comparison Table
| Project | Difficulty | Time | Key Skills | Real-World Application |
|---|---|---|---|---|
| 1. Smart Tokenizer | Beginner | 3-4h | Tokenization, spaCy basics | Text preprocessing |
| 2. Grammar Detective | Beginner | 4-5h | POS tagging, visualization | Writing analysis |
| 3. Entity Extractor | Beginner-Int | 5-6h | NER, displacy | Information extraction |
| 4. Dependency Visualizer | Intermediate | 6-7h | Parsing, tree structures | Sentence analysis |
| 5. Lemma vs Stem | Intermediate | 4-5h | Morphology, comparison | Search indexing |
| 6. Sentiment Analyzer | Intermediate | 6-8h | Classification, lexicons | Social media analysis |
| 7. Text Classifier | Intermediate | 8-10h | ML training, evaluation | Content categorization |
| 8. Custom NER | Int-Advanced | 10-12h | Annotation, training | Domain-specific extraction |
| 9. Similarity Engine | Intermediate | 6-8h | Vectors, embeddings | Document search |
| 10. Keyword Extractor | Intermediate | 5-7h | Graph algorithms, TF-IDF | SEO, tagging |
| 11. Text Summarizer | Advanced | 10-12h | Extractive/abstractive | Content condensation |
| 12. QA System | Advanced | 12-15h | Reading comprehension | Virtual assistants |
| 13. Intent Chatbot | Advanced | 15-20h | Dialogue management | Customer support |
| 14. Language Detector | Intermediate | 5-6h | Statistical patterns | Multilingual processing |
| 15. Custom Pipelines | Advanced | 8-10h | spaCy architecture | Custom NLP tools |
| 16. Training from Scratch | Advanced | 15-20h | Full training pipeline | Domain adaptation |
| 17. Transformers | Advanced | 12-15h | BERT, fine-tuning | High-accuracy NLP |
| 18. Production Deploy | Advanced | 15-20h | API, Docker, scaling | Enterprise deployment |
Summary and Learning Paths
By Difficulty Level
Beginner Path (Projects 1-3) Start here if you’re new to NLP. You’ll learn fundamental concepts and spaCy basics.
- Time: 2-3 weeks
- Outcome: Basic NLP pipeline understanding
Intermediate Path (Projects 4-10) For those with Python experience who want practical NLP skills.
- Time: 4-6 weeks
- Outcome: Can build custom NLP applications
Advanced Path (Projects 11-18) For those ready to build production-grade NLP systems.
- Time: 8-12 weeks
- Outcome: Production-ready NLP engineer
By Application Area
Information Extraction: Projects 3, 8, 9, 10
Classification & Sentiment: Projects 6, 7, 14
Understanding & Generation: Projects 11, 12, 13
Production Engineering: Projects 15, 16, 17, 18
Essential Reading Checklist
- “Natural Language Processing with Python and spaCy” by Yuli Vasiliev (No Starch Press)
- spaCy official documentation (spacy.io/usage)
- freeCodeCamp NLP Course by Dr. William Mattingly
- “Speech and Language Processing” by Jurafsky & Martin (free online)
- Explosion AI blog (explosion.ai/blog)
Appendix: Quick Reference
spaCy Model Sizes
| Model | Size | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| `en_core_web_sm` | 12MB | Fastest | Good | Development, prototyping |
| `en_core_web_md` | 40MB | Fast | Better | General use, has vectors |
| `en_core_web_lg` | 560MB | Medium | Best (non-trf) | Production, similarity |
| `en_core_web_trf` | 438MB | Slowest | Best | High-accuracy requirements |
Common Token Attributes
| Attribute | Type | Description |
|---|---|---|
| `text` | str | Original text |
| `lemma_` | str | Base form |
| `pos_` | str | Coarse POS tag |
| `tag_` | str | Fine-grained POS tag |
| `dep_` | str | Dependency relation |
| `ent_type_` | str | Entity type |
| `is_stop` | bool | Is stop word |
| `is_punct` | bool | Is punctuation |
| `vector` | ndarray | Word vector |
Entity Types (English Models)
| Type | Description | Example |
|---|---|---|
| PERSON | People | “Barack Obama” |
| ORG | Organizations | “Apple Inc.” |
| GPE | Countries, cities | “France” |
| LOC | Non-GPE locations | “Mount Everest” |
| DATE | Dates | “June 5, 2023” |
| TIME | Times | “3:00 PM” |
| MONEY | Monetary values | “$500” |
| PERCENT | Percentages | “25%” |
| PRODUCT | Products | “iPhone” |
| EVENT | Events | “Olympics” |
| WORK_OF_ART | Titles | “The Matrix” |
| LAW | Laws | “First Amendment” |
| LANGUAGE | Languages | “English” |
Dependency Relations
| Relation | Description | Example |
|---|---|---|
| nsubj | Nominal subject | “John runs” |
| dobj | Direct object | “I see him” |
| iobj | Indirect object | “Give me the book” |
| prep | Prepositional modifier | “the book on the table” |
| pobj | Object of preposition | “on the table” |
| amod | Adjectival modifier | “the red car” |
| det | Determiner | “the book” |
| ROOT | Root of sentence | Typically the main verb |
Last updated: 2026-01-01 | Total projects: 18 | Estimated completion time: 100-150 hours | Difficulty range: Beginner to Advanced