NLP with Python and spaCy: From Text to Intelligence
Goal: Master Natural Language Processing from the ground up using Python and spaCy. You will progress from understanding how computers break text into meaningful units to building production-ready NLP pipelines that extract entities, classify documents, answer questions, and power intelligent chatbots. By the end, you’ll deeply understand linguistic concepts, statistical models, and the engineering required to process human language at scale.
Why NLP Matters
In 1950, Alan Turing proposed the “Imitation Game”—could a machine converse so naturally that a human couldn’t distinguish it from another human? Today, NLP has moved from science fiction to science fact. Every time you ask Siri a question, get an email auto-completed, or have spam filtered from your inbox, you’re using NLP.
Understanding NLP matters because:
- Ubiquity: The global NLP market was valued at $29.1 billion in 2024 and is projected to reach $158.6 billion by 2032. Every industry—healthcare, finance, legal, customer service—is being transformed by language AI.
- Data Explosion: 80% of enterprise data is unstructured text—emails, documents, social media, chat logs. NLP is the key to unlocking insights from this massive, untapped resource.
- Foundation for Modern AI: Large Language Models (LLMs) like GPT-4 and Claude are built on NLP foundations. Understanding tokenization, embeddings, and linguistic structure is essential for working with any modern AI system.
- Career Leverage: NLP engineers command premium salaries. According to Glassdoor, the average NLP engineer salary in the US is $156,000, with senior roles exceeding $250,000.
The NLP Landscape
┌─────────────────────────────────────────────────────────────────┐
│ │
│ Raw Text ──► Preprocessing ──► Analysis ──► Understanding │
│ │
│ "The quick Tokenization POS Tags Meaning & │
│ brown fox" Lemmatization Entities Intent │
│ Cleaning Relations │
│ │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Traditional NLP vs Modern NLP │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Hand-crafted │ │ Learned from │ │
│ │ rules │ │ data │ │
│ │ Regex patterns │ │ Neural networks │ │
│ │ Grammar parsers │ │ Transformers │ │
│ └─────────────────┘ └─────────────────┘ │
│ │
│ spaCy bridges both worlds: industrial-strength, efficient, │
│ with pre-trained models and transformer integration │
│ │
└─────────────────────────────────────────────────────────────────┘
Why spaCy?
spaCy has become the industry standard for production NLP because:
- Speed: Written in Cython, spaCy is one of the fastest NLP libraries—10-20x faster than NLTK for many tasks.
- Batteries Included: Tokenization support for 75+ languages, trained pipelines for 25+ of them, and sensible defaults.
- Production-Ready: Designed for real applications, not just research. Clear APIs, consistent interfaces, and excellent documentation.
- Modern Architecture: Native support for transformers (BERT, RoBERTa), custom training pipelines, and extensible components.
- Active Development: Backed by Explosion AI, with regular updates, new features, and strong community support.
spaCy's Processing Pipeline
┌─────────────────────────────────────────────────────────────────┐
│ │
│ Text ──►┌─────────┐──►┌─────────┐──►┌─────────┐──►┌─────────┐ │
│ │Tokenizer│ │ Tagger │ │ Parser │ │ NER │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │ │ │ │ │
│ ▼ ▼ ▼ ▼ │
│ [Tokens] [POS Tags] [Dependencies] [Entities] │
│ │
│ All components are modular and can be: │
│ • Enabled/disabled │
│ • Replaced with custom implementations │
│ • Extended with new functionality │
│ │
└─────────────────────────────────────────────────────────────────┘
Prerequisites & Background Knowledge
Essential Prerequisites (Must Have)
- Python Fundamentals: Functions, classes, list comprehensions, decorators
- Basic Data Structures: Lists, dictionaries, sets, tuples
- File I/O: Reading/writing text files, CSV handling
- Command Line Basics: Running Python scripts, pip installation
Helpful But Not Required
- Basic Statistics: Mean, variance, probability distributions
- Regular Expressions: Pattern matching (helpful for text preprocessing)
- Machine Learning Basics: Training/test splits, evaluation metrics
- Linguistics Concepts: Grammar, parts of speech (you’ll learn these)
Self-Assessment Questions
Before starting, you should be able to answer:
- “How would you read a large text file line by line in Python without loading it all into memory?”
- “What’s the difference between a list and a generator in Python?”
- “How would you count word frequencies in a string using a dictionary?”
- “What does pip install do, and what’s a virtual environment?”
If you struggle with these, spend a few days on Python basics first.
Development Environment Setup
# Create a virtual environment
python -m venv nlp-env
source nlp-env/bin/activate # On Windows: nlp-env\Scripts\activate
# Install spaCy
pip install spacy
# Download English language model (medium size, good balance)
python -m spacy download en_core_web_md
# For transformer models (optional, larger)
python -m spacy download en_core_web_trf
# Additional useful libraries
pip install pandas numpy scikit-learn matplotlib seaborn
# For Jupyter notebooks (recommended)
pip install jupyter
Time Investment
| Experience Level | Expected Time |
|---|---|
| Python beginner | 12-16 weeks |
| Python intermediate | 8-10 weeks |
| Some ML/NLP experience | 4-6 weeks |
Reality Check
NLP is challenging because human language is inherently ambiguous. “Time flies like an arrow; fruit flies like a banana” contains the same words in similar patterns but wildly different meanings. You’ll encounter edge cases constantly. Embrace the ambiguity—it’s what makes NLP fascinating.
Core Concept Analysis
1. Tokenization: Breaking Text into Units
Tokenization is the first and most fundamental step in any NLP pipeline. It converts raw text into discrete units (tokens) that can be processed.
Input: "Dr. Smith doesn't believe U.S.A. won't win."
Naive Split (by spaces):
["Dr.", "Smith", "doesn't", "believe", "U.S.A.", "won't", "win."]
spaCy Tokenization:
["Dr.", "Smith", "does", "n't", "believe", "U.S.A.", "wo", "n't", "win", "."]
Why the difference?
- Contractions are split: "doesn't" → "does" + "n't"
- Punctuation is separate: "win." → "win" + "."
- Abbreviations preserved: "Dr.", "U.S.A."
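You can reproduce this comparison in a few lines. As a sketch: spacy.blank("en") loads only the English tokenizer rules (no trained model download needed), which is enough to see contractions split and abbreviations preserved:

```python
import spacy

# spacy.blank("en") carries the English tokenizer's rules and
# exception table (contractions, abbreviations, punctuation)
# without requiring any downloaded model.
nlp = spacy.blank("en")

text = "Dr. Smith doesn't believe U.S.A. won't win."
naive = text.split()                  # naive whitespace split
tokens = [t.text for t in nlp(text)]  # spaCy tokenization

print("Naive:", naive)
print("spaCy:", tokens)
```

Note how "doesn't" splits into "does" + "n't" and the final period separates from "win", while "Dr." survives intact thanks to the tokenizer's exception table.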
2. Part-of-Speech Tagging: Understanding Grammar
POS tagging assigns grammatical labels to each token—noun, verb, adjective, etc.
"The quick brown fox jumps over the lazy dog"
Token │ POS Tag │ Meaning
──────────┼─────────┼─────────────────
The │ DET │ Determiner
quick │ ADJ │ Adjective
brown │ ADJ │ Adjective
fox │ NOUN │ Noun
jumps │ VERB │ Verb
over │ ADP │ Adposition (preposition)
the │ DET │ Determiner
lazy │ ADJ │ Adjective
dog │ NOUN │ Noun
Fine-Grained Tags (Penn Treebank):
jumps → VBZ (Verb, 3rd person singular present)
fox → NN (Noun, singular)
3. Named Entity Recognition (NER): Finding Real-World Entities
NER identifies and classifies named entities in text—people, organizations, locations, dates, etc.
"Apple CEO Tim Cook announced a new product at WWDC in San Francisco on June 5th."
Entity │ Label │ Description
──────────────┼───────┼─────────────────────
Apple │ ORG │ Organization
Tim Cook │ PERSON│ Person name
WWDC │ EVENT │ Event name
San Francisco │ GPE │ Geopolitical entity (city)
June 5th │ DATE │ Date expression
spaCy's Built-in Entity Types:
┌────────┬──────────────────────────────────────┐
│ PERSON │ People, including fictional │
│ ORG │ Companies, agencies, institutions │
│ GPE │ Countries, cities, states │
│ LOC │ Non-GPE locations (mountains, rivers)│
│ DATE │ Absolute or relative dates │
│ TIME │ Times smaller than a day │
│ MONEY │ Monetary values │
│ PERCENT│ Percentages │
│ PRODUCT│ Objects, vehicles, foods │
│ EVENT │ Named events │
└────────┴──────────────────────────────────────┘
4. Dependency Parsing: Understanding Sentence Structure
Dependency parsing reveals the grammatical relationships between words.
"The cat sat on the mat"
sat (ROOT)
/ \
cat on
/ \
The mat
/
the
Relations:
- "cat" is the nominal subject (nsubj) of "sat"
- "on" is a prepositional modifier (prep) of "sat"
- "mat" is the object of preposition (pobj)
- "The/the" are determiners (det)
5. Word Vectors and Similarity
Words can be represented as dense vectors in a high-dimensional space, where similar words are close together.
Vector Space Visualization (simplified to 2D):
▲
│ ★ king
│ ★ queen
royalty │
│
│ ★ man
│ ★ woman
common │
│
└────────────────────────────────►
male female
The famous relationship:
king - man + woman ≈ queen
Cosine Similarity:
sim(cat, dog) = 0.80 (high - both animals)
sim(cat, car) = 0.15 (low - unrelated)
6. The spaCy Doc Object Model
spaCy processes text into a rich object model:
Doc
│
┌───────────┼───────────┐
│ │ │
Tokens Spans Ents
│ │ │
┌────┴────┐ Sent Named
│ │ │ Spans Entities
text lemma pos
idx dep ent_type
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
doc.text → "Apple is looking at buying U.K. startup for $1 billion"
doc[0].text → "Apple"
doc[0].pos_ → "PROPN"
doc[0].dep_ → "nsubj"
doc[0].ent_type_ → "ORG"
doc.ents → (Apple, U.K., $1 billion)
list(doc.sents) → [Apple is looking at buying U.K. startup for $1 billion]
Concept Summary Table
| Concept | What You Must Internalize |
|---|---|
| Tokenization | How text is split into meaningful units, handling contractions, punctuation, and special cases |
| POS Tagging | Grammatical categories of words and why context matters for ambiguous words |
| Named Entity Recognition | Identifying and classifying real-world entities in unstructured text |
| Dependency Parsing | The tree structure of sentences showing grammatical relationships |
| Lemmatization | Reducing words to their dictionary form while preserving meaning |
| Word Vectors | Dense numerical representations that capture semantic meaning |
| Pipeline Architecture | How spaCy processes text through a series of modular components |
| Training & Fine-tuning | Adapting models to domain-specific data and custom entity types |
Deep Dive Reading by Concept
Primary Resources
| Concept | Resource & Chapter |
|---|---|
| spaCy Fundamentals | “Natural Language Processing with Python and spaCy” by Yuli Vasiliev (No Starch Press) — Ch. 1-3 |
| Tokenization & Preprocessing | Vasiliev Ch. 2: “Working with spaCy” |
| POS Tagging & Morphology | Vasiliev Ch. 4: “Extracting and Using Linguistic Features” |
| NER & Entity Recognition | Vasiliev Ch. 5: “Working with Word Vectors and Semantic Similarity” |
| Dependency Parsing | Vasiliev Ch. 6: “Finding Patterns and Walking the Dependency Tree” |
| Custom Training | spaCy Documentation: “Training Pipelines” |
| Transformers Integration | spaCy Documentation: “spacy-transformers” |
Supplementary Resources
| Topic | Resource |
|---|---|
| NLP Theory | “Speech and Language Processing” by Jurafsky & Martin (free online) — Chapters on POS, NER, Parsing |
| Practical NLP | freeCodeCamp NLP Course by Dr. William Mattingly (YouTube/freeCodeCamp) |
| Deep Learning NLP | “Natural Language Processing with Transformers” by Tunstall, von Werra, Wolf (O’Reilly) |
| spaCy Internals | Explosion AI Blog (explosion.ai/blog) |
Essential Reading Order
Week 1: Foundations
- Vasiliev Ch. 1-2 (spaCy basics)
- spaCy 101 documentation
Week 2: Linguistic Features
- Vasiliev Ch. 4 (POS, morphology)
- Jurafsky Ch. 8 (POS tagging theory)
Week 3: Entities & Structure
- Vasiliev Ch. 5-6 (NER, dependencies)
- freeCodeCamp NER tutorials
Week 4-6: Advanced Topics
- spaCy training documentation
- Transformer integration guides
Quick Start Guide (First 48 Hours)
If you’re feeling overwhelmed, here’s your minimal path:
Day 1 (4 hours):
- Install spaCy and download en_core_web_sm
- Complete Project 1 (Text Tokenizer)
- Read Vasiliev Ch. 1
Day 2 (4 hours):
- Complete Project 2 (POS Tagger)
- Complete Project 3 (Entity Extractor)
- Experiment with your own text samples
After 48 hours, you’ll have working code for tokenization, POS tagging, and NER—the core building blocks for everything else.
Recommended Learning Paths
Path A: Complete Beginner (12 weeks)
Weeks 1-2: Projects 1-3 (Fundamentals)
Weeks 3-4: Projects 4-6 (Analysis)
Weeks 5-6: Projects 7-9 (Applications)
Weeks 7-8: Projects 10-12 (Advanced)
Weeks 9-10: Projects 13-15 (Integration)
Weeks 11-12: Projects 16-18 (Production)
Path B: Python Developer (6 weeks)
Week 1: Projects 1-4 (fast-track fundamentals)
Week 2: Projects 5-7 (core applications)
Week 3: Projects 8-10 (custom training)
Week 4: Projects 11-13 (advanced NLP)
Week 5: Projects 14-16 (transformers)
Week 6: Projects 17-18 (production)
Path C: Data Scientist (4 weeks)
Focus on Projects 1, 3, 5, 7, 10, 11, 14, 16, and 18. Skip detailed linguistics; emphasize ML integration and production.
Project 1: The Smart Text Tokenizer
- Main Programming Language: Python
- Libraries: spaCy
- Difficulty: Beginner
- Time Estimate: 3-4 hours
What you’ll build: A command-line tool that tokenizes text files and outputs detailed token information including text, lemma, POS tag, and entity type. It will handle edge cases like contractions, URLs, emails, and special characters.
Why it teaches the concept: Tokenization is the foundation of all NLP. You’ll understand how spaCy breaks text into meaningful units and why naive splitting by spaces fails for real text.
Core challenges you’ll face:
- Contractions → Understanding why “don’t” becomes [“do”, “n’t”]
- Special tokens → Handling URLs, emails, hashtags, @mentions
- Sentence boundaries → Distinguishing “Dr.” from end-of-sentence periods
- Unicode handling → Processing emojis and non-ASCII characters
Key Concepts:
- Token attributes: text, lemma_, pos_, dep_, ent_type_
- Tokenizer exceptions: spaCy’s special case handling
- Custom tokenization: Adding your own rules
Prerequisites: Python basics, command-line familiarity
Real World Outcome
A CLI tool that processes any text file and outputs structured token analysis:
$ python tokenizer.py --input document.txt --output tokens.json
Processing: document.txt
Total tokens: 1,247
Unique tokens: 489
Sentences: 52
$ python tokenizer.py --text "Dr. Smith's email is john.smith@company.com"
Token Analysis:
┌──────────────────────┬────────┬─────────┬──────────┐
│ Text │ Lemma │ POS │ Entity │
├──────────────────────┼────────┼─────────┼──────────┤
│ Dr. │ Dr. │ PROPN │ │
│ Smith │ Smith │ PROPN │ PERSON │
│ 's │ 's │ PART │ │
│ email │ email │ NOUN │ │
│ is │ be │ AUX │ │
│ john.smith@company.com│ john.smith@company.com│ X │ │
└──────────────────────┴────────┴─────────┴──────────┘
Special Tokens Detected:
- 1 email address
- 1 abbreviation (Dr.)
The Core Question You’re Answering
“How does a computer break human language into discrete, processable units while preserving meaning?”
Humans read “don’t” as a single word with a specific meaning. But for computation, we need to recognize it contains “do” + “not”—two separate semantic units. Tokenization is the bridge between human text and machine understanding.
Concepts You Must Understand First
Self-assessment questions:
- “What’s the difference between a string and a list of strings in Python?”
- “How would you iterate through a file line by line?”
- “What is Unicode and why might ‘café’ have different byte lengths?”
If you can’t answer these, review:
- Vasiliev Ch. 1: “Introducing spaCy”
- Python documentation on strings and file I/O
Questions to Guide Your Design
Input Handling:
- How will you handle files vs. direct text input?
- What encoding should you assume (UTF-8)?
- How large might the input be? Do you need streaming?
Output Format:
- What information should each token include?
- How will you represent the output (JSON, CSV, table)?
- Should you include sentence boundaries?
Edge Cases:
- What happens with empty input?
- How do you handle binary files accidentally passed as input?
- What about extremely long “words” (like URLs)?
Thinking Exercise
Before writing code, manually tokenize this text:
"The U.S. hasn't seen inflation like this since the '70s.
Email me at test@example.com for the full report (PDF, 2.5MB)."
Write down:
- How many tokens do you expect?
- Which tokens are “tricky” and why?
- Where are the sentence boundaries?
Now run it through spaCy and compare. What surprised you?
The Interview Questions They’ll Ask
- “Why doesn’t spaCy just split on whitespace and punctuation?”
- “How does spaCy handle contractions differently than NLTK?”
- “What’s the difference between a token and a word?”
- “How would you customize the tokenizer for a specific domain (e.g., medical abbreviations)?”
- “What’s the computational complexity of tokenization in spaCy?”
Hints in Layers
Hint 1: Getting Started
Start with nlp = spacy.load("en_core_web_sm") and doc = nlp(text). Iterate through doc to access tokens.
Hint 2: Token Attributes
Each token has attributes: token.text, token.lemma_, token.pos_, token.dep_, token.ent_type_. The underscore versions give human-readable strings.
Hint 3: Sentence Segmentation
Use doc.sents to iterate through sentences. Each sentence is a Span object containing tokens.
Hint 4: Output Formatting
Use the tabulate library for nice CLI tables, or json.dumps() with indent=2 for structured output.
Books That Will Help
| Topic | Book & Chapter |
|---|---|
| spaCy Tokenization | Vasiliev Ch. 2: “Getting Started” |
| Tokenization Theory | Jurafsky Ch. 2: “Regular Expressions, Text Normalization” |
| Unicode Handling | “Fluent Python” by Ramalho — Ch. 4: “Text vs. Bytes” |
Common Pitfalls & Debugging
| Problem | Root Cause | Fix |
|---|---|---|
| OSError: Can't find model 'en_core_web_sm' | Model not downloaded | Run python -m spacy download en_core_web_sm |
| Contractions not split | Using wrong model or custom tokenizer | Ensure using standard spaCy model |
| Memory error on large file | Loading entire file at once | Process in chunks using nlp.pipe() |
| Weird characters in output | Encoding mismatch | Ensure UTF-8 encoding when reading files |
Learning Milestones
Milestone 1: Basic Tokenization You can tokenize a sentence and print each token’s text and POS tag.
Milestone 2: File Processing You can process entire documents and output structured JSON with token counts and statistics.
Milestone 3: Edge Case Handling You understand why spaCy makes specific tokenization decisions and can explain them.
Project 2: The Grammar Detective (POS Tagger)
- Main Programming Language: Python
- Libraries: spaCy, matplotlib
- Difficulty: Beginner
- Time Estimate: 4-5 hours
What you’ll build: A tool that analyzes text and visualizes the distribution of parts of speech. It will identify patterns like noun-heavy technical documents vs. verb-heavy narratives, and highlight grammatically interesting constructions.
Why it teaches the concept: POS tagging is the gateway to understanding sentence structure. You’ll see how the same word (“bank”) can be a noun or verb depending on context, and how statistical models resolve this ambiguity.
Core challenges you’ll face:
- Ambiguity resolution → How “flies” is tagged differently in “time flies” vs “fruit flies”
- Fine-grained vs coarse tags → Understanding NN vs NNS vs NNP
- Visualization → Creating meaningful charts of POS distributions
- Comparative analysis → Comparing writing styles across documents
Key Concepts:
- Universal POS tags: NOUN, VERB, ADJ, etc.
- Penn Treebank tags: NN, VB, JJ, etc. (fine-grained)
- Morphological features: Number, tense, case
Prerequisites: Project 1 completed
Real World Outcome
A CLI tool that produces POS analysis and visualizations:
$ python pos_analyzer.py --input hemingway.txt
POS Distribution for: hemingway.txt
┌──────────┬───────┬────────────┐
│ POS Tag │ Count │ Percentage │
├──────────┼───────┼────────────┤
│ NOUN │ 2,341 │ 24.8% │
│ VERB │ 1,876 │ 19.9% │
│ DET │ 1,234 │ 13.1% │
│ ADJ │ 892 │ 9.5% │
│ PRON │ 756 │ 8.0% │
│ ADP │ 721 │ 7.6% │
│ ... │ ... │ ... │
└──────────┴───────┴────────────┘
Writing Style Analysis:
- Noun-to-verb ratio: 1.25 (narrative style)
- Adjective density: 9.5% (sparse description)
- Average sentence length: 12.3 words
Saved visualization: hemingway_pos_distribution.png
$ python pos_analyzer.py --compare hemingway.txt faulkner.txt
Comparative Analysis:
┌────────────┬───────────┬───────────┐
│ Feature │ Hemingway │ Faulkner │
├────────────┼───────────┼───────────┤
│ Adj/Noun │ 0.38 │ 0.67 │
│ Avg Sent │ 12.3 │ 34.7 │
│ Passive % │ 8.2% │ 18.4% │
└────────────┴───────────┴───────────┘
The Core Question You’re Answering
“How does context determine the grammatical role of a word?”
The word “book” is a noun in “I read a book” but a verb in “Please book the flight.” Statistical POS taggers learn these patterns from massive amounts of annotated text, achieving 97%+ accuracy on standard English.
Thinking Exercise
Consider the sentence: “The old man the boats.”
- On first reading, what POS tags would you assign?
- Re-read it. What’s the actual meaning? (Hint: it’s a complete sentence)
- How do the POS tags change with the correct interpretation?
This is a “garden path sentence”—it tricks readers into the wrong parse. Understanding such cases reveals how sophisticated language processing must be.
The Interview Questions They’ll Ask
- “What’s the difference between POS tagging and parsing?”
- “How does spaCy’s tagger handle out-of-vocabulary words?”
- “Why might a POS tagger fail on social media text?”
- “Explain the difference between fine-grained and universal POS tags.”
- “How would you evaluate POS tagger accuracy?”
Hints in Layers
Hint 1: Basic Access
Use token.pos_ for universal tags (NOUN, VERB) and token.tag_ for fine-grained Penn Treebank tags (NN, VBZ).
Hint 2: Counting
Use collections.Counter to count POS frequencies: Counter(token.pos_ for token in doc)
Hint 3: Visualization
Use matplotlib’s bar() function to create distribution charts. Sort by frequency for readability.
Hint 4: Morphology
Access token.morph for detailed features like Number=Plur|Person=3.
Common Pitfalls & Debugging
| Problem | Root Cause | Fix |
|---|---|---|
| Wrong tags for proper nouns | Case sensitivity | spaCy handles this automatically; check your input |
| “X” tags everywhere | Non-standard text | Common for URLs, emojis; filter or handle separately |
| Inconsistent tags across runs | Different model or spaCy versions | Pin model and spaCy versions; inference itself is deterministic |
Project 3: The Entity Extractor (NER System)
- Main Programming Language: Python
- Libraries: spaCy, displacy
- Difficulty: Beginner-Intermediate
- Time Estimate: 5-6 hours
What you’ll build: A named entity extraction system that identifies people, organizations, locations, dates, and monetary values in text. It will output structured data suitable for knowledge graphs and visualize entities in HTML.
Why it teaches the concept: NER is one of the most practically valuable NLP tasks. You’ll understand how statistical models identify entity boundaries and classify them, and why context is crucial (“Apple” the company vs. “apple” the fruit).
Core challenges you’ll face:
- Entity boundary detection → Where does “New York City” start and end?
- Entity classification → Is “Amazon” a company, river, or mythological group?
- Nested entities → “Bank of America” contains both ORG and GPE
- Visualization → Creating readable entity highlighting
Key Concepts:
- Entity types: PERSON, ORG, GPE, DATE, MONEY, etc.
- BIO tagging: Begin-Inside-Outside scheme for entity boundaries
- Entity linking: Connecting mentions to knowledge bases
Prerequisites: Projects 1-2 completed
Real World Outcome
$ python entity_extractor.py --input news_article.txt
Named Entities Found:
┌──────────────────────┬─────────┬───────────────────────────────┐
│ Entity │ Type │ Context │
├──────────────────────┼─────────┼───────────────────────────────┤
│ Tim Cook │ PERSON │ "...CEO Tim Cook announced..."│
│ Apple │ ORG │ "Apple is launching..." │
│ Cupertino │ GPE │ "...headquarters in Cupertino"│
│ September 15th │ DATE │ "...available September 15th" │
│ $999 │ MONEY │ "...starting at $999..." │
└──────────────────────┴─────────┴───────────────────────────────┘
Entity Statistics:
- PERSON: 3
- ORG: 7
- GPE: 4
- DATE: 2
- MONEY: 3
Output saved: entities.json
Visualization: entities.html (open in browser)
$ cat entities.json
{
"entities": [
{"text": "Tim Cook", "type": "PERSON", "start": 45, "end": 53},
{"text": "Apple", "type": "ORG", "start": 0, "end": 5},
...
]
}
The Core Question You’re Answering
“How does a machine identify and classify named entities in unstructured text?”
Humans effortlessly recognize “Barack Obama” as a person and “Paris” as a place. But teaching machines this requires understanding that entity recognition is fundamentally a sequence labeling problem—determining both boundaries and categories.
Thinking Exercise
Consider this text:
"The Washington Post reported that Washington denied the claims.
George Washington would have been surprised."
Identify:
- Which “Washington” is a newspaper (ORG)?
- Which is a city/government (GPE)?
- Which is a person (PERSON)?
What clues help you distinguish them? How might a machine learn these patterns?
The Interview Questions They’ll Ask
- “What’s the difference between NER and entity linking?”
- “How does spaCy handle overlapping entities?”
- “What is the BIO tagging scheme and why is it used?”
- “How would you handle NER for a language without spaces between words (like Chinese)?”
- “What’s precision vs. recall in the context of NER evaluation?”
Hints in Layers
Hint 1: Accessing Entities
Use doc.ents to get all entities. Each entity has .text, .label_, .start_char, .end_char.
Hint 2: Visualization
Use displacy.render(doc, style="ent") to generate HTML visualization. Add jupyter=True in notebooks.
Hint 3: Structured Output
Convert entities to dictionaries: [{"text": e.text, "type": e.label_, "start": e.start_char, "end": e.end_char} for e in doc.ents]
Hint 4: Entity Spans
Entities are Span objects. Access tokens within: for token in entity: print(token)
Project 4: The Dependency Visualizer
- Main Programming Language: Python
- Libraries: spaCy, displacy, networkx
- Difficulty: Intermediate
- Time Estimate: 6-7 hours
What you’ll build: A tool that parses sentences and visualizes their grammatical structure as dependency trees. It will identify subjects, objects, modifiers, and help users understand how words relate to each other.
Why it teaches the concept: Dependency parsing reveals the deep structure of language. You’ll understand why “The dog bit the man” and “The man bit the dog” have identical words but opposite meanings, and how machines understand these relationships.
Core challenges you’ll face:
- Tree visualization → Rendering complex trees readably
- Relation extraction → Finding subject-verb-object triples
- Complex sentences → Handling coordination, subordination
- Question answering → Using dependencies to answer “who did what to whom”
Key Concepts:
- Dependency relations: nsubj, dobj, prep, pobj, etc.
- Head-dependent relationships: Each word has one head
- Projective parsing: Why crossing dependencies are problematic
Prerequisites: Projects 1-3 completed
Real World Outcome
$ python dep_visualizer.py --sentence "The quick brown fox jumps over the lazy dog"
Dependency Parse:
┌────────┬─────────┬────────┬────────────┐
│ Token │ Dep Rel │ Head │ Children │
├────────┼─────────┼────────┼────────────┤
│ The │ det │ fox │ [] │
│ quick │ amod │ fox │ [] │
│ brown │ amod │ fox │ [] │
│ fox │ nsubj │ jumps │ [The,quick,brown] │
│ jumps │ ROOT │ jumps │ [fox,over] │
│ over │ prep │ jumps │ [dog] │
│ the │ det │ dog │ [] │
│ lazy │ amod │ dog │ [] │
│ dog │ pobj │ over │ [the,lazy] │
└────────┴─────────┴────────┴────────────┘
Subject-Verb-Object Triples:
- fox → jumps → (over dog)
Visualization saved: dependency_tree.svg
HTML visualization: dependency_parse.html
$ python dep_visualizer.py --question "Who jumps over what?"
Answer: "fox jumps over dog"
The Core Question You’re Answering
“How do we represent the grammatical structure of a sentence in a way machines can reason about?”
Unlike humans who process language holistically, machines need explicit structure. Dependency parsing provides a graph representation where we can traverse from any word to understand its role and relationships.
Thinking Exercise
Parse this sentence manually:
"The cat that I saw yesterday caught a mouse"
- What is the root verb?
- What is the subject of “caught”?
- What is the subject of “saw”?
- Draw the dependency tree
Now verify with spaCy. Pay attention to how relative clauses are handled.
The Interview Questions They’ll Ask
- “What’s the difference between constituency parsing and dependency parsing?”
- “How would you extract all subject-verb-object triples from a document?”
- “What is a projective dependency tree?”
- “How does spaCy’s parser handle ambiguity?”
- “What’s the time complexity of transition-based dependency parsing?”
Hints in Layers
Hint 1: Basic Navigation
Each token has token.head (its parent) and token.children (dependents). The root has token.head == token.
Hint 2: Relation Labels
token.dep_ gives the dependency relation. Key relations: nsubj (subject), dobj (direct object), prep (preposition).
Hint 3: Tree Traversal
Write a recursive function: def traverse(token, depth=0) that prints token and calls itself on children.
Hint 4: Extracting Triples
Find verbs (token.pos_ == "VERB"), then look for nsubj and dobj children to form (subject, verb, object) triples.
Project 5: The Lemmatizer vs Stemmer Showdown
- Main Programming Language: Python
- Libraries: spaCy, NLTK (for comparison)
- Difficulty: Intermediate
- Time Estimate: 4-5 hours
What you’ll build: A comparative tool that applies both lemmatization (spaCy) and stemming (Porter/Snowball) to text, showing the differences and when each is appropriate.
Why it teaches the concept: Both reduce words to base forms, but lemmatization uses vocabulary and morphology while stemming uses rules. Understanding the tradeoffs is crucial for information retrieval and text normalization.
Core challenges you’ll face:
- Lemma vs stem differences → “better” → “good” (lemma) vs “better” (stem)
- Context-dependent lemmatization → “meeting” as noun vs verb
- Performance comparison → Stemming is faster but less accurate
- Use case analysis → When to use which approach
Key Concepts:
- Lemmatization: Dictionary-based reduction to base form
- Stemming: Rule-based suffix stripping
- Morphological analysis: Understanding word forms
Prerequisites: Projects 1-2 completed
Real World Outcome
$ python lemma_vs_stem.py --input text.txt
Lemmatization vs Stemming Comparison
┌────────────┬────────────┬───────────┬──────────────┐
│ Original │ Lemma │ Stem │ Difference? │
├────────────┼────────────┼───────────┼──────────────┤
│ running │ run │ run │ No │
│ better │ good │ better │ Yes │
│ mice │ mouse │ mice │ Yes │
│ studies │ study │ studi │ Yes │
│ meeting │ meeting/meet│ meet │ Context-dep │
│ happily │ happily │ happili │ Yes │
└────────────┴────────────┴───────────┴──────────────┘
Statistics:
- Words where lemma ≠ stem: 34.2%
- Lemmatization time: 0.023s
- Stemming time: 0.008s
Recommendation for your text:
→ Use lemmatization: High accuracy needs, linguistic applications
→ Use stemming: Speed-critical search indexing
The Core Question You’re Answering
“How do we normalize word variations while preserving linguistic meaning?”
“Run”, “runs”, “running”, “ran” are all forms of the same concept. But how do we know “better” relates to “good”? Lemmatization uses linguistic knowledge while stemming uses patterns. Understanding both reveals fundamental NLP tradeoffs.
Thinking Exercise
For each word, predict the lemma and stem:
- “geese” → lemma: ? stem: ?
- “unable” → lemma: ? stem: ?
- “saw” (past tense of “see”) → lemma: ?
- “saw” (cutting tool) → lemma: ?
What does this tell you about the limitations of stemming?
The Interview Questions They’ll Ask
- “When would you use stemming over lemmatization?”
- “How does spaCy determine the correct lemma for ambiguous words?”
- “What’s the Porter Stemmer algorithm?”
- “Can you lemmatize without POS information? What are the tradeoffs?”
- “How would you handle lemmatization for a low-resource language?”
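A toy sketch makes the tradeoff concrete. The lookup table and suffix rules below are illustrative stand-ins for spaCy's dictionary-based lemmatizer and a Porter-style stemmer, not real linguistic resources:

```python
# Dictionary-based lemmatization vs. rule-based stemming (toy versions).
LEMMA_TABLE = {  # lookup needs linguistic knowledge: "better" -> "good"
    "better": "good",
    "mice": "mouse",
    "studies": "study",
    "running": "run",
}

SUFFIX_RULES = [("ies", "i"), ("ing", ""), ("ly", "li"), ("s", "")]

def lemmatize(word: str) -> str:
    """Dictionary-based: falls back to the word itself when unknown."""
    return LEMMA_TABLE.get(word, word)

def stem(word: str) -> str:
    """Rule-based: strips the first matching suffix, no vocabulary needed."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[: -len(suffix)] + replacement
            if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiouls":
                word = word[:-1]  # undouble: "runn" -> "run"
            return word
    return word

for w in ["running", "better", "studies", "mice", "happily"]:
    print(f"{w:10} lemma={lemmatize(w):10} stem={stem(w)}")
```

Note that the stemmer has no idea "better" relates to "good", and happily produces non-words like "studi": exactly the tradeoff in the table above.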
Project 6: The Sentiment Analyzer
- Main Programming Language: Python
- Libraries: spaCy, spacy-textblob or custom classifier
- Difficulty: Intermediate
- Time Estimate: 6-8 hours
What you’ll build: A sentiment analysis system that classifies text as positive, negative, or neutral, with confidence scores and explanations of which words/phrases drive the sentiment.
Why it teaches the concept: Sentiment analysis is one of the most commercially valuable NLP applications. You’ll understand both lexicon-based approaches and how to train custom classifiers for domain-specific sentiment.
Core challenges you’ll face:
- Negation handling → “not bad” is positive, not negative
- Sarcasm and irony → “Oh great, another meeting” is negative
- Domain adaptation → “sick” is negative in healthcare, positive in slang
- Aspect-based sentiment → “Great food but terrible service”
Key Concepts:
- Lexicon-based sentiment: Using word lists with polarity scores
- Machine learning approaches: Training classifiers on labeled data
- Sentiment intensity: Beyond binary positive/negative
Prerequisites: Projects 1-4 completed
Real World Outcome
$ python sentiment_analyzer.py --text "The movie was absolutely fantastic! Great acting and storyline."
Sentiment Analysis Results:
┌───────────────────────────────────────────────────────────┐
│ Overall Sentiment: POSITIVE (Confidence: 0.92)            │
├───────────────────────────────────────────────────────────┤
│ Polarity Score: +0.85 (range: -1.0 to +1.0)               │
│ Subjectivity: 0.78 (range: 0.0 to 1.0)                    │
└───────────────────────────────────────────────────────────┘
Word-Level Sentiment:
┌───────────────┬───────────┬───────────┐
│ Word          │ Polarity  │ Intensity │
├───────────────┼───────────┼───────────┤
│ fantastic     │ +0.9      │ strong    │
│ great         │ +0.8      │ moderate  │
└───────────────┴───────────┴───────────┘
$ python sentiment_analyzer.py --file reviews.csv --aspect-based
Aspect-Based Sentiment (100 reviews):
┌──────────────┬───────────┬───────────┬───────────┐
│ Aspect       │ Positive  │ Negative  │ Neutral   │
├──────────────┼───────────┼───────────┼───────────┤
│ food         │ 78        │ 12        │ 10        │
│ service      │ 45        │ 42        │ 13        │
│ price        │ 23        │ 55        │ 22        │
│ ambiance     │ 67        │ 18        │ 15        │
└──────────────┴───────────┴───────────┴───────────┘
The Core Question You’re Answering
“How do we automatically determine the emotional tone of text?”
Human language is nuanced—“I don’t hate it” technically has positive sentiment but weak enthusiasm. Understanding how machines approximate human judgment of tone reveals both the power and limitations of NLP.
Thinking Exercise
Rate these sentences from -1 (very negative) to +1 (very positive):
- “This is the best pizza I’ve ever had!”
- “The pizza was okay.”
- “I’ve had better pizza from a frozen box.”
- “Not the worst pizza I’ve eaten.”
Now imagine writing rules to handle all these cases. What patterns emerge? What makes it hard?
The Interview Questions They’ll Ask
- “How would you handle negation in sentiment analysis?”
- “What’s the difference between document-level and aspect-based sentiment?”
- “How would you build a sentiment classifier for a new domain with limited labeled data?”
- “What are the limitations of lexicon-based sentiment analysis?”
- “How do you handle sarcasm and irony in sentiment analysis?”
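A minimal lexicon-based scorer with a one-token negation window shows why "not bad" comes out positive. The polarity word list here is a tiny illustrative sample, nothing like a real lexicon such as VADER or SentiWordNet:

```python
# Lexicon-based sentiment with naive negation handling (toy version).
LEXICON = {
    "fantastic": 0.9, "great": 0.8, "good": 0.6,
    "bad": -0.7, "terrible": -0.9, "awful": -0.8,
}
NEGATORS = {"not", "never", "no", "n't"}

def sentiment(text: str) -> float:
    """Average polarity of lexicon words; a preceding negator flips
    (and dampens) the polarity, so "not bad" scores mildly positive."""
    tokens = [t.strip(".,!?") for t in text.lower().replace("n't", " n't").split()]
    score, hits = 0.0, 0
    for i, tok in enumerate(tokens):
        polarity = LEXICON.get(tok)
        if polarity is None:
            continue
        if i > 0 and tokens[i - 1] in NEGATORS:
            polarity = -polarity * 0.5
        score += polarity
        hits += 1
    return score / hits if hits else 0.0

print(sentiment("The movie was fantastic and the acting was great"))  # positive
print(sentiment("not bad at all"))  # positive despite "bad"
```

The one-token window is exactly where this approach breaks down: "not very good" and sarcasm both defeat it, which is why trained classifiers exist.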
Project 7: The Text Classifier
- Main Programming Language: Python
- Libraries: spaCy, scikit-learn
- Difficulty: Intermediate
- Time Estimate: 8-10 hours
What you’ll build: A multi-class text classification system that categorizes documents into predefined categories (e.g., news topics, support tickets, email types). You’ll train custom models and evaluate their performance.
Why it teaches the concept: Text classification is fundamental to applications like spam filtering, content moderation, and routing. You’ll understand feature extraction, model training, and evaluation metrics in NLP.
Core challenges you’ll face:
- Feature engineering → Converting text to numerical features
- Class imbalance → Handling categories with few examples
- Evaluation → Precision, recall, F1 for multi-class problems
- Model selection → Comparing approaches (Naive Bayes, SVM, neural)
Key Concepts:
- Text vectorization: TF-IDF, word embeddings
- spaCy TextCategorizer: Built-in classification component
- Cross-validation: Reliable performance estimation
Prerequisites: Projects 1-6 completed, basic ML knowledge helpful
Real World Outcome
$ python text_classifier.py train --data training_data.json --model news_classifier
Training text classifier...
Categories: ['politics', 'sports', 'technology', 'entertainment', 'business']
Training samples: 10,000
Validation samples: 2,000
Training Progress:
Epoch 1: Loss=2.34, Val Accuracy=0.72
Epoch 5: Loss=0.89, Val Accuracy=0.87
Epoch 10: Loss=0.45, Val Accuracy=0.91
Model saved: news_classifier/
$ python text_classifier.py predict --model news_classifier --text "Apple announces new iPhone with revolutionary camera"
Prediction Results:
┌─────────────────┬────────────┐
│ Category        │ Confidence │
├─────────────────┼────────────┤
│ technology      │ 0.89       │
│ business        │ 0.08       │
│ entertainment   │ 0.02       │
│ politics        │ 0.01       │
│ sports          │ 0.00       │
└─────────────────┴────────────┘
$ python text_classifier.py evaluate --model news_classifier --data test_data.json
Classification Report:
               precision    recall  f1-score   support
     politics       0.92      0.89      0.90       412
       sports       0.95      0.97      0.96       389
   technology       0.89      0.91      0.90       445
entertainment       0.87      0.84      0.85       367
     business       0.88      0.90      0.89       387
     accuracy                           0.90      2000
The Core Question You’re Answering
“How do we teach a machine to recognize what a document is ‘about’?”
Unlike keyword matching, classification must understand that “Apple’s stock price surged” is about business, not fruit, even though “apple” appears. The model must learn semantic patterns from examples.
Thinking Exercise
You’re building a support ticket classifier with these categories:
- Billing
- Technical Issue
- Feature Request
- Account Access
Consider this ticket: “I can’t log in to pay my bill”
- Which category does it belong to?
- What features would help classify it?
- How would you handle tickets that span multiple categories?
The Interview Questions They’ll Ask
- “What’s the difference between text classification and named entity recognition?”
- “How do you handle class imbalance in text classification?”
- “Explain TF-IDF and why it’s useful for text classification.”
- “What’s the cold start problem in classification and how would you address it?”
- “How would you deploy a text classifier to handle 10,000 requests per second?”
Hints in Layers
Hint 1: Data Format
spaCy expects training data as: [("text", {"cats": {"category1": 1.0, "category2": 0.0}})]
Hint 2: TextCategorizer Component
Add to pipeline: textcat = nlp.add_pipe("textcat") then textcat.add_label("category_name")
Hint 3: Training Loop
Use nlp.update() with batches of examples. Track loss to monitor convergence.
Hint 4: Evaluation
Use scikit-learn’s classification_report() for detailed precision/recall/F1.
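The four hints combine into a short training loop. This is a sketch on two toy examples, assuming spaCy v3's `textcat` component, `Example.from_dict`, and `nlp.initialize()`; a real run would use thousands of examples and held-out validation:

```python
import spacy
from spacy.training import Example

# Hint 2: add the TextCategorizer and its labels to a blank pipeline.
nlp = spacy.blank("en")
textcat = nlp.add_pipe("textcat")
for label in ("technology", "sports"):
    textcat.add_label(label)

# Hint 1: training data as (text, {"cats": {...}}) pairs.
TRAIN_DATA = [
    ("Apple unveils a new laptop chip",
     {"cats": {"technology": 1.0, "sports": 0.0}}),
    ("The team won the championship game",
     {"cats": {"technology": 0.0, "sports": 1.0}}),
]

# Hint 3: update loop; track losses to monitor convergence.
optimizer = nlp.initialize()
for epoch in range(20):
    losses = {}
    for text, annotations in TRAIN_DATA:
        example = Example.from_dict(nlp.make_doc(text), annotations)
        nlp.update([example], sgd=optimizer, losses=losses)

doc = nlp("A new smartphone processor was announced")
print(doc.cats)  # a confidence score per category
```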
Project 8: Custom NER Model Training
- Main Programming Language: Python
- Libraries: spaCy
- Difficulty: Intermediate-Advanced
- Time Estimate: 10-12 hours
What you’ll build: A custom named entity recognition model trained on domain-specific data. You’ll annotate training data, train the model, and evaluate its performance on your specific entity types.
Why it teaches the concept: Real-world NER often requires custom entities not in pre-trained models (product names, medical terms, legal entities). You’ll understand the full training pipeline: annotation, training, evaluation, and iteration.
Core challenges you’ll face:
- Data annotation → Creating high-quality training examples
- Entity boundary consistency → Ensuring consistent spans
- Training configuration → Optimizing hyperparameters
- Model evaluation → Measuring precision, recall, F1
Key Concepts:
- spaCy training config: config.cfg structure
- Training data format: DocBin and JSON formats
- Transfer learning: Starting from pre-trained models
Prerequisites: Project 3 completed, understanding of ML training basics
Real World Outcome
$ python custom_ner.py prepare --input raw_documents/ --output training_data/
Annotation Guidelines:
- PRODUCT: Product names (iPhone, MacBook Pro)
- FEATURE: Feature names (Face ID, Retina Display)
- SPEC: Technical specifications (256GB, A15 chip)
Starting annotation tool on http://localhost:8080
$ python custom_ner.py train --config config.cfg --data training_data/
Training Custom NER Model
Entity types: ['PRODUCT', 'FEATURE', 'SPEC']
Training examples: 500
Validation examples: 100
Training Progress:
Step 0: Loss=25.4, NER P=0.12, R=0.08, F1=0.09
Step 1000: Loss=3.2, NER P=0.78, R=0.72, F1=0.75
Step 2000: Loss=1.1, NER P=0.89, R=0.85, F1=0.87
Best model saved: models/custom_ner_best/
$ python custom_ner.py predict --model models/custom_ner_best --text "The new MacBook Pro with M2 chip and 512GB SSD"
Custom Entities Found:
┌──────────────┬─────────┬───────────┬───────────┐
│ Entity       │ Type    │ Start     │ End       │
├──────────────┼─────────┼───────────┼───────────┤
│ MacBook Pro  │ PRODUCT │ 8         │ 19        │
│ M2 chip      │ SPEC    │ 25        │ 32        │
│ 512GB SSD    │ SPEC    │ 37        │ 47        │
└──────────────┴─────────┴───────────┴───────────┘
The Core Question You’re Answering
“How do we teach a machine to recognize entity types it’s never seen before?”
Pre-trained models know about generic entities, but your domain has specific needs. Training custom NER requires understanding what the model learns (patterns, context) and how to provide sufficient examples for generalization.
Thinking Exercise
You need to train a NER model for legal documents with these entities:
- CASE_NUMBER (e.g., “2024-CV-1234”)
- PARTY_NAME (e.g., “Smith v. Jones”)
- LEGAL_TERM (e.g., “habeas corpus”, “prima facie”)
For each entity type:
- How many training examples would you need?
- What patterns would the model learn?
- What edge cases might be difficult?
The Interview Questions They’ll Ask
- “How much training data do you need for custom NER?”
- “What’s the difference between fine-tuning and training from scratch?”
- “How do you handle entities that overlap or nest?”
- “What strategies help when you have very few labeled examples?”
- “How do you evaluate NER model performance in production?”
Hints in Layers
Hint 1: Data Format
Use spaCy’s DocBin format for training: each example needs text and entity spans (start, end, label).
Hint 2: Config File
Start with python -m spacy init config config.cfg --lang en --pipeline ner and customize.
Hint 3: Training Command
Use python -m spacy train config.cfg --output ./models --paths.train ./train.spacy --paths.dev ./dev.spacy
Hint 4: Annotation Tools
Consider using Prodigy (commercial) or doccano (open source) for efficient annotation.
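Hint 1 in code: building one annotated example in DocBin form. The text and the PRODUCT label match the demo output above; `char_span` returning `None` is the usual symptom of entity offsets that don't align with token boundaries:

```python
import spacy
from spacy.tokens import DocBin

# One training example for a custom NER model, stored as a DocBin.
nlp = spacy.blank("en")
doc = nlp.make_doc("The new MacBook Pro with M2 chip")

# Character offsets (start, end, label); char_span validates alignment.
span = doc.char_span(8, 19, label="PRODUCT")  # "MacBook Pro"
assert span is not None, "span does not align with token boundaries"
doc.ents = [span]

db = DocBin()
db.add(doc)
data = db.to_bytes()  # or db.to_disk("train.spacy") for the training CLI
```

Checking the `char_span` result on every annotation catches off-by-one offsets before they silently disappear from the training set.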
Project 9: The Similarity Engine
- Main Programming Language: Python
- Libraries: spaCy, numpy
- Difficulty: Intermediate
- Time Estimate: 6-8 hours
What you’ll build: A document similarity system that finds semantically related documents using word vectors. It will support queries like “find documents similar to this one” and power applications like related article recommendations.
Why it teaches the concept: Understanding word vectors (embeddings) is fundamental to modern NLP. You’ll learn how meaning is captured in high-dimensional space and how similarity computations work.
Core challenges you’ll face:
- Vector aggregation → Combining word vectors into document vectors
- Similarity metrics → Cosine similarity vs. Euclidean distance
- Efficient search → Handling large document collections
- Out-of-vocabulary words → What happens when a word has no vector?
Key Concepts:
- Word embeddings: Dense vector representations
- Document vectors: Aggregating word vectors
- Cosine similarity: Measuring vector angles
Prerequisites: Projects 1-3 completed, basic linear algebra helpful
Real World Outcome
$ python similarity_engine.py index --documents articles/
Indexing 10,000 documents...
Building document vectors...
Index saved: document_index.pkl
$ python similarity_engine.py search --query "machine learning applications in healthcare"
Most Similar Documents:
┌─────┬───────────────────────────────────────────┬────────────┐
│ Rank│ Document                                  │ Similarity │
├─────┼───────────────────────────────────────────┼────────────┤
│ 1   │ AI_Healthcare_Diagnosis.txt               │ 0.89       │
│ 2   │ ML_Medical_Imaging.txt                    │ 0.85       │
│ 3   │ Deep_Learning_Drug_Discovery.txt          │ 0.82       │
│ 4   │ NLP_Clinical_Notes.txt                    │ 0.79       │
│ 5   │ Healthcare_Analytics_Overview.txt         │ 0.76       │
└─────┴───────────────────────────────────────────┴────────────┘
$ python similarity_engine.py similar --document articles/AI_Healthcare_Diagnosis.txt --n 5
Documents similar to: AI_Healthcare_Diagnosis.txt
1. ML_Medical_Imaging.txt (0.91)
2. Radiology_AI_Applications.txt (0.87)
...
The Core Question You’re Answering
“How do we measure the semantic similarity between pieces of text?”
Traditional keyword matching fails when documents use different words for the same concept. Word vectors capture meaning: “doctor” and “physician” are close in vector space even though they share no characters. This enables semantic search.
Thinking Exercise
Consider these document pairs:
- “The cat sat on the mat” vs “A feline rested on the rug”
- “Apple announces new iPhone” vs “Orange introduces new smartphone”
- “Bank robbery downtown” vs “Financial institution branch opening”
For each pair:
- What would keyword overlap score?
- What would vector similarity score (estimate)?
- Why do they differ?
The Interview Questions They’ll Ask
- “How are word vectors trained (Word2Vec, GloVe)?”
- “What’s the difference between static embeddings and contextual embeddings?”
- “How would you compute document similarity for millions of documents efficiently?”
- “What are the limitations of averaging word vectors?”
- “How would you handle out-of-vocabulary words in a similarity system?”
Hints in Layers
Hint 1: Accessing Vectors
Use token.vector for word vectors and doc.vector for the averaged document vector. Requires en_core_web_md or larger.
Hint 2: Similarity Computation
Use doc1.similarity(doc2) or compute cosine similarity manually: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
Hint 3: Efficient Search
For large collections, consider approximate nearest neighbor libraries like FAISS or Annoy.
Hint 4: Better Document Vectors
Weight by TF-IDF or use sentence transformers for better document representations.
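Hints 1 and 2 reduce to a few lines of numpy. The embedding table below is hand-made for illustration (a real pipeline would read `token.vector` from `en_core_web_md`), but the averaging and cosine math are identical:

```python
import numpy as np

# Toy 3-dimensional word embeddings; real vectors have 300+ dimensions.
EMBEDDINGS = {
    "doctor":    np.array([0.9, 0.1, 0.0]),
    "physician": np.array([0.85, 0.15, 0.05]),  # deliberately near "doctor"
    "banana":    np.array([0.0, 0.2, 0.9]),
    "visited":   np.array([0.3, 0.7, 0.1]),
    "ate":       np.array([0.2, 0.6, 0.3]),
}

def doc_vector(tokens):
    """Average the vectors of in-vocabulary tokens; OOV words are skipped."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

def cosine(a, b):
    """Cosine similarity: the angle between vectors, not their length."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

d1 = doc_vector("i visited the doctor".split())
d2 = doc_vector("she ate a banana".split())
d3 = doc_vector("the physician visited".split())
print(cosine(d1, d3))  # high: doctor/physician are close in vector space
print(cosine(d1, d2))  # lower: different topic
```

Note how "doctor" and "physician" score as similar despite sharing no characters, which is what keyword matching can never do.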
Project 10: The Keyword Extractor
- Main Programming Language: Python
- Libraries: spaCy, pytextrank or custom
- Difficulty: Intermediate
- Time Estimate: 5-7 hours
What you’ll build: An automatic keyword and keyphrase extraction system using multiple algorithms (TextRank, TF-IDF, RAKE). It will identify the most important terms in documents for tagging, summarization, and SEO.
Why it teaches the concept: Keyword extraction combines multiple NLP techniques: POS tagging (to identify candidates), graph algorithms (TextRank), and statistical measures (TF-IDF). It’s a practical application with many use cases.
Core challenges you’ll face:
- Candidate selection → Which phrases could be keywords?
- Ranking algorithms → TextRank vs. TF-IDF vs. RAKE
- Phrase extraction → Multi-word keywords vs. single words
- Domain adaptation → Different domains have different important terms
Key Concepts:
- TextRank algorithm: Graph-based ranking
- TF-IDF for keywords: Term importance in corpus
- RAKE: Rapid Automatic Keyword Extraction
Prerequisites: Projects 1-4 completed
Real World Outcome
$ python keyword_extractor.py --input research_paper.txt --method all
Keyword Extraction Results for: research_paper.txt
TextRank Keywords:
┌─────┬──────────────────────────┬───────────┐
│ Rank│ Keyword/Phrase           │ Score     │
├─────┼──────────────────────────┼───────────┤
│ 1   │ neural network           │ 0.089     │
│ 2   │ deep learning            │ 0.076     │
│ 3   │ attention mechanism      │ 0.068     │
│ 4   │ transformer architecture │ 0.054     │
│ 5   │ language model           │ 0.051     │
└─────┴──────────────────────────┴───────────┘
TF-IDF Keywords:
1. transformer (0.234)
2. attention (0.198)
3. embedding (0.167)
...
RAKE Keywords:
1. self-attention mechanism (score: 12.5)
2. pre-trained language model (score: 11.2)
...
Agreement Analysis:
- Keywords appearing in all methods: neural network, deep learning, transformer
- Method correlation: TextRank-TFIDF=0.78, TextRank-RAKE=0.65
The Core Question You’re Answering
“How do we automatically identify the most important concepts in a document?”
Humans can quickly identify what a document is “about.” Teaching machines this requires understanding both statistical measures (term frequency) and structural patterns (word co-occurrence graphs).
Thinking Exercise
Given this abstract:
"Deep learning has revolutionized natural language processing. Transformer models
like BERT and GPT have achieved state-of-the-art results on many benchmarks.
The attention mechanism allows these models to capture long-range dependencies."
- What keywords would you extract?
- Which words are too common to be keywords (stop words)?
- Which multi-word phrases are more informative than single words?
The Interview Questions They’ll Ask
- “Explain the TextRank algorithm for keyword extraction.”
- “How would you handle keyword extraction for very short documents?”
- “What’s the difference between extractive and abstractive keyword generation?”
- “How would you evaluate keyword extraction quality?”
- “How do you handle synonyms and related terms in keyword extraction?”
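Of the three methods, TF-IDF is the quickest to sketch. This toy version over a three-document corpus uses deliberately naive tokenization, a minimal stop list, and a smoothed idf; a real extractor would use proper tokenization and candidate phrases, not just single words:

```python
import math
from collections import Counter

STOP = {"the", "a", "of", "and", "to", "on", "in", "is", "has"}

def tokenize(text):
    return [w.strip(".,") for w in text.lower().split()
            if w.strip(".,") not in STOP]

corpus = [
    "deep learning has revolutionized language processing",
    "the transformer architecture relies on attention",
    "attention allows models to capture long-range dependencies",
]

def tfidf_keywords(doc_index, corpus, top_n=3):
    """Score terms of one document: frequent here, rare elsewhere wins."""
    docs = [tokenize(d) for d in corpus]
    tf = Counter(docs[doc_index])
    n_docs = len(docs)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for d in docs if term in d)       # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1  # smoothed idf
        scores[term] = (count / len(docs[doc_index])) * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(tfidf_keywords(1, corpus))
```

"attention" appears in two of the three documents, so its idf (and rank) drops: TF-IDF rewards terms that are distinctive for a document, not merely frequent.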
Project 11: The Text Summarizer
- Main Programming Language: Python
- Libraries: spaCy, sumy or transformers
- Difficulty: Advanced
- Time Estimate: 10-12 hours
What you’ll build: A text summarization system implementing both extractive (selecting important sentences) and abstractive (generating new text) approaches. It will create concise summaries of long documents while preserving key information.
Why it teaches the concept: Summarization is one of the hardest NLP tasks because it requires understanding both content and structure. You’ll learn how to identify important information and either select or generate summary text.
Core challenges you’ll face:
- Importance scoring → Which sentences matter most?
- Coherence → Making summaries read naturally
- Compression ratio → How much to shrink the original?
- Evaluation → ROUGE scores and human evaluation
Key Concepts:
- Extractive summarization: Selecting existing sentences
- Abstractive summarization: Generating new sentences
- ROUGE metrics: Evaluating summary quality
Prerequisites: Projects 1-10 completed
Real World Outcome
$ python summarizer.py --input long_article.txt --method extractive --ratio 0.2
Original: 2,500 words
Summary: 500 words (20% compression)
Extractive Summary:
───────────────────────────────────────────────────────────────
The rise of artificial intelligence has transformed multiple industries
in the past decade. Healthcare providers now use AI for diagnosis, while
financial institutions deploy it for fraud detection. Despite these
advances, concerns about job displacement and algorithmic bias remain...
───────────────────────────────────────────────────────────────
Sentence Importance Scores:
1. "The rise of artificial intelligence..." (0.92)
2. "Healthcare providers now use AI..." (0.87)
3. "Despite these advances, concerns..." (0.84)
...
$ python summarizer.py --input long_article.txt --method abstractive --max-length 100
Abstractive Summary:
───────────────────────────────────────────────────────────────
AI has revolutionized healthcare and finance over the past decade,
enabling better diagnosis and fraud detection. However, concerns
about job loss and bias persist.
───────────────────────────────────────────────────────────────
The Core Question You’re Answering
“How do we condense information while preserving meaning and readability?”
Humans naturally summarize—we can tell a friend about a movie in one sentence or ten. Teaching machines this requires understanding what makes information “important” and how to express it concisely.
Thinking Exercise
Consider this paragraph:
"The quick brown fox jumps over the lazy dog. The fox was very fast and agile.
The dog, being lazy, didn't move at all. This sentence is about foxes and dogs."
- Which sentence is most important? Why?
- Which sentence is redundant?
- Write a one-sentence extractive summary.
- Write a one-sentence abstractive summary.
The Interview Questions They’ll Ask
- “What’s the difference between extractive and abstractive summarization?”
- “How does ROUGE evaluation work?”
- “What are the challenges of multi-document summarization?”
- “How would you handle summarization for very long documents (100+ pages)?”
- “What are the ethical considerations in automatic summarization?”
Hints in Layers
Hint 1: Extractive Approach
Score sentences by: position (first sentences often important), TF-IDF of terms, presence of named entities, length.
Hint 2: TextRank for Sentences
Build a graph where sentences are nodes, edges are similarity scores. PageRank identifies important sentences.
Hint 3: Abstractive with Transformers
Use transformers library with pipeline("summarization") for quick abstractive summaries.
Hint 4: Evaluation
Use rouge-score library: scorer.score(reference, hypothesis) returns ROUGE-1, ROUGE-2, ROUGE-L.
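Hint 1 as code: a bare-bones extractive scorer that combines average word frequency with a lead-sentence bonus. TF-IDF weighting and entity features, which a real system would add, are omitted for brevity:

```python
from collections import Counter

def summarize(sentences, ratio=0.5):
    """Keep the top `ratio` of sentences by score, in original order."""
    norm = lambda w: w.lower().strip(".,!?")
    freq = Counter(norm(w) for s in sentences for w in s.split())
    scores = []
    for i, sent in enumerate(sentences):
        toks = [norm(w) for w in sent.split()]
        freq_score = sum(freq[t] for t in toks) / max(len(toks), 1)
        position_bonus = 1.0 if i == 0 else 0.0  # lead sentence matters
        scores.append((freq_score + position_bonus, i))
    keep = max(1, int(len(sentences) * ratio))
    # pick the highest-scoring sentences, then restore document order
    chosen = sorted(sorted(scores, reverse=True)[:keep], key=lambda x: x[1])
    return [sentences[i] for _, i in chosen]

sents = [
    "AI has transformed healthcare and finance.",
    "Healthcare providers use AI for diagnosis.",
    "My lunch today was a sandwich.",
]
print(summarize(sents, ratio=0.34))
```

Re-sorting the chosen sentences back into document order is what keeps extractive summaries readable; selection order alone would jumble the narrative.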
Project 12: The Question Answering System
- Main Programming Language: Python
- Libraries: spaCy, transformers
- Difficulty: Advanced
- Time Estimate: 12-15 hours
What you’ll build: An extractive question answering system that finds answers to questions within a given context document. It will identify the exact span of text that answers the question.
Why it teaches the concept: QA is the culmination of many NLP skills—understanding questions, finding relevant passages, and extracting precise answers. It’s the foundation for chatbots, search engines, and virtual assistants.
Core challenges you’ll face:
- Question understanding → What is the question actually asking?
- Passage retrieval → Finding relevant text for the question
- Answer extraction → Identifying the exact answer span
- Confidence scoring → Knowing when the answer isn’t in the text
Key Concepts:
- Extractive QA: Finding answers in existing text
- Generative QA: Creating answers from context
- Reading comprehension: Understanding text to answer questions
Prerequisites: Projects 1-11 completed, transformer basics helpful
Real World Outcome
$ python qa_system.py --context document.txt
Question Answering System Ready
Context loaded: 2,500 words
> What year was the company founded?
Answer: "1998" (Confidence: 0.94)
Context: "...the company was founded in 1998 by two Stanford students..."
> Who is the current CEO?
Answer: "Sundar Pichai" (Confidence: 0.89)
Context: "...Sundar Pichai has served as CEO since 2015..."
> What is the company's market cap?
Answer: "Not found in context" (Confidence: 0.12)
Reason: The document doesn't contain market capitalization information.
> When did the company go public?
Answer: "August 2004" (Confidence: 0.87)
Context: "...Google's IPO in August 2004 raised $1.67 billion..."
Type 'quit' to exit.
The Core Question You’re Answering
“How can a machine find specific answers to natural language questions within text?”
Unlike search (which finds relevant documents), QA finds exact answers. This requires understanding both the question’s intent and the passage’s meaning well enough to identify where the answer lies.
Thinking Exercise
Given this passage:
"Albert Einstein was born in Ulm, Germany on March 14, 1879.
He developed the theory of relativity while working at the Swiss Patent Office.
Einstein received the Nobel Prize in Physics in 1921 for his explanation of
the photoelectric effect, not for relativity as commonly believed."
For each question, identify:
- What type of answer is expected (date, person, location, etc.)?
- Which sentence contains the answer?
- What is the exact answer span?
Questions:
- “Where was Einstein born?”
- “When did Einstein win the Nobel Prize?”
- “Why did Einstein win the Nobel Prize?”
The Interview Questions They’ll Ask
- “What’s the difference between extractive and generative QA?”
- “How do transformer models like BERT approach extractive QA?”
- “What is the SQuAD dataset and why is it important?”
- “How would you handle questions that can’t be answered from the context?”
- “How would you scale QA to millions of documents?”
Hints in Layers
Hint 1: Transformer Pipeline
Use pipeline("question-answering") from transformers for quick implementation.
Hint 2: spaCy Integration
Use spaCy for preprocessing: sentence segmentation, entity recognition to narrow answer candidates.
Hint 3: Confidence Threshold
Set a minimum confidence threshold (e.g., 0.3) below which return “answer not found.”
Hint 4: Multi-passage
For long documents, chunk into passages, run QA on each, take highest confidence answer.
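Hints 3 and 4 as code: chunk the document into overlapping windows, run the QA model on each, and keep the highest-confidence answer. `answer_fn` is a hypothetical stand-in for a real QA callable such as transformers' question-answering pipeline:

```python
def chunk(text, size=100, overlap=20):
    """Split text into overlapping word windows so answers that straddle
    a chunk boundary are not lost."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def best_answer(question, document, answer_fn, threshold=0.3):
    """Run answer_fn(question, passage) per chunk; apply the confidence
    threshold from Hint 3 so unanswerable questions return None."""
    candidates = [answer_fn(question, passage) for passage in chunk(document)]
    top = max(candidates, key=lambda a: a["score"])
    if top["score"] < threshold:
        return {"answer": None, "score": top["score"]}
    return top
```

The overlap size is a tradeoff: larger overlap means more redundant model calls but less risk of splitting an answer span across two chunks.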
Project 13: The Intent Recognition Chatbot
- Main Programming Language: Python
- Libraries: spaCy, custom classifier
- Difficulty: Advanced
- Time Estimate: 15-20 hours
What you’ll build: A task-oriented chatbot that recognizes user intents, extracts entities from queries, and responds appropriately. It will handle multiple intents like booking, querying, and canceling for a specific domain.
Why it teaches the concept: Chatbots combine intent classification, entity extraction, and dialogue management. You’ll learn how production virtual assistants work at a fundamental level.
Core challenges you’ll face:
- Intent classification → Recognizing what the user wants
- Entity extraction → Pulling out relevant parameters
- Slot filling → Getting all required information
- Dialogue management → Maintaining conversation state
Key Concepts:
- Intent recognition: Classifying user goals
- Entity slot filling: Extracting parameters
- Dialogue state tracking: Maintaining conversation context
Prerequisites: Projects 7-8 completed (text classification, custom NER)
Real World Outcome
$ python chatbot.py --domain restaurant_booking
Restaurant Booking Assistant
Type 'quit' to exit
User: I want to book a table for 4 people tomorrow evening
Bot: Great! I can help you book a table.
Intent: book_table (confidence: 0.94)
Extracted:
- party_size: 4
- date: tomorrow
- time: evening
What restaurant would you like to book at?
User: How about Italian place on Main Street
Bot: Looking for Italian restaurants on Main Street...
I found "Luigi's Trattoria" at 123 Main Street.
Confirming: Table for 4 at Luigi's Trattoria
Date: December 15, 2024 (tomorrow)
Time: Evening (7:00 PM default)
Would you like me to complete this reservation?
User: Yes please
Bot: Your reservation is confirmed!
Confirmation #: RT-2024-1215-0042
Is there anything else I can help you with?
User: Can I change it to 6 people?
Bot: Intent: modify_booking (confidence: 0.91)
I'll update your reservation from 4 to 6 people.
Updated successfully!
Session Summary:
- Intents recognized: book_table, confirm, modify_booking
- Entities extracted: party_size (2), date (1), time (1), cuisine (1), location (1)
- Successful completion: Yes
The Core Question You’re Answering
“How do we build a system that understands user requests in natural language and takes appropriate actions?”
Chatbots must bridge human language and system actions. This requires classifying what the user wants (intent), extracting relevant parameters (entities), and managing multi-turn conversations (state).
Thinking Exercise
For a flight booking chatbot, define:
- Intents (5+): What actions can users take?
- book_flight, search_flights, cancel_booking, check_status, …
- Entities (5+): What parameters are needed?
- origin_city, destination_city, departure_date, …
- Required slots for booking: Which entities must be filled?
- Clarification questions: What do you ask when slots are missing?
The Interview Questions They’ll Ask
- “How do you handle out-of-scope queries in a chatbot?”
- “What’s the difference between intent classification and entity extraction?”
- “How do you manage dialogue state across multiple turns?”
- “How would you handle ambiguous user inputs?”
- “What metrics would you use to evaluate chatbot performance?”
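A toy version of the three moving parts (intent classification, slot extraction, dialogue management) shows how they fit together. The keyword sets and regexes are illustrative placeholders; Projects 7 and 8 supply the trained classifier and custom NER that would replace them:

```python
import re

INTENT_KEYWORDS = {
    "book_table": {"book", "reserve", "table"},
    "cancel_booking": {"cancel"},
    "modify_booking": {"change", "update", "modify"},
}

def classify_intent(text):
    """Score intents by keyword overlap; no match means out-of-scope."""
    tokens = set(text.lower().split())
    scores = {intent: len(tokens & kws) for intent, kws in INTENT_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "out_of_scope"

def extract_slots(text):
    """Pull parameters out of the utterance with simple patterns."""
    slots, lowered = {}, text.lower()
    if m := re.search(r"for (\d+) people", lowered):
        slots["party_size"] = int(m.group(1))
    if m := re.search(r"\b(today|tomorrow|tonight)\b", lowered):
        slots["date"] = m.group(1)
    return slots

REQUIRED = ["party_size", "date"]

def next_question(slots):
    """Dialogue management: ask for the first missing required slot."""
    missing = [s for s in REQUIRED if s not in slots]
    return f"What {missing[0].replace('_', ' ')} would you like?" if missing else None

utterance = "I want to book a table for 4 people tomorrow"
intent = classify_intent(utterance)
slots = extract_slots(utterance)
print(intent, slots, next_question(slots))
```

Slot filling drives the conversation: as long as `next_question` returns a prompt, the bot keeps asking; once all required slots are filled it can act.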
Project 14: The Language Detector
- Main Programming Language: Python
- Libraries: spaCy, langdetect or custom
- Difficulty: Intermediate
- Time Estimate: 5-6 hours
What you’ll build: A language detection system that identifies the language of text, handles mixed-language documents, and provides confidence scores for multiple candidate languages.
Why it teaches the concept: Language detection is often the first step in multilingual NLP pipelines. You’ll understand how statistical patterns (character n-grams) can identify language even from short text snippets.
Core challenges you’ll face:
- Short text detection → Identifying language from just a few words
- Similar languages → Distinguishing Spanish from Portuguese
- Mixed-language text → Handling code-switching
- Unknown languages → Graceful handling of unsupported languages
Key Concepts:
- Character n-grams: Statistical language signatures
- Language models: Probability distributions over text
- Multilingual processing: Handling multiple languages
Prerequisites: Projects 1-3 completed
Real World Outcome
$ python lang_detector.py --text "Bonjour, comment allez-vous?"
Language Detection Results:
┌──────────────┬────────────┬────────────┐
│ Language     │ Code       │ Confidence │
├──────────────┼────────────┼────────────┤
│ French       │ fr         │ 0.98       │
│ Italian      │ it         │ 0.01       │
│ Spanish      │ es         │ 0.01       │
└──────────────┴────────────┴────────────┘
$ python lang_detector.py --file multilingual_doc.txt --per-sentence
Per-Sentence Language Detection:
┌─────┬─────────────────────────────────────────┬──────────┬────────────┐
│ Line│ Text                                    │ Language │ Confidence │
├─────┼─────────────────────────────────────────┼──────────┼────────────┤
│ 1   │ Hello, how are you?                     │ en       │ 0.99       │
│ 2   │ Je vais très bien, merci.               │ fr       │ 0.97       │
│ 3   │ Danke schön!                            │ de       │ 0.95       │
│ 4   │ Muchas gracias por tu ayuda.            │ es       │ 0.98       │
└─────┴─────────────────────────────────────────┴──────────┴────────────┘
Language Distribution:
- English: 25%
- French: 25%
- German: 25%
- Spanish: 25%
The Core Question You’re Answering
“How can we automatically determine what language a piece of text is written in?”
Languages have distinctive statistical fingerprints—French has many “eux” and “tion” endings, German has compound words and umlauts. Character-level patterns are surprisingly effective for identification.
Thinking Exercise
Rank these text samples by detection difficulty (1=easiest, 5=hardest):
- “The quick brown fox jumps over the lazy dog” (English)
- “Hello” (English)
- “Das ist gut” (German)
- “Olá” vs “Hola” (Portuguese vs Spanish)
- “I am going al mercado to buy comida” (Code-switching)
What makes some cases harder than others?
The Interview Questions They’ll Ask
- “How does character n-gram language detection work?”
- “What minimum text length is needed for reliable detection?”
- “How would you handle code-switching (multiple languages in one text)?”
- “What’s the difference between language detection and script detection?”
- “How would you train a language detector for low-resource languages?”
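Character-trigram profiles are enough to see the idea. The per-language samples below are tiny illustrative snippets; a real detector builds its profiles from large corpora per language:

```python
from collections import Counter

# Training text per language (toy samples; real profiles use corpora).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "fr": "le renard brun saute par dessus le chien paresseux et le chat",
    "de": "der schnelle braune fuchs springt ueber den faulen hund und die katze",
}

def trigrams(text):
    """Character trigrams, padded so word starts/ends count too."""
    text = f"  {text.lower()}  "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

PROFILES = {lang: trigrams(s) for lang, s in SAMPLES.items()}

def detect(text):
    """Score each language by trigram overlap with its profile,
    normalized to a rough confidence; best guess first."""
    grams = trigrams(text)
    scores = {
        lang: sum(min(count, profile[g]) for g, count in grams.items())
        for lang, profile in PROFILES.items()
    }
    total = sum(scores.values()) or 1
    return sorted(((s / total, lang) for lang, s in scores.items()), reverse=True)

print(detect("the dog and the cat")[0])  # best guess first
```

Even with these miniature profiles, frequent function-word trigrams like "the" or " le" dominate the score, which is why short texts full of content words are the hard case.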
Project 15: Custom Pipeline Components
- Main Programming Language: Python
- Libraries: spaCy
- Difficulty: Advanced
- Time Estimate: 8-10 hours
What you’ll build: Custom spaCy pipeline components that extend the library’s functionality—such as custom entity rulers, sentiment components, or domain-specific analyzers that integrate seamlessly into the processing pipeline.
Why it teaches the concept: Understanding spaCy’s pipeline architecture is essential for production NLP. You’ll learn how to add, modify, and optimize pipeline components for specific use cases.
Core challenges you’ll face:
- Component architecture → Understanding the spaCy component interface
- Extension attributes → Adding custom data to Doc, Span, Token
- Pipeline order → Understanding component dependencies
- Serialization → Saving and loading custom components
Key Concepts:
- Factory pattern: Creating reusable components
- Extension attributes: Doc.set_extension(), Span.set_extension()
- Pipeline configuration: config.cfg for component settings
Prerequisites: Projects 1-8 completed, solid spaCy understanding
Real World Outcome
$ python custom_pipeline.py --demo
Custom Pipeline Components Demo
Loading pipeline with custom components:
- 'sentiment': Adds sentiment score to each sentence
- 'profanity_filter': Flags inappropriate content
- 'acronym_resolver': Expands common acronyms
- 'readability': Calculates Flesch-Kincaid score
Processing: "The CEO announced that ROI exceeded expectations. This is great news!"
Pipeline Output:
┌────────────────────────────────────────────────────────────────────┐
│ Standard spaCy Output: │
│ Entities: CEO (PERSON implied), ROI (not recognized by default) │
│ POS Tags: DET NOUN VERB SCONJ NOUN VERB NOUN PUNCT ... │
├────────────────────────────────────────────────────────────────────┤
│ Custom Extensions: │
│ │
│ doc._.sentiment_score: 0.75 (positive) │
│ doc._.readability_score: 58.2 (fairly easy to read) │
│ doc._.profanity_found: False │
│ doc._.acronyms_found: [('ROI', 'Return on Investment')] │
│ │
│ Per-sentence sentiment: │
│ - Sent 1: "The CEO announced..." (neutral: 0.12) │
│ - Sent 2: "This is great news!" (positive: 0.89) │
└────────────────────────────────────────────────────────────────────┘
$ python -c "import spacy; nlp = spacy.load('en_core_web_sm'); nlp.add_pipe('sentiment'); print(nlp.pipe_names)"
['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner', 'sentiment']
The Core Question You’re Answering
“How do we extend spaCy’s capabilities for domain-specific needs?”
spaCy provides a powerful foundation, but real applications need customization. Understanding the component system lets you add any functionality while maintaining spaCy’s speed and integration.
Thinking Exercise
Design a custom pipeline component for analyzing legal documents:
- What custom extension attributes would you add?
  - doc._.legal_terms: List of legal terminology found
  - doc._.contract_type: Classification of document type
  - …
- What existing components does your component depend on?
- Where in the pipeline should it run (before/after what)?
- How would you handle serialization for model deployment?
The Interview Questions They’ll Ask
- “What’s the difference between a pipeline component and an extension attribute?”
- “How does spaCy’s @Language.component decorator work?”
- “What is the config.cfg system and why is it important?”
- “How would you add a custom component that needs a machine learning model?”
- “What are the performance implications of adding custom components?”
Hints in Layers
Hint 1: Basic Component
from spacy.language import Language

@Language.component("my_component")
def my_component(doc):
    # Inspect or annotate the doc, then return it for the next component
    return doc

nlp.add_pipe("my_component")
Hint 2: Extension Attributes
Doc.set_extension("my_attr", default=None)
doc._.my_attr = computed_value
Hint 3: Factory Pattern
@Language.factory("configurable_component", default_config={"threshold": 0.5})
def create_component(nlp, name, threshold: float):
    return ConfigurableComponent(threshold)
Hint 4: Serialization
Implement to_disk() and from_disk() methods for saving component state.
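Putting the hints together, here is a minimal end-to-end sketch. The component name `char_counter` and attribute `char_count` are illustrative, and a blank pipeline is used so no model download is required:

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc

# Register the custom attribute once, at import time
if not Doc.has_extension("char_count"):
    Doc.set_extension("char_count", default=0)

@Language.component("char_counter")  # illustrative component name
def char_counter(doc):
    # Stateless component: compute a value and stash it on the Doc
    doc._.char_count = len(doc.text)
    return doc

nlp = spacy.blank("en")      # blank pipeline: tokenizer only
nlp.add_pipe("char_counter")

doc = nlp("Hello world")
print(nlp.pipe_names)        # ['char_counter']
print(doc._.char_count)      # 11
```

The same component added to a full `en_core_web_sm` pipeline would run after the built-in components unless you pass `first=True` or `before=`/`after=` to `add_pipe`.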
Project 16: Training spaCy Models from Scratch
- Main Programming Language: Python
- Libraries: spaCy
- Difficulty: Advanced
- Time Estimate: 15-20 hours
What you’ll build: Train spaCy models from scratch for a specific domain or language, including tokenization rules, POS tagging, NER, and dependency parsing. You’ll understand the full model training pipeline.
Why it teaches the concept: While pre-trained models work well for general text, many applications need domain-specific models. Understanding training from scratch reveals how spaCy’s models actually work.
Core challenges you’ll face:
- Training data preparation → Converting annotations to spaCy format
- Config optimization → Tuning hyperparameters for your task
- Evaluation → Measuring model quality and diagnosing issues
- Iteration → Improving models based on error analysis
Key Concepts:
- Training config system: config.cfg structure and settings
- Corpus preparation: Creating training and dev sets
- Model architectures: Understanding network options
Prerequisites: Projects 8, 15 completed, ML training experience
Real World Outcome
$ python train_model.py init-config --lang en --pipeline ner,textcat
Generated: config.cfg
$ python train_model.py prepare-data --input annotated_data.json --output corpus/
Preparing training corpus...
Training examples: 8,000
Development examples: 2,000
Saved: corpus/train.spacy, corpus/dev.spacy
$ python -m spacy train config.cfg --output models/my_model --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy
Training Pipeline: ['ner', 'textcat']
================================================================================
E # LOSS NER LOSS TEXTC ENTS_F CATS_SCORE SCORE
--- ------ -------- ---------- ------ ---------- ------
0 0 25.12 32.45 0.12 0.34 0.23
1 200 12.45 18.23 0.45 0.56 0.50
2 400 8.32 12.11 0.62 0.68 0.65
3 600 5.21 8.45 0.74 0.76 0.75
4 800 3.45 6.23 0.81 0.82 0.81
5 1000 2.12 4.56 0.86 0.87 0.86
================================================================================
Best model saved: models/my_model/model-best
$ python -m spacy evaluate models/my_model/model-best corpus/test.spacy
================================== Results ==================================
TOK 100.00
NER P 86.23
NER R 84.56
NER F 85.38
TEXTCAT (macro F1) 87.12
SPEED 12,543 words/sec
The Core Question You’re Answering
“How do we create NLP models tailored to specific domains and tasks?”
Pre-trained models are trained on general web text. Legal documents, medical records, and scientific papers have different language patterns. Training custom models captures these domain-specific patterns.
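To make the data-preparation step concrete, here is a sketch of converting annotated examples into spaCy's binary training format with `DocBin`. The texts, character offsets, and output file name are illustrative:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # tokenizer only; no trained model required

# Hypothetical annotations: (text, [(start_char, end_char, label), ...])
examples = [
    ("Apple is looking at buying a startup", [(0, 5, "ORG")]),
    ("Tim Cook visited Berlin", [(0, 8, "PERSON"), (17, 23, "GPE")]),
]

db = DocBin()
for text, annotations in examples:
    doc = nlp.make_doc(text)
    spans = [doc.char_span(start, end, label=label)
             for start, end, label in annotations]
    # char_span returns None if offsets don't align with token boundaries
    doc.ents = [s for s in spans if s is not None]
    db.add(doc)

db.to_disk("train.spacy")
print(len(db))  # 2
```

The resulting `train.spacy` file is what `--paths.train` points at in the `spacy train` command shown above.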
Thinking Exercise
You need to build an NLP model for analyzing social media posts about products:
- Data requirements: How many examples do you need for each component?
  - NER: ~500+ examples per entity type
  - Classification: ~1000+ examples per category
- Annotation strategy: How will you create training data efficiently?
- Evaluation plan: How will you know the model is good enough?
- Iteration strategy: How will you improve weak areas?
The Interview Questions They’ll Ask
- “What’s the minimum training data needed for a custom NER model?”
- “How do you choose hyperparameters for spaCy training?”
- “What’s transfer learning and how does spaCy use it?”
- “How would you handle a model that has good precision but poor recall?”
- “What’s the difference between training from scratch vs fine-tuning?”
Project 17: spaCy with Transformers
- Main Programming Language: Python
- Libraries: spaCy, spacy-transformers
- Difficulty: Advanced
- Time Estimate: 12-15 hours
What you’ll build: Integrate transformer models (BERT, RoBERTa) into spaCy pipelines for state-of-the-art accuracy on NER, classification, and other tasks. You’ll understand when transformers are worth the computational cost.
Why it teaches the concept: Transformers have revolutionized NLP accuracy but are computationally expensive. Understanding how to integrate them with spaCy lets you build high-accuracy production systems.
Core challenges you’ll face:
- Model selection → Choosing the right transformer for your task
- Performance tradeoffs → Accuracy vs. speed vs. memory
- Fine-tuning → Adapting pre-trained transformers to your domain
- Deployment → Running transformers in production
Key Concepts:
- Transformer architecture basics: Attention, encoders
- spacy-transformers: Integration layer
- Model distillation: Making transformers production-ready
Prerequisites: Projects 1-16 completed, understanding of deep learning helpful
Real World Outcome
$ python transformer_nlp.py compare --input test_data.txt
Model Comparison: Standard vs Transformer
=========================================
Task: Named Entity Recognition
┌─────────────────┬────────────┬────────────┬────────────┐
│ Model │ F1 Score │ Speed │ Memory │
├─────────────────┼────────────┼────────────┼────────────┤
│ en_core_web_sm │ 0.84 │ 10,500/sec │ 50MB │
│ en_core_web_trf │ 0.92 │ 150/sec │ 1.2GB │
└─────────────────┴────────────┴────────────┴────────────┘
Task: Text Classification
┌─────────────────┬────────────┬────────────┬────────────┐
│ Model │ Accuracy │ Speed │ Memory │
├─────────────────┼────────────┼────────────┼────────────┤
│ en_core_web_sm │ 0.87 │ 8,200/sec │ 50MB │
│ en_core_web_trf │ 0.95 │ 120/sec │ 1.2GB │
└─────────────────┴────────────┴────────────┴────────────┘
Recommendation:
- Use transformer if: Accuracy is critical, batch processing OK, have GPU
- Use standard if: Real-time requirements, limited memory, high volume
$ python transformer_nlp.py fine-tune --base roberta-base --data domain_data/ --output custom_trf/
Fine-tuning RoBERTa for domain-specific NER...
Base model: roberta-base
Training examples: 5,000
Epoch 1: Loss=2.34, F1=0.75
Epoch 2: Loss=1.23, F1=0.85
Epoch 3: Loss=0.67, F1=0.91
Model saved: custom_trf/
Domain-specific F1: 0.91 (vs 0.78 with generic model)
The Core Question You’re Answering
“When and how should we use transformers for NLP tasks?”
Transformers achieve state-of-the-art accuracy but are 50-100x slower than traditional models. Understanding this tradeoff—and how to mitigate it—is essential for production NLP systems.
Thinking Exercise
You’re building an NLP system for processing customer support tickets:
- Volume: 100,000 tickets/day
- Latency requirement: < 200ms response time
- Accuracy need: High (affects customer routing)
Questions:
- Can you use transformers? What constraints?
- What architecture would you design?
- How would you handle the accuracy vs. speed tradeoff?
The Interview Questions They’ll Ask
- “What is attention and why is it important for NLP?”
- “When would you choose a transformer model over a traditional approach?”
- “How do you deploy transformer models in production?”
- “What is model distillation and when would you use it?”
- “Explain the difference between BERT, RoBERTa, and DistilBERT.”
Hints in Layers
Hint 1: Installation
pip install spacy-transformers
python -m spacy download en_core_web_trf
Hint 2: Basic Usage
nlp = spacy.load("en_core_web_trf")
doc = nlp("Text to process")
# Works just like regular spaCy
Hint 3: Custom Transformer Config
[components.transformer]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
Hint 4: GPU Acceleration
Enable GPU with spacy.prefer_gpu() before loading the model for 5-10x speedup.
Project 18: Production Deployment Pipeline
- Main Programming Language: Python
- Libraries: spaCy, FastAPI, Docker
- Difficulty: Advanced
- Time Estimate: 15-20 hours
What you’ll build: A production-ready NLP API service that handles high-volume text processing with proper error handling, monitoring, scaling, and deployment. This is the capstone that ties everything together.
Why it teaches the concept: Building NLP models is only half the battle—deploying them reliably at scale is equally important. You’ll learn the engineering required for production NLP systems.
Core challenges you’ll face:
- API design → Creating efficient, well-documented endpoints
- Scalability → Handling thousands of concurrent requests
- Monitoring → Tracking performance and errors
- Model updates → Deploying new models without downtime
Key Concepts:
- REST API design: Efficient endpoints for NLP
- Containerization: Docker for reproducible deployments
- Load balancing: Handling high request volumes
Prerequisites: All previous projects, basic DevOps knowledge
Real World Outcome
$ docker build -t nlp-api .
$ docker run -p 8000:8000 nlp-api
NLP API Service Starting...
Loading models: en_core_web_md
Health check: OK
Listening on: http://0.0.0.0:8000
$ curl -X POST "http://localhost:8000/api/v1/analyze" \
-H "Content-Type: application/json" \
-d '{"text": "Apple CEO Tim Cook announced new products.", "tasks": ["ner", "pos"]}'
{
"success": true,
"request_id": "req_abc123",
"processing_time_ms": 45,
"results": {
"entities": [
{"text": "Apple", "label": "ORG", "start": 0, "end": 5},
{"text": "Tim Cook", "label": "PERSON", "start": 10, "end": 18}
],
"tokens": [
{"text": "Apple", "pos": "PROPN", "lemma": "Apple"},
{"text": "CEO", "pos": "NOUN", "lemma": "ceo"},
...
]
}
}
$ curl "http://localhost:8000/api/v1/health"
{
"status": "healthy",
"model_loaded": true,
"requests_processed": 1247,
"average_latency_ms": 38,
"memory_usage_mb": 512
}
$ curl "http://localhost:8000/api/v1/batch" \
-H "Content-Type: application/json" \
-d '{"texts": ["Text 1...", "Text 2...", "Text 3..."], "tasks": ["ner"]}'
{
"success": true,
"batch_size": 3,
"total_time_ms": 89,
"results": [...]
}
The Core Question You’re Answering
“How do we serve NLP models reliably at scale in production?”
A model that works in a Jupyter notebook is far from production-ready. Production requires handling failures, managing resources, monitoring performance, and updating without downtime.
Thinking Exercise
Design a production NLP service for a company processing 1 million documents per day:
- Architecture: How many instances? Load balancing?
- Latency: What P99 latency can you achieve?
- Reliability: How do you handle model failures?
- Updates: How do you deploy new models?
- Monitoring: What metrics do you track?
The Interview Questions They’ll Ask
- “How would you scale an NLP service to handle 10,000 requests per second?”
- “What’s the difference between horizontal and vertical scaling for NLP?”
- “How do you handle model versioning in production?”
- “What monitoring would you set up for an NLP API?”
- “How do you ensure high availability for an NLP service?”
Hints in Layers
Hint 1: FastAPI Setup
from fastapi import FastAPI
from pydantic import BaseModel
import spacy

app = FastAPI()
nlp = spacy.load("en_core_web_md")

class AnalyzeRequest(BaseModel):
    text: str  # request body schema; a bare str param would become a query param

@app.post("/analyze")
async def analyze(req: AnalyzeRequest):
    doc = nlp(req.text)
    return {"entities": [(e.text, e.label_) for e in doc.ents]}
Hint 2: Batch Processing
Use nlp.pipe(texts) for efficient batch processing instead of processing one at a time.
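A quick sketch of the two approaches (a blank pipeline is used so no model download is needed; timings are omitted):

```python
import spacy

nlp = spacy.blank("en")
texts = [f"Document number {i}" for i in range(1000)]

# Slow pattern: one call per text, paying per-call overhead each time
docs_slow = [nlp(t) for t in texts]

# Fast pattern: nlp.pipe streams texts through the pipeline in batches
docs_fast = list(nlp.pipe(texts, batch_size=256))

print(len(docs_fast))  # 1000
```

With a full pipeline (tagger, parser, NER) the batching gain is much larger than with this blank pipeline, because the statistical components can process batches efficiently.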
Hint 3: Async Handling
Use concurrent.futures.ProcessPoolExecutor for CPU-bound NLP tasks so that heavy processing doesn't block the event loop.
Hint 4: Docker Configuration
FROM python:3.10-slim
RUN pip install spacy fastapi uvicorn
RUN python -m spacy download en_core_web_md
COPY . /app
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Project Comparison Table
| Project | Difficulty | Time | Key Skills | Real-World Application |
|---|---|---|---|---|
| 1. Smart Tokenizer | Beginner | 3-4h | Tokenization, spaCy basics | Text preprocessing |
| 2. Grammar Detective | Beginner | 4-5h | POS tagging, visualization | Writing analysis |
| 3. Entity Extractor | Beginner-Int | 5-6h | NER, displacy | Information extraction |
| 4. Dependency Visualizer | Intermediate | 6-7h | Parsing, tree structures | Sentence analysis |
| 5. Lemma vs Stem | Intermediate | 4-5h | Morphology, comparison | Search indexing |
| 6. Sentiment Analyzer | Intermediate | 6-8h | Classification, lexicons | Social media analysis |
| 7. Text Classifier | Intermediate | 8-10h | ML training, evaluation | Content categorization |
| 8. Custom NER | Int-Advanced | 10-12h | Annotation, training | Domain-specific extraction |
| 9. Similarity Engine | Intermediate | 6-8h | Vectors, embeddings | Document search |
| 10. Keyword Extractor | Intermediate | 5-7h | Graph algorithms, TF-IDF | SEO, tagging |
| 11. Text Summarizer | Advanced | 10-12h | Extractive/abstractive | Content condensation |
| 12. QA System | Advanced | 12-15h | Reading comprehension | Virtual assistants |
| 13. Intent Chatbot | Advanced | 15-20h | Dialogue management | Customer support |
| 14. Language Detector | Intermediate | 5-6h | Statistical patterns | Multilingual processing |
| 15. Custom Pipelines | Advanced | 8-10h | spaCy architecture | Custom NLP tools |
| 16. Training from Scratch | Advanced | 15-20h | Full training pipeline | Domain adaptation |
| 17. Transformers | Advanced | 12-15h | BERT, fine-tuning | High-accuracy NLP |
| 18. Production Deploy | Advanced | 15-20h | API, Docker, scaling | Enterprise deployment |
Summary and Learning Paths
By Difficulty Level
Beginner Path (Projects 1-3) Start here if you’re new to NLP. You’ll learn fundamental concepts and spaCy basics.
- Time: 2-3 weeks
- Outcome: Basic NLP pipeline understanding
Intermediate Path (Projects 4-10) For those with Python experience who want practical NLP skills.
- Time: 4-6 weeks
- Outcome: Can build custom NLP applications
Advanced Path (Projects 11-18) For those ready to build production-grade NLP systems.
- Time: 8-12 weeks
- Outcome: Production-ready NLP engineer
By Application Area
Information Extraction: Projects 3, 8, 9, 10
Classification & Sentiment: Projects 6, 7, 14
Understanding & Generation: Projects 11, 12, 13
Production Engineering: Projects 15, 16, 17, 18
Essential Reading Checklist
- “Natural Language Processing with Python and spaCy” by Yuli Vasiliev (No Starch Press)
- spaCy official documentation (spacy.io/usage)
- freeCodeCamp NLP Course by Dr. William Mattingly
- “Speech and Language Processing” by Jurafsky & Martin (free online)
- Explosion AI blog (explosion.ai/blog)
Appendix: Quick Reference
spaCy Model Sizes
| Model | Size | Speed | Accuracy | Use Case |
|---|---|---|---|---|
| `en_core_web_sm` | 12MB | Fastest | Good | Development, prototyping |
| `en_core_web_md` | 40MB | Fast | Better | General use, has vectors |
| `en_core_web_lg` | 560MB | Medium | Best (non-trf) | Production, similarity |
| `en_core_web_trf` | 438MB | Slowest | Best | High-accuracy requirements |
Common Token Attributes
| Attribute | Type | Description |
|---|---|---|
| `text` | str | Original text |
| `lemma_` | str | Base form |
| `pos_` | str | Coarse POS tag |
| `tag_` | str | Fine-grained POS tag |
| `dep_` | str | Dependency relation |
| `ent_type_` | str | Entity type |
| `is_stop` | bool | Is stop word |
| `is_punct` | bool | Is punctuation |
| `vector` | ndarray | Word vector |
Entity Types (English Models)
| Type | Description | Example |
|---|---|---|
| PERSON | People | “Barack Obama” |
| ORG | Organizations | “Apple Inc.” |
| GPE | Countries, cities | “France” |
| LOC | Non-GPE locations | “Mount Everest” |
| DATE | Dates | “June 5, 2023” |
| TIME | Times | “3:00 PM” |
| MONEY | Monetary values | “$500” |
| PERCENT | Percentages | “25%” |
| PRODUCT | Products | “iPhone” |
| EVENT | Events | “Olympics” |
| WORK_OF_ART | Titles | “The Matrix” |
| LAW | Laws | “First Amendment” |
| LANGUAGE | Languages | “English” |
Dependency Relations
| Relation | Description | Example |
|---|---|---|
| nsubj | Nominal subject | “John runs” |
| dobj | Direct object | “I see him” |
| iobj | Indirect object | “Give me the book” |
| prep | Prepositional modifier | “the book on the table” |
| pobj | Object of preposition | “on the table” |
| amod | Adjectival modifier | “the red car” |
| det | Determiner | “the book” |
| ROOT | Root of sentence | Typically the main verb |
Last updated: 2026-01-01 | Total projects: 18 | Estimated completion time: 100-150 hours | Difficulty range: Beginner to Advanced