LEARN JUPYTER NOTEBOOKS DEEP DIVE
Learn Jupyter Notebooks: From Zero to Interactive Computing Master
Goal: Deeply understand Jupyter Notebooks—from basic usage and why they exist, to building interactive data science workflows, visualization dashboards, reproducible research documents, and understanding the underlying kernel architecture.
Why Learn Jupyter Notebooks?
Jupyter Notebooks represent a paradigm shift in how we write, test, and share code. Unlike traditional code files (.py, .js, .c), Jupyter Notebooks are interactive documents that blend code, output, visualizations, and narrative text in a single shareable artifact.
The Problem with Pure Code Files
Traditional programming workflow:
Write code → Run entire file → See output → Modify code → Run again
This creates friction for:
- Exploration: You want to test one idea quickly, but must run the whole file
- Visualization: Plots appear in separate windows, not alongside your code
- Documentation: Comments are separate from rendered explanations
- Sharing: Colleagues see code, but not the results unless they run it themselves
- Teaching: Students can’t see the thought process, only the final result
The Notebook Solution
Jupyter Notebooks workflow:
Write cell → Run cell → See output immediately → Continue experimenting → Share document with code AND results
Why People Choose Notebooks Over Pure Code
| Aspect | Pure Code Files | Jupyter Notebooks |
|---|---|---|
| Execution | Run entire file | Run cells individually |
| Feedback | Output in terminal | Output inline with code |
| Visualization | Separate windows | Embedded in document |
| Documentation | Comments only | Markdown, LaTeX, images |
| Sharing | Code only | Code + outputs + narrative |
| Exploration | Restart for each change | Iterate on state |
| Reproducibility | Re-run everything | Restart & Run All |
Who Uses Jupyter Notebooks?
- Data Scientists: notebooks are the standard tool for exploratory data analysis
- Machine Learning Engineers: Model prototyping and experimentation
- Scientists/Researchers: Reproducible research documents
- Educators: Interactive teaching materials
- Analysts: Reports that combine code, data, and insights
- Engineers: API testing, prototyping, documentation
After completing these projects, you will:
- Understand why notebooks exist and when to use them
- Master interactive data exploration and visualization
- Build reproducible research documents
- Create interactive dashboards and widgets
- Understand the kernel architecture (how code actually runs)
- Know when notebooks are NOT the right tool
- Share your work professionally
Core Concept Analysis
The Notebook Architecture
┌─────────────────────────────────────────────────────────────────────────┐
│ BROWSER / JUPYTER LAB │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Notebook Interface │ │
│ │ │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ [Markdown Cell] │ │ │
│ │ │ # My Analysis │ │ │
│ │ │ This notebook explores... │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ [Code Cell] [Run ▶] │ │ │
│ │ │ import pandas as pd │ │ │
│ │ │ df = pd.read_csv('data.csv') │ │ │
│ │ │ df.head() │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ │ ┌─────────────────────────────────────────────────────────────┐ │ │
│ │ │ [Output] │ │ │
│ │ │ name age salary │ │ │
│ │ │ 0 Alice 28 50000 │ │ │
│ │ │ 1 Bob 35 65000 │ │ │
│ │ │ 2 Carol 42 75000 │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
│
│ WebSocket (ZMQ)
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ JUPYTER SERVER │
│ │
│ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │
│ │ Python Kernel │ │ R Kernel │ │ Julia Kernel │ │
│ │ │ │ │ │ │ │
│ │ Executes code │ │ Executes code │ │ Executes code │ │
│ │ Maintains │ │ Maintains │ │ Maintains │ │
│ │ state (vars) │ │ state (vars) │ │ state (vars) │ │
│ └────────────────┘ └────────────────┘ └────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
Key Concepts Explained
1. Cells: The Building Blocks
┌─────────────────────────────────────────────────────────────────────┐
│ CELL TYPES │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ CODE CELL │ MARKDOWN CELL │
│ ───────────────── │ ─────────────────── │
│ Contains executable code │ Contains formatted text │
│ Runs in the kernel │ Rendered as HTML │
│ Shows output below │ Supports LaTeX math │
│ │ Supports images, links │
│ Example: │ Example: │
│ x = 10 │ # Section Title │
│ print(x * 2) │ This is **bold** text │
│ → 20 │ Formula: $E = mc^2$ │
│ │ │
└─────────────────────────────────────────────────────────────────────┘
2. Kernel: The Execution Engine
The kernel is a separate process that executes your code. Key properties:
┌─────────────────────────────────────────────────────────────────────┐
│ KERNEL STATE │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ # Cell 1: Executed first │
│ x = 10 │
│ │
│ # Cell 2: Executed second │
│ y = x + 5 # x is still in memory! │
│ │
│ # Cell 3: Executed third │
│ print(y) → 15 │
│ │
│ ═══════════════════════════════════════════════════════════════ │
│ │
│ KERNEL MEMORY: │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ x → 10 │ │
│ │ y → 15 │ │
│ │ (All variables persist until kernel restart) │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────┘
Critical Understanding: Execution order matters, not cell position!
Cell [1]: x = 5 # Executed 1st
Cell [3]: z = x + y # Executed 3rd (uses values from 1st and 2nd)
Cell [2]: y = 10 # Executed 2nd
The numbers in brackets show execution order, not position.
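To make this concrete, here is what the kernel actually sees for that example (the cell labels are only comments): state accumulates in execution order, and a cell that references a name defined lower down in the document only works because that lower cell happened to run earlier.
# Cell executed 1st
x = 5

# Cell executed 2nd (positioned last in the notebook)
y = 10

# Cell executed 3rd (positioned second in the notebook)
z = x + y
print(z)   # 15

# If you restart the kernel and run only this last cell, you get
# NameError: name 'x' is not defined; the variables live in the kernel, not in the file.
# "Restart & Run All" re-executes top to bottom from a clean kernel, which is why it is
# the standard check that a notebook is actually reproducible.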
3. The .ipynb File Format
Notebooks are stored as JSON files:
{
  "cells": [
    {
      "cell_type": "markdown",
      "source": ["# My Notebook\n", "This is text"]
    },
    {
      "cell_type": "code",
      "source": ["x = 10\n", "print(x)"],
      "outputs": [
        {
          "output_type": "stream",
          "text": ["10\n"]
        }
      ],
      "execution_count": 1
    }
  ],
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  }
}
This format enables:
- Git diff: See what changed (though it’s noisy)
- Output storage: Results saved with code
- Metadata: Kernel info, widgets state
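Because the file is plain JSON, you can also open it programmatically; a minimal sketch using nbformat (the filename is a placeholder):
import nbformat

# Load the notebook into a dict-like object (version 4 is the current schema)
nb = nbformat.read('analysis.ipynb', as_version=4)

# Every cell has a cell_type and source; code cells also carry outputs and execution_count
for cell in nb.cells:
    first_line = cell.source.splitlines()[0] if cell.source else ''
    print(f"{cell.cell_type:10} | {first_line}")

# Kernel metadata, e.g. 'python3'
print(nb.metadata['kernelspec']['name'])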
4. Jupyter Ecosystem
┌─────────────────────────────────────────────────────────────────────┐
│ JUPYTER ECOSYSTEM │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ INTERFACES: │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Jupyter │ │ JupyterLab │ │ VSCode │ │
│ │ Notebook │ │ (Modern) │ │ Extension │ │
│ │ (Classic) │ │ │ │ │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ KERNELS: │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Python │ │ R │ │ Julia │ │
│ │ (ipykernel)│ │ (IRkernel) │ │ (IJulia) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Scala │ │ JavaScript│ │ C++ │ │
│ │ (Almond) │ │ (IJavaSc) │ │ (xeus-cling│ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ EXTENSIONS: │
│ • nbextensions (Notebook) • jupyterlab-git │
│ • widgets (ipywidgets) • jupyterlab-code-formatter │
│ • voila (dashboards) • variable-inspector │
│ │
│ CLOUD PLATFORMS: │
│ • Google Colab • Kaggle Notebooks │
│ • AWS SageMaker • Azure Notebooks │
│ • Binder • Deepnote │
│ │
└─────────────────────────────────────────────────────────────────────┘
5. When NOT to Use Notebooks
Notebooks are powerful but not always appropriate:
| Use Notebooks For | Use Pure Code For |
|---|---|
| Exploration | Production systems |
| Prototyping | Libraries/packages |
| Teaching | CLI tools |
| Reports | Large applications |
| Visualization | Version-controlled code |
| Quick experiments | Team collaboration |
Project List
The following 14 projects will teach you Jupyter Notebooks from basics to advanced interactive computing.
Project 1: Interactive Data Explorer
- File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: R, Julia
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 1: Beginner
- Knowledge Area: Data Exploration / Pandas
- Software or Tool: Jupyter Notebook, Pandas, Matplotlib
- Main Book: “Python for Data Analysis” by Wes McKinney
What you’ll build: An interactive data exploration notebook that loads a dataset (CSV, JSON, or Excel), performs summary statistics, handles missing values, creates visualizations, and documents findings—all in one shareable document.
Why it teaches Jupyter: This is the core use case for notebooks. You’ll immediately understand why data scientists prefer notebooks over plain Python scripts: see your data transformations instantly, iterate on visualizations, and create a document that tells a story.
Core challenges you’ll face:
- Cell execution order → maps to understanding kernel state and variable persistence
- Inline visualization → maps to matplotlib integration with %matplotlib inline
- DataFrame display → maps to rich output representation
- Mixing code and narrative → maps to markdown cells for documentation
Resources for key challenges:
- “Python for Data Analysis” Chapter 4 - IPython and Jupyter basics
- Jupyter Documentation - User Interface
Key Concepts:
- Cell Execution: Jupyter Documentation - Running Code
- Magic Commands: IPython Documentation - Built-in Magics
- Inline Plotting: Matplotlib Documentation - Interactive Mode
- DataFrame Display: Pandas Documentation - Options and Settings
Difficulty: Beginner Time estimate: Weekend Prerequisites: Basic Python, understanding of CSV files
Real world outcome:
Your notebook will display:
# Sales Data Analysis - Q4 2024
## 1. Data Loading
[Code] df = pd.read_csv('sales.csv')
df.head()
[Output] Beautiful formatted table showing first 5 rows
## 2. Summary Statistics
[Code] df.describe()
[Output] Count, mean, std, min, max for all numeric columns
## 3. Missing Values Analysis
[Code] df.isnull().sum()
[Output] Column-by-column missing value counts
## 4. Visualization
[Code] plt.figure(figsize=(12, 6))
df.plot(kind='bar')
[Output] Bar chart embedded directly in the notebook
## 5. Key Findings
[Markdown] Based on the analysis:
- Sales peaked in November
- Region X underperformed by 15%
- 3 outliers identified for investigation
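A minimal sketch of the code cells behind that outcome (the file name 'sales.csv' and the column names are placeholders for whatever dataset you choose; each numbered step would be its own cell so its last expression displays):
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

# 1. Load
df = pd.read_csv('sales.csv')
df.head()

# 2. Summary statistics
df.describe()

# 3. Missing values per column
df.isnull().sum()

# 4. Visualization: total sales per region
df.groupby('region')['sales'].sum().plot(kind='bar', figsize=(12, 6))
plt.title('Sales by Region')
plt.show()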
Implementation Hints:
Start by understanding the notebook interface:
- Create a new notebook (New → Python 3)
- First cell: Import libraries with the `%matplotlib inline` magic
- Second cell: Load data with pandas
- Use Shift+Enter to run cells and move to next
- Use markdown cells (change cell type) for documentation
Key questions to explore:
- What happens if you run cell 3 before cell 2?
- What is the `[*]` that appears while a cell is running?
- How do you restart the kernel and clear all outputs?
- What’s the difference between `df` (displays nicely) and `print(df)` (plain text)? (see the example below)
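On that last question: a bare `df` at the end of a cell goes through IPython's rich display system (an HTML table for DataFrames), while `print(df)` writes plain text to stdout; `display()` invokes the rich path explicitly:
import pandas as pd
from IPython.display import display

df = pd.DataFrame({'name': ['Alice', 'Bob'], 'age': [28, 35]})

print(df)     # plain text via stdout
display(df)   # rich HTML table, same as ending the cell with just `df`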
Magic commands to learn:
%matplotlib inline # Plots appear in notebook
%timeit expression # Time how long code takes
%who # List all variables
%reset # Clear all variables
%%time # Time the entire cell
Learning milestones:
- Run your first cell → Understand cell execution
- Create inline plot → See why notebooks beat scripts
- Mix code and markdown → Build a narrative document
- Share the .ipynb file → Others see your code AND results
Project 2: Literate Programming Tutorial
- File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
- Main Programming Language: Python (with Markdown)
- Alternative Programming Languages: Any with Jupyter kernel
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 1: Beginner
- Knowledge Area: Documentation / Technical Writing
- Software or Tool: Jupyter Notebook, nbconvert
- Main Book: “The Pragmatic Programmer” by Hunt and Thomas
What you’ll build: A complete programming tutorial that teaches a concept (like sorting algorithms or regex) using a notebook—combining explanations, code examples, exercises, and solutions.
Why it teaches Jupyter: Notebooks originated from the idea of “literate programming” (Donald Knuth): code should be embedded in documentation, not documentation embedded in code. Building a tutorial forces you to use all of Jupyter’s documentation features.
Core challenges you’ll face:
- Markdown mastery → maps to headers, lists, emphasis, code blocks
- LaTeX for math → maps to inline and block equations
- Exercise design → maps to code cells for practice
- Export to HTML/PDF → maps to nbconvert for sharing
Resources for key challenges:
Key Concepts:
- Literate Programming: “Literate Programming” by Donald Knuth
- Markdown Syntax: GitHub Markdown Guide
- LaTeX Math: Overleaf LaTeX Documentation
- nbconvert: Jupyter Documentation - Converting Notebooks
Difficulty: Beginner Time estimate: 3-5 days Prerequisites: Project 1 (Interactive Data Explorer)
Real world outcome:
Your tutorial notebook will contain:
# Understanding Sorting Algorithms
## Introduction
Sorting is fundamental to computer science. We'll explore three algorithms:
- Bubble Sort: O(n²) - Simple but slow
- Merge Sort: O(n log n) - Divide and conquer
- Quick Sort: O(n log n) average - Fast in practice
## Big O Notation
The time complexity is expressed as:
$$T(n) = O(f(n))$$
Where $f(n)$ describes how time grows with input size $n$.
## Bubble Sort
### Theory
Compare adjacent elements and swap if out of order...
### Implementation
[Code Cell - Students fill in]
### Visualization
[Code Cell - Animated sorting visualization]
## Exercises
1. Implement bubble sort
2. Count comparisons
3. Compare with built-in `sorted()`
## Solutions (Hidden by default)
[Code Cell - Solutions]
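As a concrete example of what a solution cell could contain, here is a minimal bubble sort with a comparison counter, covering exercises 1-3 (the function name and test list are just examples):
def bubble_sort(items):
    """Return (sorted_copy, number_of_comparisons) using bubble sort."""
    data = list(items)
    comparisons = 0
    for i in range(len(data)):
        for j in range(len(data) - 1 - i):
            comparisons += 1
            if data[j] > data[j + 1]:
                data[j], data[j + 1] = data[j + 1], data[j]   # swap adjacent pair
    return data, comparisons

result, count = bubble_sort([5, 2, 9, 1, 7])
assert result == sorted([5, 2, 9, 1, 7])   # exercise 3: agree with built-in sorted()
print(result, count)                        # sorted list plus the comparison count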
Implementation Hints:
Markdown essentials:
# Heading 1
## Heading 2
**bold** and *italic*
- Bullet list
1. Numbered list
`inline code`
```python
code block
```
[Link text](url)
![Alt text](image.png)
> Blockquote
--- (horizontal rule)
LaTeX math:
Inline: $E = mc^2$
Block:
$$
\sum_{i=1}^{n} i = \frac{n(n+1)}{2}
$$
Export your tutorial:
# Convert to HTML
jupyter nbconvert --to html tutorial.ipynb
# Convert to PDF (requires LaTeX)
jupyter nbconvert --to pdf tutorial.ipynb
# Convert to slides
jupyter nbconvert --to slides tutorial.ipynb --post serve
Learning milestones:
- Master markdown → Headers, lists, emphasis
- Add math equations → LaTeX integration
- Include images → Visual explanations
- Export to HTML → Share with non-Jupyter users
Project 3: Reproducible Research Document
- File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: R, Julia
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Scientific Computing / Reproducibility
- Software or Tool: Jupyter, Papermill, requirements.txt
- Main Book: “Reproducible Research with R and RStudio” by Christopher Gandrud
What you’ll build: A complete reproducible analysis that downloads data from an API, performs statistical analysis, generates publication-quality figures, and can be re-run by anyone to verify results.
Why it teaches Jupyter: The “reproducibility crisis” in science stems from analyses that can’t be replicated. This project teaches you to build notebooks that anyone can run and get the same results—essential for research and professional analysis.
Core challenges you’ll face:
- Environment management → maps to conda/pip requirements
- Random seed control → maps to reproducible random numbers
- API data fetching → maps to handling external data
- Figure export → maps to publication-ready graphics
Resources for key challenges:
- “Python for Data Analysis” Chapter 10 - Data Loading
- Jupyter Reproducibility Guide
Key Concepts:
- Virtual Environments: Python Packaging Guide
- Random Seeds: NumPy random documentation
- Data Caching: Requests-cache library
- Figure Export: Matplotlib savefig documentation
Difficulty: Intermediate Time estimate: 1 week Prerequisites: Projects 1-2, understanding of statistics basics
Real world outcome:
Repository structure:
project/
├── README.md
│ "How to reproduce this analysis"
├── requirements.txt
│ pandas==2.0.0
│ matplotlib==3.7.0
│ requests==2.28.0
├── environment.yml
│ (conda environment)
├── data/
│ └── .gitkeep
│ (data downloaded on run)
├── figures/
│ └── figure1.png
│ └── figure2.pdf
├── analysis.ipynb
│ (your reproducible notebook)
└── run_analysis.sh
"pip install -r requirements.txt"
"jupyter nbconvert --execute analysis.ipynb"
When others run your notebook:
$ git clone your-repo
$ pip install -r requirements.txt
$ jupyter nbconvert --execute analysis.ipynb
# → Same results, same figures, verified!
Implementation Hints:
Environment reproducibility:
# Cell 1: Version check
import sys
print(f"Python: {sys.version}")
import pandas as pd
print(f"Pandas: {pd.__version__}")
# Set random seed for reproducibility
import numpy as np
np.random.seed(42)
Generate requirements:
pip freeze > requirements.txt
# Or better, use pip-compile (pip-tools)
pip-compile requirements.in
Data caching:
import os
import requests
DATA_FILE = 'data/dataset.csv'
if not os.path.exists(DATA_FILE):
    response = requests.get('https://api.example.com/data')
    with open(DATA_FILE, 'w') as f:
        f.write(response.text)

df = pd.read_csv(DATA_FILE)
Publication-quality figures:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 6), dpi=300)
# ... plotting code ...
fig.savefig('figures/figure1.pdf', bbox_inches='tight')
fig.savefig('figures/figure1.png', dpi=300, bbox_inches='tight')
Learning milestones:
- Pin dependencies → Same package versions everywhere
- Control randomness → Reproducible random numbers
- Cache data → Don’t re-download every run
- Export figures → Publication-ready graphics
Project 4: Interactive Visualization Dashboard
- File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: R (with Shiny in notebooks)
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Data Visualization / Interactive Computing
- Software or Tool: ipywidgets, Plotly, Voilà
- Main Book: “Interactive Data Visualization for the Web” by Scott Murray
What you’ll build: An interactive dashboard with sliders, dropdowns, and buttons that update visualizations in real-time—all within a Jupyter notebook, deployable as a standalone web app with Voilà.
Why it teaches Jupyter: Jupyter’s widget system transforms notebooks from static documents into interactive applications. This project teaches the full power of ipywidgets and how to deploy notebooks as dashboards.
Core challenges you’ll face:
- Widget types → maps to sliders, dropdowns, text inputs, buttons
- Reactive updates → maps to observe and link mechanisms
- Layout design → maps to HBox, VBox, GridBox
- Voilà deployment → maps to notebooks as web apps
Resources for key challenges:
Key Concepts:
- Widget Basics: ipywidgets User Guide
- Linking Widgets: ipywidgets - Widget Events
- Layout Widgets: ipywidgets - Widget Layout
- Voilà Deployment: Voilà Documentation
Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Projects 1-3, matplotlib/plotly basics
Real world outcome:
Your dashboard displays:
┌─────────────────────────────────────────────────────────────────────┐
│ 📊 Sales Analytics Dashboard │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Date Range: [===========|=======] Jan 2024 - Dec 2024 │
│ │
│ Region: [Dropdown: All ▼] Product: [Dropdown: All ▼] │
│ │
│ ┌─────────────────────────────────┐ ┌────────────────────────────┐ │
│ │ │ │ │ │
│ │ Monthly Sales Trend │ │ Sales by Category │ │
│ │ (Line Chart - Updates │ │ (Pie Chart - Updates │ │
│ │ when filters change) │ │ when filters change) │ │
│ │ │ │ │ │
│ └─────────────────────────────────┘ └────────────────────────────┘ │
│ │
│ Total Sales: $1,234,567 YoY Growth: +15.3% Top Region: West │
│ │
│ [📥 Download Report] [🔄 Refresh Data] │
│ │
└─────────────────────────────────────────────────────────────────────┘
Deploy with Voilà:
$ voila dashboard.ipynb
→ Opens as standalone web app (code hidden, only widgets visible)
Implementation Hints:
Basic widget pattern:
import ipywidgets as widgets
from IPython.display import display
# Create widgets
slider = widgets.IntSlider(
    value=50,
    min=0,
    max=100,
    description='Filter:'
)
output = widgets.Output()

# Update function
def update(change):
    with output:
        output.clear_output()
        filtered = df[df['value'] <= change['new']]
        # Create updated plot
        plt.figure()
        filtered.plot(kind='bar')
        plt.show()

slider.observe(update, names='value')

# Display
display(slider, output)
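For simple cases there is a shorter route than wiring up observe by hand: the interact helper builds the slider and the callback in one step (a sketch that assumes the same df with a 'value' column as above):
import matplotlib.pyplot as plt
from ipywidgets import interact

@interact(threshold=(0, 100))
def show_filtered(threshold=50):
    # Re-runs automatically whenever the slider moves
    filtered = df[df['value'] <= threshold]   # df defined earlier in the notebook
    filtered.plot(kind='bar')
    plt.show()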
Layout widgets:
from ipywidgets import HBox, VBox, GridBox, Layout
dashboard = VBox([
    HBox([date_slider, region_dropdown]),
    HBox([
        widgets.Output(layout=Layout(width='50%')),
        widgets.Output(layout=Layout(width='50%'))
    ]),
    HBox([download_button, refresh_button])
])
display(dashboard)
Deploy with Voilà:
# Install
pip install voila
# Run
voila dashboard.ipynb
# With custom template
voila dashboard.ipynb --template=material
Learning milestones:
- Use basic widgets → Sliders, dropdowns, buttons
- Link widgets to outputs → Reactive updates
- Design layouts → Professional-looking dashboard
- Deploy with Voilà → Share as web app
Project 5: Kernel Architecture Deep Dive
- File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: Any (for multi-kernel exploration)
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The “Resume Gold”
- Difficulty: Level 3: Advanced
- Knowledge Area: Systems Programming / IPC
- Software or Tool: Jupyter, ZeroMQ, ipykernel
- Main Book: “The Linux Programming Interface” by Michael Kerrisk
What you’ll build: A minimal Jupyter kernel from scratch that executes code and returns output, teaching you exactly how the notebook-to-kernel communication works.
Why it teaches Jupyter: Most users treat the kernel as magic. Building one reveals the ZeroMQ messaging protocol, the separation between frontend and backend, and why notebooks can support any programming language.
Core challenges you’ll face:
- ZeroMQ messaging → maps to request-reply, publish-subscribe patterns
- Message protocol → maps to Jupyter wire protocol
- State management → maps to execution count, variable storage
- Output handling → maps to stdout, stderr, display data
Resources for key challenges:
Key Concepts:
- ZeroMQ Basics: “ZeroMQ Guide” by Pieter Hintjens
- Jupyter Wire Protocol: Jupyter Messaging Documentation
- IPython Kernel: ipykernel source code
- Process Communication: “The Linux Programming Interface” Ch. 43
Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 1-4, understanding of sockets/IPC
Real world outcome:
Your minimal kernel:
# my_kernel.py
from ipykernel.kernelbase import Kernel
class MyKernel(Kernel):
    implementation = 'My Kernel'
    implementation_version = '1.0'
    language = 'python'
    language_version = '3.10'
    language_info = {'name': 'python', 'mimetype': 'text/x-python', 'file_extension': '.py'}
    banner = "My Custom Kernel"

    def do_execute(self, code, silent, store_history=True,
                   user_expressions=None, allow_stdin=False):
        # Execute code (toy evaluator; a real kernel would compile and exec properly)
        try:
            result = eval(code)
            if not silent:
                stream_content = {'name': 'stdout', 'text': str(result)}
                self.send_response(self.iopub_socket, 'stream', stream_content)
            return {'status': 'ok', 'execution_count': self.execution_count,
                    'payload': [], 'user_expressions': {}}
        except Exception as e:
            # Report the error on stderr and in the execute_reply
            if not silent:
                self.send_response(self.iopub_socket, 'stream',
                                   {'name': 'stderr', 'text': f'{type(e).__name__}: {e}'})
            return {'status': 'error', 'execution_count': self.execution_count,
                    'ename': type(e).__name__, 'evalue': str(e), 'traceback': []}

if __name__ == '__main__':
    from ipykernel.kernelapp import IPKernelApp
    IPKernelApp.launch_instance(kernel_class=MyKernel)
# Install the kernel
$ python -m my_kernel install
# Use in Jupyter
$ jupyter notebook
# Select "My Kernel" from kernel menu
# See the messages flying
$ jupyter console --existing
Kernel responds to:
- execute_request
- kernel_info_request
- complete_request (autocomplete)
- inspect_request (documentation)
Implementation Hints:
The Jupyter architecture:
Frontend (Notebook)
│
│ ZeroMQ (tcp://...)
│
▼
┌──────────────────────────────────────────────────────┐
│ KERNEL PROCESS │
│ │
│ Shell Channel (execute, complete, inspect) │
│ IOPub Channel (stdout, stderr, display_data) │
│ Stdin Channel (input() requests) │
│ Control Channel (shutdown, interrupt) │
│ Heartbeat Channel (alive check) │
│ │
└──────────────────────────────────────────────────────┘
Message format:
{
  'header': {
    'msg_id': 'uuid',
    'msg_type': 'execute_request',
    'session': 'uuid',
  },
  'parent_header': {...},
  'metadata': {...},
  'content': {
    'code': 'print("hello")',
    'silent': False
  }
}
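You can watch this protocol in action without writing a kernel, using jupyter_client (the same library the server uses) to start a kernel, send an execute_request, and dump what comes back on IOPub; a rough sketch:
from queue import Empty
from jupyter_client import KernelManager

km = KernelManager(kernel_name='python3')
km.start_kernel()
kc = km.client()
kc.start_channels()

msg_id = kc.execute("print('hello from the kernel')")   # shell channel: execute_request

# Drain the IOPub channel until the kernel reports it is idle for this request
while True:
    try:
        msg = kc.get_iopub_msg(timeout=5)
    except Empty:
        break
    if msg['parent_header'].get('msg_id') != msg_id:
        continue
    print(msg['msg_type'], '->', msg['content'])
    if msg['msg_type'] == 'status' and msg['content']['execution_state'] == 'idle':
        break

kc.stop_channels()
km.shutdown_kernel()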
Questions to explore:
- What happens when you press Shift+Enter?
- How does tab-completion work?
- How are plots displayed inline?
- What happens when you restart the kernel?
Learning milestones:
- Understand ZeroMQ sockets → Communication channels
- Parse Jupyter messages → Wire protocol
- Handle execute_request → Run code and return output
- Install custom kernel → Use in real notebooks
Project 6: Multi-Language Notebook
- File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
- Main Programming Language: Python, R, Julia
- Alternative Programming Languages: SQL, Bash, JavaScript
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 2. The “Micro-SaaS / Pro Tool”
- Difficulty: Level 2: Intermediate
- Knowledge Area: Polyglot Programming / Data Pipelines
- Software or Tool: Jupyter, IRkernel, IJulia, BeakerX
- Main Book: “Data Science at the Command Line” by Jeroen Janssens
What you’ll build: A single notebook that uses multiple programming languages: Python for data processing, R for statistical analysis, SQL for database queries, and JavaScript for visualization—passing data between them.
Why it teaches Jupyter: The name Jupyter itself is a nod to Julia, Python, and R, the first three kernel languages it supported. This project teaches you the kernel architecture by using multiple kernels and passing data between languages.
Core challenges you’ll face:
- Multiple kernels → maps to installing and switching kernels
- Data passing → maps to rpy2, julia, or file-based
- Magic commands → maps to %%R, %%javascript, %%sql
- Environment setup → maps to managing multiple language environments
Resources for key challenges:
Key Concepts:
- Kernel Installation: Jupyter Documentation - Installing Kernels
- rpy2 Python-R Bridge: rpy2 Documentation
- SQL Magic: ipython-sql Documentation
- Polyglot Notebooks: BeakerX Documentation
Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Project 5, basic R/SQL knowledge
Real world outcome:
# Cell 1: Python - Load data
import pandas as pd
df = pd.read_csv('sales.csv')
print(f"Loaded {len(df)} rows")
# Cell 2: SQL Magic - Query data
%load_ext sql
%sql sqlite:///sales.db
%%sql
SELECT region, SUM(sales) as total
FROM sales_table
GROUP BY region
ORDER BY total DESC
LIMIT 5
# Cell 3: R Magic - Statistical analysis
%load_ext rpy2.ipython
%%R -i df
# df is passed from Python!
model <- lm(sales ~ advertising + price, data=df)
summary(model)
# Cell 4: JavaScript - Interactive visualization
%%javascript
// Access data passed from Python
require(['d3'], function(d3) {
// Create D3 visualization
d3.select('#viz').append('svg')...
});
# Cell 5: Bash - System commands
%%bash
wc -l sales.csv
head -5 sales.csv
Implementation Hints:
Install additional kernels:
# R kernel
$ Rscript -e "install.packages('IRkernel')"
$ Rscript -e "IRkernel::installspec()"
# Julia kernel
$ julia -e 'using Pkg; Pkg.add("IJulia")'
# SQL magic
$ pip install ipython-sql
# R magic
$ pip install rpy2
Data passing between languages:
# Python to R with rpy2
%load_ext rpy2.ipython
df = pd.DataFrame({'x': [1,2,3], 'y': [4,5,6]})
%%R -i df -o result
# df comes from Python
result <- summary(df)
# result goes back to Python
# Alternative: Use files
df.to_csv('temp.csv')
# Then read in R, Julia, etc.
Learning milestones:
- Install multiple kernels → R, Julia, SQL
- Use magic commands → %%R, %%sql, %%bash
- Pass data between languages → Python ↔ R
- Build polyglot pipeline → Best tool for each job
Project 7: Machine Learning Experiment Tracker
- File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: R, Julia
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Machine Learning / MLOps
- Software or Tool: Jupyter, MLflow, Weights & Biases
- Main Book: “Hands-On Machine Learning” by Aurélien Géron
What you’ll build: A notebook-based ML experiment tracking system that logs hyperparameters, metrics, model artifacts, and visualizations—making ML experiments reproducible and comparable.
Why it teaches Jupyter: ML development is inherently iterative and exploratory—perfect for notebooks. This project teaches best practices for ML in notebooks: experiment tracking, model versioning, and result comparison.
Core challenges you’ll face:
- Experiment logging → maps to MLflow, W&B integration
- Hyperparameter tracking → maps to parameter logging
- Artifact storage → maps to models, plots, data versions
- Comparison dashboards → maps to comparing runs
Resources for key challenges:
Key Concepts:
- Experiment Tracking: “MLOps” by Mark Treveil
- Model Registry: MLflow Model Registry docs
- Hyperparameter Logging: W&B Experiment Tracking
- Artifact Management: MLflow Artifacts docs
Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Projects 1-3, basic ML knowledge
Real world outcome:
# In your notebook:
import mlflow
import mlflow.sklearn
# Start experiment
mlflow.set_experiment("sales-prediction")
with mlflow.start_run():
    # Log parameters
    mlflow.log_param("model", "RandomForest")
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)

    # Train model
    model = RandomForestRegressor(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)

    # Log metrics
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    mlflow.log_metric("train_r2", train_score)
    mlflow.log_metric("test_r2", test_score)

    # Log model
    mlflow.sklearn.log_model(model, "model")

    # Log figure
    fig, ax = plt.subplots()
    ax.scatter(y_test, model.predict(X_test))
    mlflow.log_figure(fig, "predictions.png")
# View all experiments:
$ mlflow ui
# Opens dashboard comparing all runs!
┌─────────────────────────────────────────────────────────────────────┐
│ MLflow Experiment Dashboard │
├─────────────────────────────────────────────────────────────────────┤
│ Run │ model │ n_estimators │ test_r2 │ Duration │
├─────────────────────────────────────────────────────────────────────┤
│ run_001 │ RandomForest │ 100 │ 0.85 │ 2m 34s │
│ run_002 │ GradientBoost │ 200 │ 0.87 │ 4m 12s │
│ run_003 │ RandomForest │ 500 │ 0.86 │ 8m 45s │
└─────────────────────────────────────────────────────────────────────┘
Implementation Hints:
MLflow setup:
# Install
pip install mlflow
# Start tracking server
mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./artifacts
# Or use W&B (cloud-based)
pip install wandb
wandb login
Best practices for ML notebooks:
- One experiment per run → Clear boundaries
- Log everything → Parameters, metrics, artifacts
- Version data → DVC or artifact logging
- Save notebooks with outputs → Reproducibility
- Use tags → Easy filtering and search
Questions to explore:
- How do you compare two experiment runs?
- How do you load a model from a previous run? (see the sketch after this list)
- How do you version your training data?
- What should go in version control vs. artifact storage?
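On the model-loading question above, MLflow can look up past runs and reload their artifacts directly; a minimal sketch (the experiment name matches the one used earlier, and X_test is assumed from the training cells):
import mlflow
import mlflow.sklearn

# Find the best run of the experiment by a logged metric
runs = mlflow.search_runs(experiment_names=["sales-prediction"],
                          order_by=["metrics.test_r2 DESC"])
best_run_id = runs.iloc[0]["run_id"]

# Reload the model that run logged under the artifact path "model"
model = mlflow.sklearn.load_model(f"runs:/{best_run_id}/model")
predictions = model.predict(X_test)   # X_test prepared earlier in the notebook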
Learning milestones:
- Set up MLflow → Local tracking server
- Log first experiment → Parameters and metrics
- Compare runs → Find best hyperparameters
- Load previous model → Reproducibility
Project 8: Automated Report Generator
- File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: R (with knitr)
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Automation / Reporting
- Software or Tool: Papermill, nbconvert, scheduling
- Main Book: “Automate the Boring Stuff with Python” by Al Sweigart
What you’ll build: A parameterized notebook that generates automated reports—daily sales reports, weekly metrics, monthly analyses—run via cron or scheduled tasks, outputting PDF/HTML reports.
Why it teaches Jupyter: Notebooks aren’t just for interactive work—they can be automated. Papermill lets you parameterize notebooks and run them programmatically, turning notebooks into report-generation pipelines.
Core challenges you’ll face:
- Parameterization → maps to Papermill parameters
- Programmatic execution → maps to nbconvert --execute
- Output formats → maps to PDF, HTML, slides
- Scheduling → maps to cron, Task Scheduler, Airflow
Resources for key challenges:
Key Concepts:
- Parameterized Notebooks: Papermill User Guide
- Headless Execution: nbconvert execute preprocessor
- Template Customization: nbconvert templates
- Scheduling: cron, Airflow documentation
Difficulty: Intermediate Time estimate: 1 week Prerequisites: Projects 1-3, command-line familiarity
Real world outcome:
# Template notebook: sales_report_template.ipynb
# Parameters cell (tagged with "parameters")
report_date = "2024-01-01" # Will be overwritten
region = "all" # Will be overwritten
# Analysis cells
df = load_data(start=report_date)
df = df[df['region'] == region] if region != "all" else df
# Visualizations...
# Summary statistics...
# Markdown conclusions...
# Run with Papermill:
$ papermill sales_report_template.ipynb \
reports/sales_2024-01-15_west.ipynb \
-p report_date "2024-01-15" \
-p region "west"
# Convert to PDF:
$ jupyter nbconvert --to pdf reports/sales_2024-01-15_west.ipynb
# Automate with cron (daily at 6 AM):
0 6 * * * cd /project && ./generate_reports.sh
# Result: Every morning, stakeholders receive:
┌─────────────────────────────────────────────────────────────────────┐
│ Daily Sales Report │
│ January 15, 2024 - West Region │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ Executive Summary │
│ ───────────────── │
│ Total Sales: $234,567 │
│ YoY Change: +12.3% │
│ │
│ [Chart: Daily Sales Trend] │
│ │
│ [Chart: Top Products] │
│ │
│ Key Insights: │
│ • Product A exceeded targets by 15% │
│ • Marketing campaign drove 20% traffic increase │
│ │
└─────────────────────────────────────────────────────────────────────┘
Implementation Hints:
Papermill workflow:
# 1. Create template with parameters cell
# Tag a cell with "parameters" in cell metadata
# 2. Run with papermill
import papermill as pm
pm.execute_notebook(
    'template.ipynb',
    'output.ipynb',
    parameters={
        'report_date': '2024-01-15',
        'region': 'west'
    }
)
# 3. Convert to PDF
from nbconvert import PDFExporter
import nbformat
nb = nbformat.read('output.ipynb', as_version=4)
pdf_exporter = PDFExporter()
pdf_data, resources = pdf_exporter.from_notebook_node(nb)
with open('report.pdf', 'wb') as f:
    f.write(pdf_data)
Scheduling script:
#!/bin/bash
# generate_reports.sh
DATE=$(date +%Y-%m-%d)
for region in north south east west; do
    papermill template.ipynb \
        "reports/${DATE}_${region}.ipynb" \
        -p report_date "$DATE" \
        -p region "$region"
    jupyter nbconvert --to pdf "reports/${DATE}_${region}.ipynb"
done
# Email reports
mutt -a reports/*.pdf -- stakeholders@company.com < email_body.txt
Learning milestones:
- Parameterize a notebook → Tag parameters cell
- Run with Papermill → Programmatic execution
- Convert to PDF → Professional output
- Schedule with cron → Fully automated
Project 9: JupyterLab Extension
- File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
- Main Programming Language: TypeScript/JavaScript
- Alternative Programming Languages: Python (for backend)
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Frontend Development / Plugin Architecture
- Software or Tool: JupyterLab, Node.js, TypeScript
- Main Book: “Programming TypeScript” by Boris Cherny
What you’ll build: A JupyterLab extension that adds new functionality—a custom sidebar, new toolbar buttons, or a completely new view—teaching you the JupyterLab extension architecture.
Why it teaches Jupyter: JupyterLab is built as a collection of extensions. Understanding how to extend it teaches you the modular architecture, Lumino widgets (the successor to PhosphorJS), and the JupyterLab plugin system.
Core challenges you’ll face:
- JupyterLab architecture → maps to plugins, services, widgets
- TypeScript/React → maps to modern frontend development
- Extension API → maps to commands, menus, sidebars
- Building and publishing → maps to npm, conda-forge
Resources for key challenges:
Key Concepts:
- Plugin System: JupyterLab Extension Guide
- Lumino Widgets: Lumino Documentation
- JupyterLab Services: @jupyterlab/services API
- Extension Publishing: jupyter-packaging docs
Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Projects 1-5, TypeScript/React experience
Real world outcome:
Your extension adds a "Code Snippets" sidebar:
┌─────────────────────────────────────────────────────────────────────┐
│ JupyterLab │
├──────────────┬──────────────────────────────────────────────────────┤
│ │ │
│ 📁 Files │ [Notebook] │
│ │ │
│ 🔧 Running │ [1]: import pandas as pd │
│ │ │
│ 📝 Snippets │ [2]: df = pd.read_csv('data.csv') │
│ ───────────│ │
│ > Data Load │ [3]: df.head() │
│ - CSV │ │
│ - JSON │ │
│ - SQL │ │
│ > Plotting │ │
│ - Line │ │
│ - Bar │ │
│ - Scatter │ │
│ > ML │ │
│ - Train │ │
│ - Eval │ │
│ │ │
│ [+ Add] │ │
│ │ │
└──────────────┴──────────────────────────────────────────────────────┘
Clicking a snippet inserts it into the current cell!
Implementation Hints:
Extension structure:
my-extension/
├── package.json # Dependencies, scripts
├── tsconfig.json # TypeScript config
├── src/
│ └── index.ts # Plugin entry point
├── style/
│ └── index.css # Styles
└── schema/
└── plugin.json # Settings schema
Basic plugin structure:
// src/index.ts
import { JupyterFrontEnd, JupyterFrontEndPlugin } from '@jupyterlab/application';
import { ICommandPalette } from '@jupyterlab/apputils';
const plugin: JupyterFrontEndPlugin<void> = {
  id: 'my-extension:plugin',
  autoStart: true,
  requires: [ICommandPalette],
  activate: (app: JupyterFrontEnd, palette: ICommandPalette) => {
    console.log('Extension activated!');
    // Add a command
    app.commands.addCommand('my-extension:hello', {
      label: 'Say Hello',
      execute: () => {
        alert('Hello from my extension!');
      }
    });
    // Add to command palette
    palette.addItem({
      command: 'my-extension:hello',
      category: 'My Extension'
    });
  }
};

export default plugin;
Build and install:
# Create from cookiecutter template
pip install cookiecutter
cookiecutter https://github.com/jupyterlab/extension-cookiecutter-ts
# Install dependencies
cd my-extension
npm install
# Build
npm run build
# Install in JupyterLab
pip install -e .
jupyter labextension develop . --overwrite
# Watch for changes during development
npm run watch
Learning milestones:
- Create from template → Cookiecutter setup
- Add simple command → Command palette integration
- Create sidebar widget → Lumino widgets
- Publish to npm → Share with community
Project 10: Real-Time Collaborative Notebook
- File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
- Main Programming Language: Python (backend), TypeScript (frontend)
- Alternative Programming Languages: None
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Distributed Systems / Real-Time Collaboration
- Software or Tool: JupyterHub, jupyter-collaboration
- Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann
What you’ll build: A real-time collaborative notebook system where multiple users can edit the same notebook simultaneously—like Google Docs for code. Understand the CRDT algorithms that make this possible.
Why it teaches Jupyter: Real-time collaboration is the frontier of notebook technology. JupyterLab now supports this via CRDTs (Conflict-free Replicated Data Types). Understanding this teaches distributed systems concepts through a practical lens.
Core challenges you’ll face:
- JupyterHub setup → maps to multi-user deployment
- Real-time sync → maps to Y.js, CRDTs
- Conflict resolution → maps to CRDT merge semantics (vs. operational transformation)
- User presence → maps to cursors, selections
Resources for key challenges:
Key Concepts:
- CRDTs: “Designing Data-Intensive Applications” Ch. 5
- Operational Transformation: Google Docs engineering blog
- WebSocket Communication: MDN WebSocket documentation
- JupyterHub Architecture: JupyterHub Technical Overview
Difficulty: Expert Time estimate: 4-6 weeks Prerequisites: Projects 5 and 9, distributed systems basics
Real world outcome:
Two users editing the same notebook simultaneously:
┌─────────────────────────────┐ ┌─────────────────────────────┐
│ Alice's Browser │ │ Bob's Browser │
├─────────────────────────────┤ ├─────────────────────────────┤
│ │ │ │
│ [1]: x = 10 🟢Alice │ │ [1]: x = 10 │
│ y = 20 🔵Bob← │ │ y = 20 🔵Bob← │
│ │ │ │
│ [2]: # Analysis 🟢Alice← │ │ [2]: # Analysis 🟢Alice← │
│ Exploring... │ │ Exploring... │
│ │ │ │
│ ───────────────────────── │ │ ───────────────────────── │
│ 🟢 Alice (you) │ │ 🟢 Alice │
│ 🔵 Bob │ │ 🔵 Bob (you) │
│ │ │ │
└─────────────────────────────┘ └─────────────────────────────┘
↑ ↑
└────────────┬────────────────────┘
│
┌───────┴───────┐
│ Y.js WebSocket │
│ Server │
│ (CRDTs) │
└───────────────┘
Changes sync instantly:
- Alice types → Bob sees immediately
- Cursors show where each user is
- Conflicts resolved automatically
Implementation Hints:
JupyterHub setup:
# Install JupyterHub
pip install jupyterhub
npm install -g configurable-http-proxy
# Generate config
jupyterhub --generate-config
# Install collaboration extension
pip install jupyter-collaboration
# Start hub
jupyterhub
Understanding CRDTs:
Traditional sync: Client → Server → Conflict → Manual resolution
CRDT sync:
Client A → [State A]
Client B → [State B]
Merge → [State A ∪ B] (automatically consistent!)
Y.js implements:
- Y.Text: Collaborative text editing
- Y.Array: Collaborative lists
- Y.Map: Collaborative key-value
Notebook cells become Y.Array of Y.Map objects!
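The production implementation is Y.js, but the core CRDT idea fits in a toy example: a grow-only counter where each client increments only its own slot and merging takes element-wise maxima, so replicas converge regardless of message order (an illustration only, not how Y.js models text):
class GCounter:
    """Toy grow-only counter CRDT: one slot per client, merge = element-wise max."""
    def __init__(self, client_id):
        self.client_id = client_id
        self.counts = {}                       # client_id -> that client's local count

    def increment(self):
        self.counts[self.client_id] = self.counts.get(self.client_id, 0) + 1

    def merge(self, other):
        for cid, n in other.counts.items():
            self.counts[cid] = max(self.counts.get(cid, 0), n)

    def value(self):
        return sum(self.counts.values())

alice, bob = GCounter('alice'), GCounter('bob')
alice.increment(); alice.increment()           # Alice edits twice, maybe offline
bob.increment()                                # Bob edits once

alice.merge(bob); bob.merge(alice)             # sync in any order, any number of times
assert alice.value() == bob.value() == 3       # both replicas converge automatically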
Key architecture:
┌──────────┐ ┌──────────┐
│ Client A │ │ Client B │
└────┬─────┘ └────┬─────┘
│ │
│ WebSocket │
▼ ▼
┌──────────────────────────┐
│ Y.js Provider │
│ (Awareness + Sync) │
├──────────────────────────┤
│ JupyterHub Server │
│ (Auth + Routing) │
└──────────────────────────┘
Questions to explore:
- What happens if two users edit the same cell?
- How are cursors transmitted between clients?
- What happens when a user goes offline and comes back?
- How is the notebook persisted to disk?
Learning milestones:
- Deploy JupyterHub → Multi-user setup
- Enable collaboration → RTC extension
- Understand CRDTs → How conflicts are resolved
- Observe sync → Debug WebSocket messages
Project 11: Notebook Testing Framework
- File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: None
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Testing / Quality Assurance
- Software or Tool: nbval, pytest, testbook
- Main Book: “Python Testing with pytest” by Brian Okken
What you’ll build: A testing framework for notebooks that validates outputs, checks for errors, and integrates with CI/CD pipelines—treating notebooks as testable artifacts.
Why it teaches Jupyter: Notebooks are often criticized for being “untestable.” This project teaches you to treat notebooks as first-class tested artifacts, essential for production notebook workflows.
Core challenges you’ll face:
- Output validation → maps to nbval expected outputs
- Cell execution testing → maps to testbook fixtures
- CI integration → maps to GitHub Actions, GitLab CI
- Regression testing → maps to detecting output changes
Resources for key challenges:
Key Concepts:
- Notebook Validation: nbval documentation
- Unit Testing Cells: testbook user guide
- CI/CD for Notebooks: GitHub Actions documentation
- Property-Based Testing: Hypothesis documentation
Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Projects 1-3, pytest experience
Real world outcome:
# Test file: test_analysis.py
from testbook import testbook
@testbook('analysis.ipynb', execute=True)
def test_data_loading(tb):
    """Test that data loads correctly."""
    df = tb.ref('df')  # Get variable from notebook
    assert len(df) > 0
    assert 'sales' in df.columns

@testbook('analysis.ipynb')
def test_specific_cell(tb):
    """Test a specific cell's output."""
    tb.execute_cell('data_cleaning')  # Execute cell by tag
    result = tb.cell_output_text('data_cleaning')
    assert 'cleaned' in result.lower()

@testbook('analysis.ipynb')
def test_visualization(tb):
    """Test that visualization produces output."""
    tb.execute_cell('plot')
    assert tb.cell_output_type('plot') == 'display_data'
# Run with pytest:
$ pytest test_analysis.py -v
test_analysis.py::test_data_loading PASSED
test_analysis.py::test_specific_cell PASSED
test_analysis.py::test_visualization PASSED
# Or use nbval for output comparison:
$ pytest --nbval analysis.ipynb
analysis.ipynb::Cell 0 PASSED
analysis.ipynb::Cell 1 PASSED
analysis.ipynb::Cell 2 PASSED
...
# GitHub Actions workflow:
name: Test Notebooks
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
      - run: pip install pytest testbook nbval
      - run: pytest --nbval notebooks/
Implementation Hints:
nbval (output validation):
# Install
pip install nbval
# Run - checks that saved outputs match re-execution
pytest --nbval my_notebook.ipynb
# Sanitize outputs (ignore variable parts like timestamps)
pytest --nbval --nbval-sanitize-with sanitize.cfg my_notebook.ipynb
# sanitize.cfg:
# [regex]
# regex: \d{4}-\d{2}-\d{2}
# replace: DATE
testbook (unit testing):
from testbook import testbook
@testbook('notebook.ipynb')
def test_function(tb):
    # Execute all cells
    tb.execute()

    # Or execute specific cells by index
    tb.execute_cell([0, 1, 2])

    # Or by tag
    tb.execute_cell('setup')

    # Get variables
    my_var = tb.ref('my_var')

    # Inject code
    tb.inject("""
    test_input = [1, 2, 3]
    """)

    # Call functions defined in notebook
    func = tb.ref('my_function')
    result = func(test_input)
    assert result == expected
CI/CD best practices:
- Execute notebooks from scratch → No stale outputs (see the sketch after this list)
- Test with different data → Parameterized tests
- Check execution time → Performance regression
- Validate markdown → No broken links
- Lint notebooks → nbqa, pre-commit
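The first practice above can also be done without pytest, by executing the notebook end to end with nbconvert's ExecutePreprocessor as a smoke test; a minimal sketch:
import nbformat
from nbconvert.preprocessors import ExecutePreprocessor

nb = nbformat.read('analysis.ipynb', as_version=4)

# Run every cell top to bottom in a fresh kernel; a failing cell raises CellExecutionError
ep = ExecutePreprocessor(timeout=600, kernel_name='python3')
ep.preprocess(nb, {'metadata': {'path': '.'}})

# Optionally save the freshly executed copy so no stale outputs survive
nbformat.write(nb, 'analysis_executed.ipynb')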
Learning milestones:
- Validate outputs → nbval basics
- Unit test cells → testbook framework
- Set up CI → GitHub Actions
- Test data variations → Parameterized testing
Project 12: GPU-Accelerated Data Science
- File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: None (CUDA-specific)
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: High Performance Computing / GPU Programming
- Software or Tool: RAPIDS (cuDF, cuML), Google Colab
- Main Book: “Hands-On GPU Computing with Python” by Avimanyu Bandyopadhyay
What you’ll build: A notebook workflow that processes large datasets on GPUs using RAPIDS (cuDF, cuML), demonstrating 10-100x speedups over CPU-based Pandas/Scikit-learn.
Why it teaches Jupyter: Data science workloads are increasingly GPU-accelerated. Notebooks are the primary interface for GPU data science via RAPIDS and Google Colab. This project teaches you to leverage GPUs interactively.
Core challenges you’ll face:
- GPU memory management → maps to understanding VRAM limits
- cuDF vs Pandas → maps to API differences
- cuML vs Scikit-learn → maps to GPU-accelerated ML
- Colab/cloud GPUs → maps to accessing GPU resources
Resources for key challenges:
Key Concepts:
- cuDF Basics: RAPIDS cuDF User Guide
- GPU Memory: CUDA Programming Guide
- cuML Algorithms: RAPIDS cuML documentation
- Dask Integration: Dask-cuDF for larger-than-GPU data
Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 1-3, understanding of GPU concepts
Real world outcome:
# Google Colab or local GPU environment
# Cell 1: Install RAPIDS
!pip install cudf-cu12 cuml-cu12 --extra-index-url=https://pypi.nvidia.com
# Cell 2: Compare Pandas vs cuDF
import pandas as pd
import cudf
import time
# Load 10 million rows
pandas_df = pd.read_csv('large_data.csv') # ~30 seconds
cudf_df = cudf.read_csv('large_data.csv') # ~2 seconds
# Benchmark operations
# Pandas
start = time.time()
pandas_df.groupby('category').agg({'value': 'mean'})
print(f"Pandas: {time.time() - start:.2f}s") # ~5 seconds
# cuDF (GPU)
start = time.time()
cudf_df.groupby('category').agg({'value': 'mean'})
print(f"cuDF: {time.time() - start:.2f}s") # ~0.05 seconds
# 100x speedup!
# Cell 3: GPU Machine Learning
from cuml import RandomForestClassifier as cuRF
from sklearn.ensemble import RandomForestClassifier as skRF
# Scikit-learn (CPU)
sk_model = skRF(n_estimators=100)
%time sk_model.fit(X_train, y_train) # ~2 minutes
# cuML (GPU)
cu_model = cuRF(n_estimators=100)
%time cu_model.fit(X_train, y_train) # ~5 seconds
# 24x speedup!
Implementation Hints:
Setting up GPU environment:
# Check GPU availability
!nvidia-smi
# Google Colab: Runtime → Change runtime type → GPU
# Local: Install RAPIDS (Linux only, or WSL2)
conda install -c rapidsai -c conda-forge -c nvidia \
rapids=24.02 python=3.10 cuda-version=12.0
cuDF API (mostly Pandas-compatible):
import cudf
# Read data
gdf = cudf.read_csv('data.csv')
gdf = cudf.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
# Operations (same as Pandas)
gdf['c'] = gdf['a'] + gdf['b']
grouped = gdf.groupby('a').mean()
filtered = gdf[gdf['a'] > 1]
# Convert to/from Pandas
pandas_df = gdf.to_pandas()
gdf = cudf.from_pandas(pandas_df)
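If the same notebook must also run on machines without a GPU, a common pattern is to fall back to pandas when cuDF is missing (a sketch; it works only as long as you stay on the part of the pandas API that cuDF mirrors):
try:
    import cudf as xdf           # GPU-backed DataFrames when RAPIDS is installed
    ON_GPU = True
except ImportError:
    import pandas as xdf         # CPU fallback with a (mostly) matching API
    ON_GPU = False

df = xdf.read_csv('data.csv')
summary = df.groupby('category').agg({'value': 'mean'})
print(f"Backend: {'cuDF (GPU)' if ON_GPU else 'pandas (CPU)'}, rows: {len(df)}")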
GPU memory management:
import rmm # RAPIDS Memory Manager
# Check memory
!nvidia-smi --query-gpu=memory.used --format=csv
# Clear GPU memory
import gc
gc.collect()
rmm.reinitialize(pool_allocator=True)
Learning milestones:
- Access GPU → Colab or local setup
- Use cuDF → GPU-accelerated DataFrames
- Compare performance → Benchmark CPU vs GPU
- Train ML on GPU → cuML algorithms
Project 13: Notebook-to-Production Pipeline
- File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: None
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 3. The “Service & Support” Model
- Difficulty: Level 3: Advanced
- Knowledge Area: MLOps / Software Engineering
- Software or Tool: nbdev, Jupyter, pytest
- Main Book: “Building Machine Learning Pipelines” by Hannes Hapke
What you’ll build: A workflow that converts exploratory notebooks into production Python packages—extracting functions, adding tests, generating documentation—using nbdev or manual refactoring patterns.
Why it teaches Jupyter: The biggest criticism of notebooks is “notebook spaghetti”—code that can’t be productionized. This project teaches the discipline of writing production-ready code in notebooks using nbdev’s literate programming approach.
Core challenges you’ll face:
- Code extraction → maps to nbdev export, manual refactoring
- Test generation → maps to cells as tests
- Documentation → maps to docstrings, quarto
- Packaging → maps to setup.py, pyproject.toml
Resources for key challenges:
Key Concepts:
- Literate Programming: nbdev philosophy
- Python Packaging: Python Packaging User Guide
- Documentation Generation: Quarto documentation
- CI/CD for Packages: GitHub Actions
Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 1-8, Python packaging knowledge
Real world outcome:
Your notebook-based development:
notebooks/
├── 00_core.ipynb # Core library code
├── 01_data.ipynb # Data loading utilities
├── 02_models.ipynb # Model definitions
├── 03_training.ipynb # Training pipeline
└── index.ipynb # Documentation homepage
↓ nbdev_export ↓
my_package/
├── __init__.py
├── core.py # Extracted from 00_core.ipynb
├── data.py # Extracted from 01_data.ipynb
├── models.py # Extracted from 02_models.ipynb
└── training.py # Extracted from 03_training.ipynb
tests/
├── test_core.py # From test cells in 00_core.ipynb
└── test_data.py # From test cells in 01_data.ipynb
docs/
├── index.html # Generated from notebooks
├── core.html
└── ...
# Install as package
pip install -e .
# Use in production
from my_package import train_model
train_model(config)
Implementation Hints:
nbdev workflow:
# In notebook cell, mark for export:
#| export
def my_function(x):
    """This function will be exported."""
    return x * 2
# Mark as test:
#| test
assert my_function(2) == 4
# Mark as documentation only (not exported):
#| echo: false
# This is explanation...
nbdev commands:
# Initialize nbdev
nbdev_new
# Export to Python modules
nbdev_export
# Run tests
nbdev_test
# Build documentation
nbdev_docs
# Prepare for release
nbdev_prepare
Manual refactoring pattern:
# 1. Identify reusable code in notebook
# 2. Extract to functions with docstrings
# 3. Move to .py files
# 4. Import in notebook for testing
# 5. Add unit tests
# 6. Package with setup.py
# Notebook becomes integration test:
from my_package import process_data, train_model
# Interactive development with imported code
df = process_data('data.csv')
model = train_model(df)
Learning milestones:
- Set up nbdev → Initialize project
- Export code → Cells to modules
- Generate tests → Test cells
- Build documentation → Quarto output
- Publish package → PyPI release
Project 14: Complete Data Science Platform
- File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
- Main Programming Language: Python
- Alternative Programming Languages: R, SQL
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 4. The “Open Core” Infrastructure
- Difficulty: Level 5: Master
- Knowledge Area: Platform Engineering / Full Stack
- Software or Tool: JupyterHub, Kubernetes, All previous projects
- Main Book: “Kubernetes in Action” by Marko Lukša
What you’ll build: A complete data science platform combining JupyterHub (multi-user), MLflow (experiments), Voilà (dashboards), and orchestration (Kubernetes/Docker)—a mini enterprise data science platform.
Why it teaches Jupyter: This capstone project integrates everything: notebooks for development, dashboards for stakeholders, experiment tracking for ML, and scalable infrastructure for teams.
Core challenges you’ll face:
- JupyterHub on Kubernetes → maps to Zero to JupyterHub
- Shared storage → maps to NFS, S3
- User management → maps to OAuth, LDAP
- Service integration → maps to MLflow, databases
Time estimate: 2-3 months Prerequisites: All previous projects, Kubernetes basics
Real world outcome:
Your platform architecture:
┌─────────────────────────────────────────────────────────────────────┐
│ DATA SCIENCE PLATFORM │
├─────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ User A │ │ User B │ │ User C │ │
│ │ (Notebook) │ │ (Notebook) │ │ (Notebook) │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ └─────────────────┼─────────────────┘ │
│ │ │
│ ┌──────┴──────┐ │
│ │ JupyterHub │ │
│ │ (Auth) │ │
│ └──────┬──────┘ │
│ │ │
│ ┌─────────────────┼─────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ MLflow │ │ Voilà │ │ Shared │ │
│ │ (Experiments)│ │ (Dashboards) │ │ Storage │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ │
│ │
│ Infrastructure: Kubernetes / Docker Compose │
│ │
└─────────────────────────────────────────────────────────────────────┘
Users experience:
1. Login with company credentials (OAuth)
2. Spawn personal Jupyter environment
3. Access shared data on NFS/S3
4. Track experiments in MLflow
5. Deploy dashboards with Voilà
6. Collaborate in real-time
Implementation Hints:
Docker Compose (development):
# docker-compose.yml
version: '3'
services:
  jupyterhub:
    image: jupyterhub/jupyterhub
    ports:
      - "8000:8000"
    volumes:
      - ./jupyterhub_config.py:/srv/jupyterhub/jupyterhub_config.py
      - ./data:/data
    environment:
      - DOCKER_NETWORK_NAME=ds-platform

  mlflow:
    image: ghcr.io/mlflow/mlflow
    ports:
      - "5000:5000"
    command: mlflow server --host 0.0.0.0
    volumes:
      - ./mlruns:/mlruns

  voila:
    image: voila/voila
    ports:
      - "8866:8866"
    volumes:
      - ./dashboards:/dashboards
    command: voila --no-browser /dashboards
Kubernetes (production):
# Use Zero to JupyterHub
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm install jhub jupyterhub/jupyterhub --values config.yaml
Integration:
# In notebook, connect to platform services
# MLflow
import mlflow
mlflow.set_tracking_uri("http://mlflow:5000")
# Shared storage
import s3fs
fs = s3fs.S3FileSystem(anon=False)
df = pd.read_csv(fs.open('s3://shared-data/dataset.csv'))
# Database
from sqlalchemy import create_engine
engine = create_engine('postgresql://db:5432/analytics')
Learning milestones:
- Deploy JupyterHub → Multi-user Jupyter
- Add MLflow → Experiment tracking
- Configure storage → Shared data access
- Set up auth → OAuth/LDAP
- Add Voilà → Dashboard deployment
Project Comparison Table
| # | Project | Difficulty | Time | Key Skill | Fun |
|---|---|---|---|---|---|
| 1 | Interactive Data Explorer | ⭐ | Weekend | Cell Execution | ⭐⭐⭐ |
| 2 | Literate Programming Tutorial | ⭐ | 3-5 days | Markdown/LaTeX | ⭐⭐⭐ |
| 3 | Reproducible Research | ⭐⭐ | 1 week | Environment Management | ⭐⭐⭐ |
| 4 | Interactive Dashboard | ⭐⭐ | 1-2 weeks | ipywidgets | ⭐⭐⭐⭐ |
| 5 | Kernel Architecture | ⭐⭐⭐ | 2-3 weeks | ZeroMQ/IPC | ⭐⭐⭐⭐ |
| 6 | Multi-Language Notebook | ⭐⭐ | 1-2 weeks | Polyglot Programming | ⭐⭐⭐⭐ |
| 7 | ML Experiment Tracker | ⭐⭐ | 1-2 weeks | MLflow/W&B | ⭐⭐⭐⭐ |
| 8 | Automated Report Generator | ⭐⭐ | 1 week | Papermill | ⭐⭐⭐ |
| 9 | JupyterLab Extension | ⭐⭐⭐⭐ | 3-4 weeks | TypeScript/React | ⭐⭐⭐⭐ |
| 10 | Real-Time Collaboration | ⭐⭐⭐⭐ | 4-6 weeks | CRDTs/WebSocket | ⭐⭐⭐⭐⭐ |
| 11 | Notebook Testing | ⭐⭐ | 1-2 weeks | pytest/nbval | ⭐⭐⭐ |
| 12 | GPU-Accelerated DS | ⭐⭐⭐ | 2-3 weeks | RAPIDS/cuDF | ⭐⭐⭐⭐ |
| 13 | Notebook-to-Production | ⭐⭐⭐ | 2-3 weeks | nbdev | ⭐⭐⭐⭐ |
| 14 | Complete DS Platform | ⭐⭐⭐⭐⭐ | 2-3 months | Full Stack | ⭐⭐⭐⭐⭐ |
Recommended Learning Path
Phase 1: Fundamentals (1-2 weeks)
Understand why notebooks exist and basic usage:
- Project 1: Interactive Data Explorer - Core notebook skills
- Project 2: Literate Programming Tutorial - Documentation features
Phase 2: Professional Usage (2-3 weeks)
Learn to use notebooks professionally:
- Project 3: Reproducible Research - Environment management
- Project 4: Interactive Dashboard - Widgets and interactivity
- Project 8: Automated Report Generator - Parameterized notebooks
Phase 3: Data Science Workflows (2-3 weeks)
Apply notebooks to data science:
- Project 6: Multi-Language Notebook - Polyglot programming
- Project 7: ML Experiment Tracker - MLOps integration
- Project 11: Notebook Testing - Quality assurance
Phase 4: Advanced Architecture (3-4 weeks)
Understand how notebooks work:
- Project 5: Kernel Architecture - Under the hood
- Project 9: JupyterLab Extension - Extending Jupyter
Phase 5: Production & Scale (4-8 weeks)
Enterprise-grade notebook workflows:
- Project 12: GPU-Accelerated DS - High-performance computing
- Project 13: Notebook-to-Production - Code extraction
- Project 10: Real-Time Collaboration - Multi-user editing
- Project 14: Complete DS Platform - Full infrastructure
Final Project: End-to-End Data Science Workflow
Following the same pattern as above, this capstone applies everything:
What you’ll build: A complete end-to-end data science workflow in notebooks:
- Data ingestion from multiple sources (APIs, databases, files)
- Exploratory data analysis with rich visualizations
- Feature engineering with documentation
- Model development with experiment tracking
- Interactive dashboard for stakeholders
- Automated reporting pipeline
- Production package extraction
- CI/CD for the entire workflow
This final project demonstrates:
- When to use notebooks vs. pure code
- Professional notebook organization
- Reproducibility at every step
- From exploration to production
Summary
| # | Project | Main Language |
|---|---|---|
| 1 | Interactive Data Explorer | Python |
| 2 | Literate Programming Tutorial | Python/Markdown |
| 3 | Reproducible Research Document | Python |
| 4 | Interactive Visualization Dashboard | Python |
| 5 | Kernel Architecture Deep Dive | Python |
| 6 | Multi-Language Notebook | Python/R/Julia |
| 7 | Machine Learning Experiment Tracker | Python |
| 8 | Automated Report Generator | Python |
| 9 | JupyterLab Extension | TypeScript |
| 10 | Real-Time Collaborative Notebook | Python/TypeScript |
| 11 | Notebook Testing Framework | Python |
| 12 | GPU-Accelerated Data Science | Python |
| 13 | Notebook-to-Production Pipeline | Python |
| 14 | Complete Data Science Platform | Python |
Resources
Essential Books
- “Python for Data Analysis” by Wes McKinney - Pandas creator, covers notebooks
- “Hands-On Machine Learning” by Aurélien Géron - Uses notebooks throughout
- “Data Science at the Command Line” by Jeroen Janssens - Alternative perspective
Tools
- Jupyter Notebook: https://jupyter.org/ - Classic interface
- JupyterLab: https://jupyterlab.readthedocs.io/ - Modern interface
- Google Colab: https://colab.research.google.com/ - Free GPU notebooks
- VSCode Jupyter: https://code.visualstudio.com/docs/datascience/jupyter-notebooks
- nbdev: https://nbdev.fast.ai/ - Notebooks to packages
- Voilà: https://voila.readthedocs.io/ - Notebooks to dashboards
Documentation
- Jupyter Documentation: https://docs.jupyter.org/ - Core user and project docs
- IPython: https://ipython.readthedocs.io/ - Magics and the interactive shell
- ipywidgets: https://ipywidgets.readthedocs.io/ - Interactive widgets
- nbconvert: https://nbconvert.readthedocs.io/ - Exporting and executing notebooks
- Papermill: https://papermill.readthedocs.io/ - Parameterized execution
- JupyterHub: https://jupyterhub.readthedocs.io/ - Multi-user deployments
Practice Platforms
- Kaggle Notebooks: https://www.kaggle.com/code - Data science competitions
- Observable: https://observablehq.com/ - JavaScript notebooks
- Deepnote: https://deepnote.com/ - Collaborative notebooks
Total Estimated Time: 4-6 months of dedicated study
After completion: You’ll understand exactly why notebooks exist, when to use them (and when not to), how to build professional data science workflows, and how the entire Jupyter architecture works from browser to kernel. You’ll be able to build interactive dashboards, automate reports, track experiments, and even extend JupyterLab with custom functionality.