← Back to all projects

LEARN JUPYTER NOTEBOOKS DEEP DIVE

Learn Jupyter Notebooks: From Zero to Interactive Computing Master

Goal: Deeply understand Jupyter Notebooks—from basic usage and why they exist, to building interactive data science workflows, visualization dashboards, reproducible research documents, and understanding the underlying kernel architecture.


Why Learn Jupyter Notebooks?

Jupyter Notebooks represent a paradigm shift in how we write, test, and share code. Unlike traditional code files (.py, .js, .c), Jupyter Notebooks are interactive documents that blend code, output, visualizations, and narrative text in a single shareable artifact.

The Problem with Pure Code Files

Traditional programming workflow:

Write code → Run entire file → See output → Modify code → Run again

This creates friction for:

  • Exploration: You want to test one idea quickly, but must run the whole file
  • Visualization: Plots appear in separate windows, not alongside your code
  • Documentation: Comments are separate from rendered explanations
  • Sharing: Colleagues see code, but not the results unless they run it themselves
  • Teaching: Students can’t see the thought process, only the final result

The Notebook Solution

Jupyter Notebooks workflow:

Write cell → Run cell → See output immediately → Continue experimenting → Share document with code AND results

Why People Choose Notebooks Over Pure Code

Aspect           Pure Code Files            Jupyter Notebooks
Execution        Run entire file            Run cells individually
Feedback         Output in terminal         Output inline with code
Visualization    Separate windows           Embedded in document
Documentation    Comments only              Markdown, LaTeX, images
Sharing          Code only                  Code + outputs + narrative
Exploration      Restart for each change    Iterate on state
Reproducibility  Re-run everything          Clear + Run All

Who Uses Jupyter Notebooks?

  • Data Scientists: The large majority use notebooks for exploratory data analysis
  • Machine Learning Engineers: Model prototyping and experimentation
  • Scientists/Researchers: Reproducible research documents
  • Educators: Interactive teaching materials
  • Analysts: Reports that combine code, data, and insights
  • Engineers: API testing, prototyping, documentation

After completing these projects, you will:

  • Understand why notebooks exist and when to use them
  • Master interactive data exploration and visualization
  • Build reproducible research documents
  • Create interactive dashboards and widgets
  • Understand the kernel architecture (how code actually runs)
  • Know when notebooks are NOT the right tool
  • Share your work professionally

Core Concept Analysis

The Notebook Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                         BROWSER / JUPYTER LAB                            │
│                                                                          │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                      Notebook Interface                           │   │
│  │                                                                   │   │
│  │  ┌─────────────────────────────────────────────────────────────┐ │   │
│  │  │ [Markdown Cell]                                             │ │   │
│  │  │ # My Analysis                                               │ │   │
│  │  │ This notebook explores...                                   │ │   │
│  │  └─────────────────────────────────────────────────────────────┘ │   │
│  │  ┌─────────────────────────────────────────────────────────────┐ │   │
│  │  │ [Code Cell]                                       [Run ▶]   │ │   │
│  │  │ import pandas as pd                                         │ │   │
│  │  │ df = pd.read_csv('data.csv')                               │ │   │
│  │  │ df.head()                                                   │ │   │
│  │  └─────────────────────────────────────────────────────────────┘ │   │
│  │  ┌─────────────────────────────────────────────────────────────┐ │   │
│  │  │ [Output]                                                    │ │   │
│  │  │    name    age    salary                                    │ │   │
│  │  │ 0  Alice   28     50000                                     │ │   │
│  │  │ 1  Bob     35     65000                                     │ │   │
│  │  │ 2  Carol   42     75000                                     │ │   │
│  │  └─────────────────────────────────────────────────────────────┘ │   │
│  └──────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    │ WebSocket (browser ↔ server) / ZMQ (server ↔ kernel)
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                           JUPYTER SERVER                                 │
│                                                                          │
│   ┌────────────────┐    ┌────────────────┐    ┌────────────────┐       │
│   │  Python Kernel │    │   R Kernel     │    │  Julia Kernel  │       │
│   │                │    │                │    │                │       │
│   │  Executes code │    │  Executes code │    │  Executes code │       │
│   │  Maintains     │    │  Maintains     │    │  Maintains     │       │
│   │  state (vars)  │    │  state (vars)  │    │  state (vars)  │       │
│   └────────────────┘    └────────────────┘    └────────────────┘       │
└─────────────────────────────────────────────────────────────────────────┘

Key Concepts Explained

1. Cells: The Building Blocks

┌─────────────────────────────────────────────────────────────────────┐
│                           CELL TYPES                                 │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  CODE CELL                    │  MARKDOWN CELL                      │
│  ─────────────────           │  ───────────────────                 │
│  Contains executable code     │  Contains formatted text            │
│  Runs in the kernel          │  Rendered as HTML                   │
│  Shows output below          │  Supports LaTeX math                │
│                              │  Supports images, links             │
│  Example:                    │  Example:                           │
│  x = 10                      │  # Section Title                    │
│  print(x * 2)                │  This is **bold** text             │
│  → 20                        │  Formula: $E = mc^2$               │
│                              │                                      │
└─────────────────────────────────────────────────────────────────────┘

2. Kernel: The Execution Engine

The kernel is a separate process that executes your code. Key properties:

┌─────────────────────────────────────────────────────────────────────┐
│                           KERNEL STATE                               │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  # Cell 1: Executed first                                           │
│  x = 10                                                             │
│                                                                      │
│  # Cell 2: Executed second                                          │
│  y = x + 5  # x is still in memory!                                │
│                                                                      │
│  # Cell 3: Executed third                                           │
│  print(y)  → 15                                                     │
│                                                                      │
│  ═══════════════════════════════════════════════════════════════   │
│                                                                      │
│  KERNEL MEMORY:                                                      │
│  ┌──────────────────────────────────────────────────────────────┐   │
│  │  x → 10                                                       │   │
│  │  y → 15                                                       │   │
│  │  (All variables persist until kernel restart)                │   │
│  └──────────────────────────────────────────────────────────────┘   │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Critical Understanding: Execution order matters, not cell position!

Cell [1]: x = 5      # Executed 1st
Cell [3]: z = x + y  # Executed 3rd (uses values from 1st and 2nd)
Cell [2]: y = 10     # Executed 2nd

The numbers in brackets show execution order, not position.

3. The .ipynb File Format

Notebooks are stored as JSON files:

{
  "cells": [
    {
      "cell_type": "markdown",
      "source": ["# My Notebook\n", "This is text"]
    },
    {
      "cell_type": "code",
      "source": ["x = 10\n", "print(x)"],
      "outputs": [
        {
          "output_type": "stream",
          "text": ["10\n"]
        }
      ],
      "execution_count": 1
    }
  ],
  "metadata": {
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  }
}

This format enables:

  • Git diff: See what changed (though it’s noisy)
  • Output storage: Results saved with code
  • Metadata: Kernel info, widgets state
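
Because a notebook is plain JSON, it can also be inspected or edited programmatically. A minimal sketch using the nbformat library (the file name is a placeholder):

import nbformat

# Read a notebook and summarize its cells
nb = nbformat.read('analysis.ipynb', as_version=4)
for i, cell in enumerate(nb.cells):
    first_line = cell.source.splitlines()[0] if cell.source else ''
    print(i, cell.cell_type, repr(first_line))

# The kernelspec metadata shown in the JSON above is available as well
print(nb.metadata.get('kernelspec', {}))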

4. Jupyter Ecosystem

┌─────────────────────────────────────────────────────────────────────┐
│                       JUPYTER ECOSYSTEM                              │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  INTERFACES:                                                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                  │
│  │  Jupyter    │  │  JupyterLab │  │   VSCode    │                  │
│  │  Notebook   │  │  (Modern)   │  │  Extension  │                  │
│  │  (Classic)  │  │             │  │             │                  │
│  └─────────────┘  └─────────────┘  └─────────────┘                  │
│                                                                      │
│  KERNELS:                                                            │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                  │
│  │   Python    │  │      R      │  │    Julia    │                  │
│  │  (ipykernel)│  │ (IRkernel)  │  │  (IJulia)   │                  │
│  └─────────────┘  └─────────────┘  └─────────────┘                  │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                  │
│  │   Scala     │  │   JavaScript│  │   C++       │                  │
│  │ (Almond)    │  │   (IJavaSc) │  │  (xeus-cling│                  │
│  └─────────────┘  └─────────────┘  └─────────────┘                  │
│                                                                      │
│  EXTENSIONS:                                                         │
│  • nbextensions (Notebook)    • jupyterlab-git                      │
│  • widgets (ipywidgets)       • jupyterlab-code-formatter          │
│  • voila (dashboards)         • variable-inspector                  │
│                                                                      │
│  CLOUD PLATFORMS:                                                    │
│  • Google Colab               • Kaggle Notebooks                    │
│  • AWS SageMaker              • Azure Notebooks                     │
│  • Binder                     • Deepnote                            │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

5. When NOT to Use Notebooks

Notebooks are powerful but not always appropriate:

Use Notebooks For      Use Pure Code For
Exploration            Production systems
Prototyping            Libraries/packages
Teaching               CLI tools
Reports                Large applications
Visualization          Version-controlled code
Quick experiments      Team collaboration

Project List

The following 14 projects will teach you Jupyter Notebooks from basics to advanced interactive computing.


Project 1: Interactive Data Explorer

  • File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: R, Julia
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Data Exploration / Pandas
  • Software or Tool: Jupyter Notebook, Pandas, Matplotlib
  • Main Book: “Python for Data Analysis” by Wes McKinney

What you’ll build: An interactive data exploration notebook that loads a dataset (CSV, JSON, or Excel), performs summary statistics, handles missing values, creates visualizations, and documents findings—all in one shareable document.

Why it teaches Jupyter: This is the core use case for notebooks. You’ll immediately understand why data scientists prefer notebooks over plain Python scripts: see your data transformations instantly, iterate on visualizations, and create a document that tells a story.

Core challenges you’ll face:

  • Cell execution order → maps to understanding kernel state and variable persistence
  • Inline visualization → maps to matplotlib integration with %matplotlib inline
  • DataFrame display → maps to rich output representation
  • Mixing code and narrative → maps to markdown cells for documentation

Resources for key challenges:

Key Concepts:

  • Cell Execution: Jupyter Documentation - Running Code
  • Magic Commands: IPython Documentation - Built-in Magics
  • Inline Plotting: Matplotlib Documentation - Interactive Mode
  • DataFrame Display: Pandas Documentation - Options and Settings

Difficulty: Beginner Time estimate: Weekend Prerequisites: Basic Python, understanding of CSV files

Real world outcome:

Your notebook will display:

# Sales Data Analysis - Q4 2024

## 1. Data Loading
[Code] df = pd.read_csv('sales.csv')
       df.head()

[Output] Beautiful formatted table showing first 5 rows

## 2. Summary Statistics
[Code] df.describe()

[Output] Count, mean, std, min, max for all numeric columns

## 3. Missing Values Analysis
[Code] df.isnull().sum()

[Output] Column-by-column missing value counts

## 4. Visualization
[Code] plt.figure(figsize=(12, 6))
       df.plot(kind='bar')

[Output] Bar chart embedded directly in the notebook

## 5. Key Findings
[Markdown] Based on the analysis:
- Sales peaked in November
- Region X underperformed by 15%
- 3 outliers identified for investigation

Implementation Hints:

Start by understanding the notebook interface:

  1. Create a new notebook (New → Python 3)
  2. First cell: Import libraries with %matplotlib inline magic
  3. Second cell: Load data with pandas
  4. Use Shift+Enter to run cells and move to next
  5. Use markdown cells (change cell type) for documentation

Key questions to explore:

  • What happens if you run cell 3 before cell 2?
  • What is the [*] that appears while a cell is running?
  • How do you restart the kernel and clear all outputs?
  • What’s the difference between df (displays nicely) and print(df) (plain text)?

Magic commands to learn:

%matplotlib inline      # Plots appear in notebook
%timeit expression      # Time how long code takes
%who                    # List all variables
%reset                  # Clear all variables
%%time                  # Time the entire cell
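
Putting the first steps and a couple of these magics together, the opening cells of the explorer notebook might look like this sketch (the file and column names are placeholders):

# Cell 1: imports and inline plotting
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt

# Cell 2: load the data
df = pd.read_csv('sales.csv')
df.head()

# Cell 3: quick look at types and missing values
df.info()
df.isnull().sum()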

Learning milestones:

  1. Run your first cell → Understand cell execution
  2. Create inline plot → See why notebooks beat scripts
  3. Mix code and markdown → Build a narrative document
  4. Share the .ipynb file → Others see your code AND results

Project 2: Literate Programming Tutorial

  • File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
  • Main Programming Language: Python (with Markdown)
  • Alternative Programming Languages: Any with Jupyter kernel
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Documentation / Technical Writing
  • Software or Tool: Jupyter Notebook, nbconvert
  • Main Book: “The Pragmatic Programmer” by Hunt and Thomas

What you’ll build: A complete programming tutorial that teaches a concept (like sorting algorithms or regex) using a notebook—combining explanations, code examples, exercises, and solutions.

Why it teaches Jupyter: Notebooks originated from the idea of “literate programming” (Donald Knuth): code should be embedded in documentation, not documentation embedded in code. Building a tutorial forces you to use all of Jupyter’s documentation features.

Core challenges you’ll face:

  • Markdown mastery → maps to headers, lists, emphasis, code blocks
  • LaTeX for math → maps to inline and block equations
  • Exercise design → maps to code cells for practice
  • Export to HTML/PDF → maps to nbconvert for sharing

Resources for key challenges:

Key Concepts:

  • Literate Programming: “Literate Programming” by Donald Knuth
  • Markdown Syntax: GitHub Markdown Guide
  • LaTeX Math: Overleaf LaTeX Documentation
  • nbconvert: Jupyter Documentation - Converting Notebooks

Difficulty: Beginner Time estimate: 3-5 days Prerequisites: Project 1 (Interactive Data Explorer)

Real world outcome:

Your tutorial notebook will contain:

# Understanding Sorting Algorithms

## Introduction
Sorting is fundamental to computer science. We'll explore three algorithms:
- Bubble Sort: O(n²) - Simple but slow
- Merge Sort: O(n log n) - Divide and conquer
- Quick Sort: O(n log n) average - Fast in practice

## Big O Notation
The time complexity is expressed as:

$$T(n) = O(f(n))$$

Where $f(n)$ describes how time grows with input size $n$.

## Bubble Sort
### Theory
Compare adjacent elements and swap if out of order...

### Implementation
[Code Cell - Students fill in]

### Visualization
[Code Cell - Animated sorting visualization]

## Exercises
1. Implement bubble sort
2. Count comparisons
3. Compare with built-in `sorted()`

## Solutions (Hidden by default)
[Code Cell - Solutions]

Implementation Hints:

Markdown essentials:

# Heading 1
## Heading 2

**bold** and *italic*

- Bullet list
1. Numbered list

`inline code`

```python
code block
```

[Link text](url)
![Alt text](image.png)

> Blockquote

---  (horizontal rule)

LaTeX math:

Inline: $E = mc^2$

Block:
$$
\sum_{i=1}^{n} i = \frac{n(n+1)}{2}
$$

Export your tutorial:

# Convert to HTML
jupyter nbconvert --to html tutorial.ipynb

# Convert to PDF (requires LaTeX)
jupyter nbconvert --to pdf tutorial.ipynb

# Convert to slides
jupyter nbconvert --to slides tutorial.ipynb --post serve
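
To publish a student version with the solutions stripped out, nbconvert's TagRemovePreprocessor can drop cells by tag. A sketch, assuming the solution cells are tagged "solution" in their cell metadata:

import nbformat
from traitlets.config import Config
from nbconvert import HTMLExporter

c = Config()
c.TagRemovePreprocessor.enabled = True
c.TagRemovePreprocessor.remove_cell_tags = ("solution",)  # drop cells tagged "solution"
c.HTMLExporter.preprocessors = ["nbconvert.preprocessors.TagRemovePreprocessor"]

nb = nbformat.read("tutorial.ipynb", as_version=4)
body, _ = HTMLExporter(config=c).from_notebook_node(nb)

with open("tutorial_student.html", "w") as f:
    f.write(body)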

Learning milestones:

  1. Master markdown → Headers, lists, emphasis
  2. Add math equations → LaTeX integration
  3. Include images → Visual explanations
  4. Export to HTML → Share with non-Jupyter users

Project 3: Reproducible Research Document

  • File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: R, Julia
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Scientific Computing / Reproducibility
  • Software or Tool: Jupyter, Papermill, requirements.txt
  • Main Book: “Reproducible Research with R and RStudio” by Christopher Gandrud

What you’ll build: A complete reproducible analysis that downloads data from an API, performs statistical analysis, generates publication-quality figures, and can be re-run by anyone to verify results.

Why it teaches Jupyter: The “reproducibility crisis” in science stems from analyses that can’t be replicated. This project teaches you to build notebooks that anyone can run and get the same results—essential for research and professional analysis.

Core challenges you’ll face:

  • Environment management → maps to conda/pip requirements
  • Random seed control → maps to reproducible random numbers
  • API data fetching → maps to handling external data
  • Figure export → maps to publication-ready graphics

Resources for key challenges:

Key Concepts:

  • Virtual Environments: Python Packaging Guide
  • Random Seeds: NumPy random documentation
  • Data Caching: Requests-cache library
  • Figure Export: Matplotlib savefig documentation

Difficulty: Intermediate Time estimate: 1 week Prerequisites: Projects 1-2, understanding of statistics basics

Real world outcome:

Repository structure:

project/
├── README.md
│   "How to reproduce this analysis"
├── requirements.txt
│   pandas==2.0.0
│   matplotlib==3.7.0
│   requests==2.28.0
├── environment.yml
│   (conda environment)
├── data/
│   └── .gitkeep
│   (data downloaded on run)
├── figures/
│   └── figure1.png
│   └── figure2.pdf
├── analysis.ipynb
│   (your reproducible notebook)
└── run_analysis.sh
    "pip install -r requirements.txt"
    "jupyter nbconvert --execute analysis.ipynb"

When others run your notebook:
$ git clone your-repo
$ pip install -r requirements.txt
$ jupyter nbconvert --execute analysis.ipynb
# → Same results, same figures, verified!

Implementation Hints:

Environment reproducibility:

# Cell 1: Version check
import sys
print(f"Python: {sys.version}")

import pandas as pd
print(f"Pandas: {pd.__version__}")

# Set random seed for reproducibility
import numpy as np
np.random.seed(42)

Generate requirements:

pip freeze > requirements.txt
# Or better, use pip-compile (pip-tools)
pip-compile requirements.in

Data caching:

import os
import requests

DATA_FILE = 'data/dataset.csv'

if not os.path.exists(DATA_FILE):
    response = requests.get('https://api.example.com/data')
    with open(DATA_FILE, 'w') as f:
        f.write(response.text)

df = pd.read_csv(DATA_FILE)
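
If the API is called from several cells, the requests-cache library listed above is a more transparent option: it caches every requests call in a local store. A minimal sketch, assuming requests-cache is added to requirements.txt:

import requests_cache

# All later requests.get() calls hit data/http_cache.sqlite first and
# expire after one day, so re-running the notebook does not re-download.
requests_cache.install_cache('data/http_cache', expire_after=86400)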

Publication-quality figures:

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 6), dpi=300)
# ... plotting code ...
fig.savefig('figures/figure1.pdf', bbox_inches='tight')
fig.savefig('figures/figure1.png', dpi=300, bbox_inches='tight')

Learning milestones:

  1. Pin dependencies → Same package versions everywhere
  2. Control randomness → Reproducible random numbers
  3. Cache data → Don’t re-download every run
  4. Export figures → Publication-ready graphics

Project 4: Interactive Visualization Dashboard

  • File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: R (with Shiny in notebooks)
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Data Visualization / Interactive Computing
  • Software or Tool: ipywidgets, Plotly, Voilà
  • Main Book: “Interactive Data Visualization for the Web” by Scott Murray

What you’ll build: An interactive dashboard with sliders, dropdowns, and buttons that update visualizations in real-time—all within a Jupyter notebook, deployable as a standalone web app with Voilà.

Why it teaches Jupyter: Jupyter’s widget system transforms notebooks from static documents into interactive applications. This project teaches the full power of ipywidgets and how to deploy notebooks as dashboards.

Core challenges you’ll face:

  • Widget types → maps to sliders, dropdowns, text inputs, buttons
  • Reactive updates → maps to observe and link mechanisms
  • Layout design → maps to HBox, VBox, GridBox
  • Voilà deployment → maps to notebooks as web apps

Resources for key challenges:

Key Concepts:

  • Widget Basics: ipywidgets User Guide
  • Linking Widgets: ipywidgets - Widget Events
  • Layout Widgets: ipywidgets - Widget Layout
  • Voilà Deployment: Voilà Documentation

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Projects 1-3, matplotlib/plotly basics

Real world outcome:

Your dashboard displays:

┌─────────────────────────────────────────────────────────────────────┐
│                    📊 Sales Analytics Dashboard                      │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  Date Range: [===========|=======] Jan 2024 - Dec 2024              │
│                                                                      │
│  Region: [Dropdown: All ▼]    Product: [Dropdown: All ▼]            │
│                                                                      │
│  ┌─────────────────────────────────┐ ┌────────────────────────────┐ │
│  │                                 │ │                            │ │
│  │     Monthly Sales Trend         │ │    Sales by Category      │ │
│  │     (Line Chart - Updates       │ │    (Pie Chart - Updates   │ │
│  │      when filters change)       │ │     when filters change)  │ │
│  │                                 │ │                            │ │
│  └─────────────────────────────────┘ └────────────────────────────┘ │
│                                                                      │
│  Total Sales: $1,234,567    YoY Growth: +15.3%    Top Region: West  │
│                                                                      │
│  [📥 Download Report]  [🔄 Refresh Data]                            │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Deploy with Voilà:
$ voila dashboard.ipynb
→ Opens as standalone web app (code hidden, only widgets visible)

Implementation Hints:

Basic widget pattern:

import ipywidgets as widgets
import matplotlib.pyplot as plt
from IPython.display import display

# Assumes df is a DataFrame (loaded earlier) with a numeric 'value' column

# Create widgets
slider = widgets.IntSlider(
    value=50,
    min=0,
    max=100,
    description='Filter:'
)

output = widgets.Output()

# Update function: redraw the chart whenever the slider value changes
def update(change):
    with output:
        output.clear_output(wait=True)
        filtered = df[df['value'] <= change['new']]
        filtered.plot(kind='bar')
        plt.show()

slider.observe(update, names='value')

# Display
display(slider, output)
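
For simple cases, ipywidgets also offers interact, which builds the widget and wires the callback in one step. A sketch assuming the same df with a numeric 'value' column:

from ipywidgets import interact
from IPython.display import display

# interact turns the (min, max) tuple into an IntSlider and re-runs
# the function every time the slider moves.
@interact(threshold=(0, 100))
def show_filtered(threshold=50):
    display(df[df['value'] <= threshold].head())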

Layout widgets:

from ipywidgets import HBox, VBox, GridBox, Layout

dashboard = VBox([
    HBox([date_slider, region_dropdown]),
    HBox([
        widgets.Output(layout=Layout(width='50%')),
        widgets.Output(layout=Layout(width='50%'))
    ]),
    HBox([download_button, refresh_button])
])

display(dashboard)

Deploy with Voilà:

# Install
pip install voila

# Run
voila dashboard.ipynb

# With custom template
voila dashboard.ipynb --template=material

Learning milestones:

  1. Use basic widgets → Sliders, dropdowns, buttons
  2. Link widgets to outputs → Reactive updates
  3. Design layouts → Professional-looking dashboard
  4. Deploy with Voilà → Share as web app

Project 5: Kernel Architecture Deep Dive

  • File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Any (for multi-kernel exploration)
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Systems Programming / IPC
  • Software or Tool: Jupyter, ZeroMQ, ipykernel
  • Main Book: “The Linux Programming Interface” by Michael Kerrisk

What you’ll build: A minimal Jupyter kernel from scratch that executes code and returns output, teaching you exactly how the notebook-to-kernel communication works.

Why it teaches Jupyter: Most users treat the kernel as magic. Building one reveals the ZeroMQ messaging protocol, the separation between frontend and backend, and why notebooks can support any programming language.

Core challenges you’ll face:

  • ZeroMQ messaging → maps to request-reply, publish-subscribe patterns
  • Message protocol → maps to Jupyter wire protocol
  • State management → maps to execution count, variable storage
  • Output handling → maps to stdout, stderr, display data

Resources for key challenges:

Key Concepts:

  • ZeroMQ Basics: “ZeroMQ Guide” by Pieter Hintjens
  • Jupyter Wire Protocol: Jupyter Messaging Documentation
  • IPython Kernel: ipykernel source code
  • Process Communication: “The Linux Programming Interface” Ch. 43

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 1-4, understanding of sockets/IPC

Real world outcome:

Your minimal kernel:

# my_kernel.py
from ipykernel.kernelbase import Kernel

class MyKernel(Kernel):
    implementation = 'My Kernel'
    implementation_version = '1.0'
    language = 'python'
    language_version = '3.10'
    banner = "My Custom Kernel"
    
    def do_execute(self, code, silent, store_history=True,
                   user_expressions=None, allow_stdin=False):
        # Execute code
        try:
            result = eval(code)
            if not silent:
                stream_content = {'name': 'stdout', 'text': str(result)}
                self.send_response(self.iopub_socket, 'stream', stream_content)
        except Exception as e:
            # Report the error on stderr
            if not silent:
                self.send_response(self.iopub_socket, 'stream',
                                   {'name': 'stderr', 'text': str(e)})

        return {'status': 'ok', 'execution_count': self.execution_count}

# Install the kernel
$ python -m my_kernel install

# Use in Jupyter
$ jupyter notebook
# Select "My Kernel" from kernel menu

# See the messages flying
$ jupyter console --existing
Kernel responds to:
- execute_request
- kernel_info_request
- complete_request (autocomplete)
- inspect_request (documentation)

Implementation Hints:

The Jupyter architecture:

Frontend (Notebook)
       │
       │ ZeroMQ (tcp://...)
       │
       ▼
┌──────────────────────────────────────────────────────┐
│                   KERNEL PROCESS                      │
│                                                       │
│  Shell Channel (execute, complete, inspect)          │
│  IOPub Channel (stdout, stderr, display_data)        │
│  Stdin Channel (input() requests)                    │
│  Control Channel (shutdown, interrupt)               │
│  Heartbeat Channel (alive check)                     │
│                                                       │
└──────────────────────────────────────────────────────┘

Message format:

{
    'header': {
        'msg_id': 'uuid',
        'msg_type': 'execute_request',
        'session': 'uuid',
    },
    'parent_header': {...},
    'metadata': {...},
    'content': {
        'code': 'print("hello")',
        'silent': False
    }
}
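
You can watch these messages without writing any frontend code by driving a running kernel from jupyter_client. A rough sketch (the connection file name is a placeholder; the real one lives under jupyter --runtime-dir):

from jupyter_client import BlockingKernelClient

client = BlockingKernelClient(connection_file='kernel-12345.json')  # placeholder path
client.load_connection_file()
client.start_channels()

msg_id = client.execute("print('hello')")  # sends execute_request on the shell channel
reply = client.get_shell_msg(timeout=5)    # execute_reply arrives on shell
print(reply['content']['status'])

# stdout, results, and busy/idle status arrive on the IOPub channel
while True:
    msg = client.get_iopub_msg(timeout=5)
    print(msg['msg_type'], msg['content'])
    if msg['msg_type'] == 'status' and msg['content']['execution_state'] == 'idle':
        break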

Questions to explore:

  • What happens when you press Shift+Enter?
  • How does tab-completion work?
  • How are plots displayed inline?
  • What happens when you restart the kernel?

Learning milestones:

  1. Understand ZeroMQ sockets → Communication channels
  2. Parse Jupyter messages → Wire protocol
  3. Handle execute_request → Run code and return output
  4. Install custom kernel → Use in real notebooks

Project 6: Multi-Language Notebook

  • File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
  • Main Programming Language: Python, R, Julia
  • Alternative Programming Languages: SQL, Bash, JavaScript
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Polyglot Programming / Data Pipelines
  • Software or Tool: Jupyter, IRkernel, IJulia, BeakerX
  • Main Book: “Data Science at the Command Line” by Jeroen Janssens

What you’ll build: A single notebook that uses multiple programming languages: Python for data processing, R for statistical analysis, SQL for database queries, and JavaScript for visualization—passing data between them.

Why it teaches Jupyter: The name Jupyter itself is a nod to Julia, Python, and R, the three languages it originally targeted. This project teaches you the kernel architecture by using multiple kernels and passing data between languages.

Core challenges you’ll face:

  • Multiple kernels → maps to installing and switching kernels
  • Data passing → maps to rpy2, julia, or file-based
  • Magic commands → maps to %%R, %%javascript, %%sql
  • Environment setup → maps to managing multiple language environments

Resources for key challenges:

Key Concepts:

  • Kernel Installation: Jupyter Documentation - Installing Kernels
  • rpy2 Python-R Bridge: rpy2 Documentation
  • SQL Magic: ipython-sql Documentation
  • Polyglot Notebooks: BeakerX Documentation

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Project 5, basic R/SQL knowledge

Real world outcome:

# Cell 1: Python - Load data
import pandas as pd
df = pd.read_csv('sales.csv')
print(f"Loaded {len(df)} rows")

# Cell 2: SQL Magic - Query data
%load_ext sql
%sql sqlite:///sales.db

%%sql
SELECT region, SUM(sales) as total
FROM sales_table
GROUP BY region
ORDER BY total DESC
LIMIT 5

# Cell 3: R Magic - Statistical analysis
%load_ext rpy2.ipython

%%R -i df
# df is passed from Python!
model <- lm(sales ~ advertising + price, data=df)
summary(model)

# Cell 4: JavaScript - Interactive visualization
%%javascript
// Access data passed from Python
require(['d3'], function(d3) {
    // Create D3 visualization
    d3.select('#viz').append('svg')...
});

# Cell 5: Bash - System commands
%%bash
wc -l sales.csv
head -5 sales.csv

Implementation Hints:

Install additional kernels:

# R kernel
$ Rscript -e "install.packages('IRkernel')"
$ Rscript -e "IRkernel::installspec()"

# Julia kernel
$ julia -e 'using Pkg; Pkg.add("IJulia")'

# SQL magic
$ pip install ipython-sql

# R magic
$ pip install rpy2

Data passing between languages:

# Python to R with rpy2
%load_ext rpy2.ipython
df = pd.DataFrame({'x': [1,2,3], 'y': [4,5,6]})

%%R -i df -o result
# df comes from Python
result <- summary(df)
# result goes back to Python

# Alternative: Use files
df.to_csv('temp.csv')
# Then read in R, Julia, etc.

Learning milestones:

  1. Install multiple kernels → R, Julia, SQL
  2. Use magic commands → %%R, %%sql, %%bash
  3. Pass data between languages → Python ↔ R
  4. Build polyglot pipeline → Best tool for each job

Project 7: Machine Learning Experiment Tracker

  • File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: R, Julia
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Machine Learning / MLOps
  • Software or Tool: Jupyter, MLflow, Weights & Biases
  • Main Book: “Hands-On Machine Learning” by Aurélien Géron

What you’ll build: A notebook-based ML experiment tracking system that logs hyperparameters, metrics, model artifacts, and visualizations—making ML experiments reproducible and comparable.

Why it teaches Jupyter: ML development is inherently iterative and exploratory—perfect for notebooks. This project teaches best practices for ML in notebooks: experiment tracking, model versioning, and result comparison.

Core challenges you’ll face:

  • Experiment logging → maps to MLflow, W&B integration
  • Hyperparameter tracking → maps to parameter logging
  • Artifact storage → maps to models, plots, data versions
  • Comparison dashboards → maps to comparing runs

Resources for key challenges:

Key Concepts:

  • Experiment Tracking: “MLOps” by Mark Treveil
  • Model Registry: MLflow Model Registry docs
  • Hyperparameter Logging: W&B Experiment Tracking
  • Artifact Management: MLflow Artifacts docs

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Projects 1-3, basic ML knowledge

Real world outcome:

# In your notebook:

import mlflow
import mlflow.sklearn

# Start experiment
mlflow.set_experiment("sales-prediction")

with mlflow.start_run():
    # Log parameters
    mlflow.log_param("model", "RandomForest")
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    
    # Train model
    model = RandomForestRegressor(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)
    
    # Log metrics
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    mlflow.log_metric("train_r2", train_score)
    mlflow.log_metric("test_r2", test_score)
    
    # Log model
    mlflow.sklearn.log_model(model, "model")
    
    # Log figure
    fig, ax = plt.subplots()
    ax.scatter(y_test, model.predict(X_test))
    mlflow.log_figure(fig, "predictions.png")

# View all experiments:
$ mlflow ui
# Opens dashboard comparing all runs!

MLflow Experiment Dashboard

Run       model           n_estimators   test_r2   Duration
run_001   RandomForest    100            0.85      2m 34s
run_002   GradientBoost   200            0.87      4m 12s
run_003   RandomForest    500            0.86      8m 45s

Implementation Hints:

MLflow setup:

# Install
pip install mlflow

# Start tracking server
mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./artifacts

# Or use W&B (cloud-based)
pip install wandb
wandb login
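
For scikit-learn work, recent MLflow versions can also log parameters, metrics, and the fitted model automatically, which cuts boilerplate in exploratory notebooks; a one-line sketch:

import mlflow

# Call once near the top of the notebook; model .fit() calls inside an
# active run are then logged (params, metrics, model artifact) automatically.
mlflow.autolog()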

Best practices for ML notebooks:

  1. One experiment per run → Clear boundaries
  2. Log everything → Parameters, metrics, artifacts
  3. Version data → DVC or artifact logging
  4. Save notebooks with outputs → Reproducibility
  5. Use tags → Easy filtering and search

Questions to explore:

  • How do you compare two experiment runs?
  • How do you load a model from a previous run? (see the sketch after this list)
  • How do you version your training data?
  • What should go in version control vs. artifact storage?
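
The model-loading question has a short answer worth sketching: a model logged with mlflow.sklearn.log_model can be reloaded from its run (the run ID below is a placeholder):

import mlflow.sklearn

# Reload the artifact named "model" from an earlier run
model = mlflow.sklearn.load_model("runs:/<run_id>/model")
print(model.predict(X_test[:5]))  # X_test as defined earlier in the notebook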

Learning milestones:

  1. Set up MLflow → Local tracking server
  2. Log first experiment → Parameters and metrics
  3. Compare runs → Find best hyperparameters
  4. Load previous model → Reproducibility

Project 8: Automated Report Generator

  • File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: R (with knitr)
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Automation / Reporting
  • Software or Tool: Papermill, nbconvert, scheduling
  • Main Book: “Automate the Boring Stuff with Python” by Al Sweigart

What you’ll build: A parameterized notebook that generates automated reports—daily sales reports, weekly metrics, monthly analyses—run via cron or scheduled tasks, outputting PDF/HTML reports.

Why it teaches Jupyter: Notebooks aren’t just for interactive work—they can be automated. Papermill lets you parameterize notebooks and run them programmatically, turning notebooks into report-generation pipelines.

Core challenges you’ll face:

  • Parameterization → maps to Papermill parameters
  • Programmatic execution → maps to nbconvert --execute
  • Output formats → maps to PDF, HTML, slides
  • Scheduling → maps to cron, Task Scheduler, Airflow

Resources for key challenges:

Key Concepts:

  • Parameterized Notebooks: Papermill User Guide
  • Headless Execution: nbconvert execute preprocessor
  • Template Customization: nbconvert templates
  • Scheduling: cron, Airflow documentation

Difficulty: Intermediate Time estimate: 1 week Prerequisites: Projects 1-3, command-line familiarity

Real world outcome:

# Template notebook: sales_report_template.ipynb

# Parameters cell (tagged with "parameters")
report_date = "2024-01-01"  # Will be overwritten
region = "all"              # Will be overwritten

# Analysis cells
df = load_data(start=report_date)
df = df[df['region'] == region] if region != "all" else df

# Visualizations...
# Summary statistics...
# Markdown conclusions...

# Run with Papermill:
$ papermill sales_report_template.ipynb \
    reports/sales_2024-01-15_west.ipynb \
    -p report_date "2024-01-15" \
    -p region "west"

# Convert to PDF:
$ jupyter nbconvert --to pdf reports/sales_2024-01-15_west.ipynb

# Automate with cron (daily at 6 AM):
0 6 * * * cd /project && ./generate_reports.sh

# Result: Every morning, stakeholders receive:
Daily Sales Report
January 15, 2024 - West Region

Executive Summary
─────────────────
Total Sales: $234,567
YoY Change: +12.3%

[Chart: Daily Sales Trend]

[Chart: Top Products]

Key Insights:
• Product A exceeded targets by 15%
• Marketing campaign drove 20% traffic increase

Implementation Hints:

Papermill workflow:

# 1. Create template with parameters cell
# Tag a cell with "parameters" in cell metadata

# 2. Run with papermill
import papermill as pm

pm.execute_notebook(
    'template.ipynb',
    'output.ipynb',
    parameters={
        'report_date': '2024-01-15',
        'region': 'west'
    }
)

# 3. Convert to PDF
from nbconvert import PDFExporter
import nbformat

nb = nbformat.read('output.ipynb', as_version=4)
pdf_exporter = PDFExporter()
pdf_data, resources = pdf_exporter.from_notebook_node(nb)

with open('report.pdf', 'wb') as f:
    f.write(pdf_data)
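
Recent Papermill versions can also report which parameters a template expects, which is handy before wiring it into a scheduler. A small sketch (assumes the parameters cell is tagged as described above):

import papermill as pm

# Returns a dict describing each parameter found in the tagged cell
params = pm.inspect_notebook('template.ipynb')
for name, info in params.items():
    print(name, info)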

Scheduling script:

#!/bin/bash
# generate_reports.sh

DATE=$(date +%Y-%m-%d)

for region in north south east west; do
    papermill template.ipynb \
        "reports/${DATE}_${region}.ipynb" \
        -p report_date "$DATE" \
        -p region "$region"
    
    jupyter nbconvert --to pdf "reports/${DATE}_${region}.ipynb"
done

# Email reports
mutt -a reports/*.pdf -- stakeholders@company.com < email_body.txt

Learning milestones:

  1. Parameterize a notebook → Tag parameters cell
  2. Run with Papermill → Programmatic execution
  3. Convert to PDF → Professional output
  4. Schedule with cron → Fully automated

Project 9: JupyterLab Extension

  • File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
  • Main Programming Language: TypeScript/JavaScript
  • Alternative Programming Languages: Python (for backend)
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Frontend Development / Plugin Architecture
  • Software or Tool: JupyterLab, Node.js, TypeScript
  • Main Book: “Programming TypeScript” by Boris Cherny

What you’ll build: A JupyterLab extension that adds new functionality—a custom sidebar, new toolbar buttons, or a completely new view—teaching you the JupyterLab extension architecture.

Why it teaches Jupyter: JupyterLab is built as a collection of extensions. Understanding how to extend it teaches you the modular architecture, PhosphorJS widgets, and the JupyterLab plugin system.

Core challenges you’ll face:

  • JupyterLab architecture → maps to plugins, services, widgets
  • TypeScript/React → maps to modern frontend development
  • Extension API → maps to commands, menus, sidebars
  • Building and publishing → maps to npm, conda-forge

Resources for key challenges:

Key Concepts:

  • Plugin System: JupyterLab Extension Guide
  • Lumino Widgets: Lumino Documentation
  • JupyterLab Services: @jupyterlab/services API
  • Extension Publishing: jupyter-packaging docs

Difficulty: Expert Time estimate: 3-4 weeks Prerequisites: Projects 1-5, TypeScript/React experience

Real world outcome:

Your extension adds a "Code Snippets" sidebar:

┌─────────────────────────────────────────────────────────────────────┐
│  JupyterLab                                                          │
├──────────────┬──────────────────────────────────────────────────────┤
│              │                                                       │
│  📁 Files    │   [Notebook]                                         │
│              │                                                       │
│  🔧 Running  │   [1]: import pandas as pd                           │
│              │                                                       │
│  📝 Snippets │   [2]: df = pd.read_csv('data.csv')                  │
│  ───────────│                                                       │
│  > Data Load │   [3]: df.head()                                     │
│    - CSV     │                                                       │
│    - JSON    │                                                       │
│    - SQL     │                                                       │
│  > Plotting  │                                                       │
│    - Line    │                                                       │
│    - Bar     │                                                       │
│    - Scatter │                                                       │
│  > ML        │                                                       │
│    - Train   │                                                       │
│    - Eval    │                                                       │
│              │                                                       │
│  [+ Add]     │                                                       │
│              │                                                       │
└──────────────┴──────────────────────────────────────────────────────┘

Clicking a snippet inserts it into the current cell!

Implementation Hints:

Extension structure:

my-extension/
├── package.json          # Dependencies, scripts
├── tsconfig.json         # TypeScript config
├── src/
│   └── index.ts          # Plugin entry point
├── style/
│   └── index.css         # Styles
└── schema/
    └── plugin.json       # Settings schema

Basic plugin structure:

// src/index.ts
import { JupyterFrontEnd, JupyterFrontEndPlugin } from '@jupyterlab/application';
import { ICommandPalette } from '@jupyterlab/apputils';

const plugin: JupyterFrontEndPlugin<void> = {
  id: 'my-extension:plugin',
  autoStart: true,
  requires: [ICommandPalette],
  activate: (app: JupyterFrontEnd, palette: ICommandPalette) => {
    console.log('Extension activated!');
    
    // Add a command
    app.commands.addCommand('my-extension:hello', {
      label: 'Say Hello',
      execute: () => {
        alert('Hello from my extension!');
      }
    });
    
    // Add to command palette
    palette.addItem({
      command: 'my-extension:hello',
      category: 'My Extension'
    });
  }
};

export default plugin;

Build and install:

# Create from cookiecutter template
pip install cookiecutter
cookiecutter https://github.com/jupyterlab/extension-cookiecutter-ts

# Install dependencies
cd my-extension
npm install

# Build
npm run build

# Install in JupyterLab
pip install -e .
jupyter labextension develop . --overwrite

# Watch for changes during development
npm run watch

Learning milestones:

  1. Create from template → Cookiecutter setup
  2. Add simple command → Command palette integration
  3. Create sidebar widget → Lumino widgets
  4. Publish to npm → Share with community

Project 10: Real-Time Collaborative Notebook

  • File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
  • Main Programming Language: Python (backend), TypeScript (frontend)
  • Alternative Programming Languages: None
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Distributed Systems / Real-Time Collaboration
  • Software or Tool: JupyterHub, jupyter-collaboration
  • Main Book: “Designing Data-Intensive Applications” by Martin Kleppmann

What you’ll build: A real-time collaborative notebook system where multiple users can edit the same notebook simultaneously—like Google Docs for code. Understand the CRDT algorithms that make this possible.

Why it teaches Jupyter: Real-time collaboration is the frontier of notebook technology. JupyterLab now supports this via CRDTs (Conflict-free Replicated Data Types). Understanding this teaches distributed systems concepts through a practical lens.

Core challenges you’ll face:

  • JupyterHub setup → maps to multi-user deployment
  • Real-time sync → maps to Y.js, CRDTs
  • Conflict resolution → maps to operational transformation
  • User presence → maps to cursors, selections

Resources for key challenges:

Key Concepts:

  • CRDTs: “Designing Data-Intensive Applications” Ch. 5
  • Operational Transformation: Google Docs engineering blog
  • WebSocket Communication: MDN WebSocket documentation
  • JupyterHub Architecture: JupyterHub Technical Overview

Difficulty: Expert Time estimate: 4-6 weeks Prerequisites: Projects 5 and 9, distributed systems basics

Real world outcome:

Two users editing the same notebook simultaneously:

┌─────────────────────────────┐     ┌─────────────────────────────┐
│  Alice's Browser            │     │  Bob's Browser              │
├─────────────────────────────┤     ├─────────────────────────────┤
│                             │     │                             │
│  [1]: x = 10    🟢Alice     │     │  [1]: x = 10                │
│       y = 20    🔵Bob←      │     │       y = 20    🔵Bob←      │
│                             │     │                             │
│  [2]: # Analysis  🟢Alice←  │     │  [2]: # Analysis  🟢Alice←  │
│       Exploring...          │     │       Exploring...          │
│                             │     │                             │
│  ───────────────────────── │     │  ───────────────────────── │
│  🟢 Alice (you)             │     │  🟢 Alice                   │
│  🔵 Bob                     │     │  🔵 Bob (you)               │
│                             │     │                             │
└─────────────────────────────┘     └─────────────────────────────┘
                ↑                                 ↑
                └────────────┬────────────────────┘
                             │
                     ┌───────┴───────┐
                     │ Y.js WebSocket │
                     │    Server      │
                     │   (CRDTs)      │
                     └───────────────┘

Changes sync instantly:
- Alice types → Bob sees immediately
- Cursors show where each user is
- Conflicts resolved automatically

Implementation Hints:

JupyterHub setup:

# Install JupyterHub
pip install jupyterhub
npm install -g configurable-http-proxy

# Generate config
jupyterhub --generate-config

# Install collaboration extension
pip install jupyter-collaboration

# Start hub
jupyterhub

Understanding CRDTs:

Traditional sync: Client → Server → Conflict → Manual resolution
CRDT sync: 
  Client A → [State A]
  Client B → [State B]
  Merge → [State A ∪ B] (automatically consistent!)

Y.js implements:
- Y.Text: Collaborative text editing
- Y.Array: Collaborative lists
- Y.Map: Collaborative key-value

Notebook cells become Y.Array of Y.Map objects!

Key architecture:

┌──────────┐    ┌──────────┐
│ Client A │    │ Client B │
└────┬─────┘    └────┬─────┘
     │               │
     │  WebSocket    │
     ▼               ▼
┌──────────────────────────┐
│    Y.js Provider         │
│    (Awareness + Sync)    │
├──────────────────────────┤
│    JupyterHub Server     │
│    (Auth + Routing)      │
└──────────────────────────┘

Questions to explore:

  • What happens if two users edit the same cell?
  • How are cursors transmitted between clients?
  • What happens when a user goes offline and comes back?
  • How is the notebook persisted to disk?

Learning milestones:

  1. Deploy JupyterHub → Multi-user setup
  2. Enable collaboration → RTC extension
  3. Understand CRDTs → How conflicts are resolved
  4. Observe sync → Debug WebSocket messages

Project 11: Notebook Testing Framework

  • File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: None
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Testing / Quality Assurance
  • Software or Tool: nbval, pytest, testbook
  • Main Book: “Python Testing with pytest” by Brian Okken

What you’ll build: A testing framework for notebooks that validates outputs, checks for errors, and integrates with CI/CD pipelines—treating notebooks as testable artifacts.

Why it teaches Jupyter: Notebooks are often criticized for being “untestable.” This project teaches you to treat notebooks as first-class tested artifacts, essential for production notebook workflows.

Core challenges you’ll face:

  • Output validation → maps to nbval expected outputs
  • Cell execution testing → maps to testbook fixtures
  • CI integration → maps to GitHub Actions, GitLab CI
  • Regression testing → maps to detecting output changes

Resources for key challenges:

Key Concepts:

  • Notebook Validation: nbval documentation
  • Unit Testing Cells: testbook user guide
  • CI/CD for Notebooks: GitHub Actions documentation
  • Property-Based Testing: Hypothesis documentation

Difficulty: Intermediate Time estimate: 1-2 weeks Prerequisites: Projects 1-3, pytest experience

Real world outcome:

# Test file: test_analysis.py

from testbook import testbook

@testbook('analysis.ipynb', execute=True)
def test_data_loading(tb):
    """Test that data loads correctly."""
    df = tb.ref('df')  # Get variable from notebook
    assert len(df) > 0
    assert 'sales' in df.columns

@testbook('analysis.ipynb')
def test_specific_cell(tb):
    """Test a specific cell's output."""
    tb.execute_cell('data_cleaning')  # Execute cell by tag
    result = tb.cell_output_text('data_cleaning')
    assert 'cleaned' in result.lower()

@testbook('analysis.ipynb')
def test_visualization(tb):
    """Test that visualization produces output."""
    tb.execute_cell('plot')
    assert tb.cell_output_type('plot') == 'display_data'

# Run with pytest:
$ pytest test_analysis.py -v

test_analysis.py::test_data_loading PASSED
test_analysis.py::test_specific_cell PASSED  
test_analysis.py::test_visualization PASSED

# Or use nbval for output comparison:
$ pytest --nbval analysis.ipynb

analysis.ipynb::Cell 0 PASSED
analysis.ipynb::Cell 1 PASSED
analysis.ipynb::Cell 2 PASSED
...

# GitHub Actions workflow:
name: Test Notebooks
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-python@v4
      - run: pip install pytest testbook nbval
      - run: pytest --nbval notebooks/

Implementation Hints:

nbval (output validation):

# Install
pip install nbval

# Run - checks that saved outputs match re-execution
pytest --nbval my_notebook.ipynb

# Sanitize outputs (ignore variable parts like timestamps)
pytest --nbval --sanitize-with sanitize.cfg my_notebook.ipynb

# sanitize.cfg:
# [regex]
# regex: \d{4}-\d{2}-\d{2}
# replace: DATE

testbook (unit testing):

from testbook import testbook

@testbook('notebook.ipynb')
def test_function(tb):
    # Execute all cells
    tb.execute()
    
    # Or execute specific cells by index
    tb.execute_cell([0, 1, 2])
    
    # Or by tag
    tb.execute_cell('setup')
    
    # Get variables
    my_var = tb.ref('my_var')
    
    # Inject code into the notebook's kernel
    tb.inject("""
        test_input = [1, 2, 3]
    """)
    
    # Call functions defined in the notebook
    # (arguments are serialized to the kernel, so pass plain Python values)
    func = tb.ref('my_function')
    result = func([1, 2, 3])
    assert result == expected  # 'expected' is whatever you precomputed for this input

CI/CD best practices:

  1. Execute notebooks from scratch → No stale outputs
  2. Test with different data → Parameterized tests
  3. Check execution time → Performance regression
  4. Validate markdown → No broken links
  5. Lint notebooks → nbqa, pre-commit

Learning milestones:

  1. Validate outputs → nbval basics
  2. Unit test cells → testbook framework
  3. Set up CI → GitHub Actions
  4. Test data variations → Parameterized testing

Project 12: GPU-Accelerated Data Science

  • File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: None (CUDA-specific)
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: High Performance Computing / GPU Programming
  • Software or Tool: RAPIDS (cuDF, cuML), Google Colab
  • Main Book: “Hands-On GPU Computing with Python” by Avimanyu Bandyopadhyay

What you’ll build: A notebook workflow that processes large datasets on GPUs using RAPIDS (cuDF, cuML), demonstrating 10-100x speedups over CPU-based Pandas/Scikit-learn.

Why it teaches Jupyter: Data science workloads are increasingly GPU-accelerated. Notebooks are the primary interface for GPU data science via RAPIDS and Google Colab. This project teaches you to leverage GPUs interactively.

Core challenges you’ll face:

  • GPU memory management → maps to understanding VRAM limits
  • cuDF vs Pandas → maps to API differences
  • cuML vs Scikit-learn → maps to GPU-accelerated ML
  • Colab/cloud GPUs → maps to accessing GPU resources

Resources for key challenges:

Key Concepts:

  • cuDF Basics: RAPIDS cuDF User Guide
  • GPU Memory: CUDA Programming Guide
  • cuML Algorithms: RAPIDS cuML documentation
  • Dask Integration: Dask-cuDF for larger-than-GPU data

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 1-3, understanding of GPU concepts

Real world outcome:

# Google Colab or local GPU environment

# Cell 1: Install RAPIDS
!pip install cudf-cu12 cuml-cu12 --extra-index-url=https://pypi.nvidia.com

# Cell 2: Compare Pandas vs cuDF
import pandas as pd
import cudf
import time

# Load 10 million rows
pandas_df = pd.read_csv('large_data.csv')  # ~30 seconds
cudf_df = cudf.read_csv('large_data.csv')  # ~2 seconds

# Benchmark operations
# Pandas
start = time.time()
pandas_df.groupby('category').agg({'value': 'mean'})
print(f"Pandas: {time.time() - start:.2f}s")  # ~5 seconds

# cuDF (GPU)
start = time.time()
cudf_df.groupby('category').agg({'value': 'mean'})
print(f"cuDF: {time.time() - start:.2f}s")    # ~0.05 seconds

# 100x speedup!

# Cell 3: GPU Machine Learning
from cuml import RandomForestClassifier as cuRF
from sklearn.ensemble import RandomForestClassifier as skRF

# Scikit-learn (CPU)
sk_model = skRF(n_estimators=100)
%time sk_model.fit(X_train, y_train)  # ~2 minutes

# cuML (GPU)
cu_model = cuRF(n_estimators=100)
%time cu_model.fit(X_train, y_train)  # ~5 seconds

# 24x speedup!

Implementation Hints:

Setting up GPU environment:

# Check GPU availability
!nvidia-smi

# Google Colab: Runtime → Change runtime type → GPU

# Local: Install RAPIDS (Linux only, or WSL2)
conda install -c rapidsai -c conda-forge -c nvidia \
    rapids=24.02 python=3.10 cuda-version=12.0

cuDF API (mostly Pandas-compatible):

import cudf

# Read data
gdf = cudf.read_csv('data.csv')
gdf = cudf.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

# Operations (same as Pandas)
gdf['c'] = gdf['a'] + gdf['b']
grouped = gdf.groupby('a').mean()
filtered = gdf[gdf['a'] > 1]

# Convert to/from Pandas
pandas_df = gdf.to_pandas()
gdf = cudf.from_pandas(pandas_df)

GPU memory management:

import rmm  # RAPIDS Memory Manager

# Check memory
!nvidia-smi --query-gpu=memory.used --format=csv

# Clear GPU memory
import gc
gc.collect()
rmm.reinitialize(pool_allocator=True)
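
For data that doesn't fit in VRAM, the Key Concepts above point to Dask-cuDF, which splits a DataFrame into GPU-backed partitions and schedules work across them. A minimal sketch, assuming the data is spread over several CSV files with the same 'category'/'value' columns as the benchmark above:

import dask_cudf

# Each partition is a cuDF DataFrame living on the GPU
ddf = dask_cudf.read_csv('large_data_part_*.csv')

# Same groupby as before, but lazy: .compute() triggers execution
result = ddf.groupby('category')['value'].mean().compute()
print(result)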

Learning milestones:

  1. Access GPU → Colab or local setup
  2. Use cuDF → GPU-accelerated DataFrames
  3. Compare performance → Benchmark CPU vs GPU
  4. Train ML on GPU → cuML algorithms

Project 13: Notebook-to-Production Pipeline

  • File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: None
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: MLOps / Software Engineering
  • Software or Tool: nbdev, Jupyter, pytest
  • Main Book: “Building Machine Learning Pipelines” by Hannes Hapke

What you’ll build: A workflow that converts exploratory notebooks into production Python packages—extracting functions, adding tests, generating documentation—using nbdev or manual refactoring patterns.

Why it teaches Jupyter: The biggest criticism of notebooks is “notebook spaghetti”—code that can’t be productionized. This project teaches the discipline of writing production-ready code in notebooks using nbdev’s literate programming approach.

Core challenges you’ll face:

  • Code extraction → maps to nbdev export, manual refactoring
  • Test generation → maps to cells as tests
  • Documentation → maps to docstrings, quarto
  • Packaging → maps to setup.py, pyproject.toml

Resources for key challenges:

Key Concepts:

  • Literate Programming: nbdev philosophy
  • Python Packaging: Python Packaging User Guide
  • Documentation Generation: Quarto documentation
  • CI/CD for Packages: GitHub Actions

Difficulty: Advanced Time estimate: 2-3 weeks Prerequisites: Projects 1-8, Python packaging knowledge

Real world outcome:

Your notebook-based development:

notebooks/
├── 00_core.ipynb       # Core library code
├── 01_data.ipynb       # Data loading utilities
├── 02_models.ipynb     # Model definitions
├── 03_training.ipynb   # Training pipeline
└── index.ipynb         # Documentation homepage

↓ nbdev_export ↓

my_package/
├── __init__.py
├── core.py             # Extracted from 00_core.ipynb
├── data.py             # Extracted from 01_data.ipynb
├── models.py           # Extracted from 02_models.ipynb
└── training.py         # Extracted from 03_training.ipynb

tests/
├── test_core.py        # From test cells in 00_core.ipynb
└── test_data.py        # From test cells in 01_data.ipynb

docs/
├── index.html          # Generated from notebooks
├── core.html
└── ...

# Install as package
pip install -e .

# Use in production
from my_package import train_model
train_model(config)

Implementation Hints:

nbdev workflow:

# In notebook cell, mark for export:
#| export
def my_function(x):
    """This function will be exported."""
    return x * 2

# Test cells: any code cell without #| export is run by nbdev_test,
# so plain assertions double as tests:
assert my_function(2) == 4

# Hide a cell's code in the rendered docs (the cell still executes):
#| echo: false
# This is explanation...
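
One directive the examples above leave implicit: each notebook declares which module it exports to with #| default_exp, usually in its first code cell (module and package names follow the layout shown earlier):

# First cell of 00_core.ipynb
#| default_exp core

# With lib_name = my_package in settings.ini, every #| export cell in this
# notebook is written to my_package/core.py by nbdev_export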

nbdev commands:

# Initialize nbdev
nbdev_new

# Export to Python modules
nbdev_export

# Run tests
nbdev_test

# Build documentation
nbdev_docs

# Prepare for release
nbdev_prepare

Manual refactoring pattern:

# 1. Identify reusable code in notebook
# 2. Extract to functions with docstrings
# 3. Move to .py files
# 4. Import in notebook for testing
# 5. Add unit tests
# 6. Package with setup.py

# Notebook becomes integration test:
from my_package import process_data, train_model

# Interactive development with imported code
df = process_data('data.csv')
model = train_model(df)

Learning milestones:

  1. Set up nbdev → Initialize project
  2. Export code → Cells to modules
  3. Generate tests → Test cells
  4. Build documentation → Quarto output
  5. Publish package → PyPI release

Project 14: Complete Data Science Platform

  • File: LEARN_JUPYTER_NOTEBOOKS_DEEP_DIVE.md
  • Main Programming Language: Python
  • Alternative Programming Languages: R, SQL
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 5: Master
  • Knowledge Area: Platform Engineering / Full Stack
  • Software or Tool: JupyterHub, Kubernetes, All previous projects
  • Main Book: “Kubernetes in Action” by Marko Lukša

What you’ll build: A complete data science platform combining JupyterHub (multi-user), MLflow (experiments), Voilà (dashboards), and orchestration (Kubernetes/Docker)—a mini enterprise data science platform.

Why it teaches Jupyter: This capstone project integrates everything: notebooks for development, dashboards for stakeholders, experiment tracking for ML, and scalable infrastructure for teams.

Core challenges you’ll face:

  • JupyterHub on Kubernetes → maps to Zero to JupyterHub
  • Shared storage → maps to NFS, S3
  • User management → maps to OAuth, LDAP
  • Service integration → maps to MLflow, databases

Difficulty: Master Time estimate: 2-3 months Prerequisites: All previous projects, Kubernetes basics

Real world outcome:

Your platform architecture:

┌─────────────────────────────────────────────────────────────────────┐
│                        DATA SCIENCE PLATFORM                         │
├─────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐              │
│  │   User A     │  │   User B     │  │   User C     │              │
│  │  (Notebook)  │  │  (Notebook)  │  │  (Notebook)  │              │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘              │
│         │                 │                 │                        │
│         └─────────────────┼─────────────────┘                        │
│                           │                                          │
│                    ┌──────┴──────┐                                   │
│                    │ JupyterHub  │                                   │
│                    │   (Auth)    │                                   │
│                    └──────┬──────┘                                   │
│                           │                                          │
│         ┌─────────────────┼─────────────────┐                        │
│         ▼                 ▼                 ▼                        │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐              │
│  │    MLflow    │  │    Voilà     │  │   Shared     │              │
│  │ (Experiments)│  │ (Dashboards) │  │   Storage    │              │
│  └──────────────┘  └──────────────┘  └──────────────┘              │
│                                                                      │
│  Infrastructure: Kubernetes / Docker Compose                        │
│                                                                      │
└─────────────────────────────────────────────────────────────────────┘

Users experience:
1. Login with company credentials (OAuth)
2. Spawn personal Jupyter environment
3. Access shared data on NFS/S3
4. Track experiments in MLflow
5. Deploy dashboards with Voilà
6. Collaborate in real-time

Implementation Hints:

Docker Compose (development):

# docker-compose.yml
version: '3'
services:
  jupyterhub:
    image: jupyterhub/jupyterhub
    ports:
      - "8000:8000"
    volumes:
      - ./jupyterhub_config.py:/srv/jupyterhub/jupyterhub_config.py
      - ./data:/data
    environment:
      - DOCKER_NETWORK_NAME=ds-platform
  
  mlflow:
    image: ghcr.io/mlflow/mlflow
    ports:
      - "5000:5000"
    command: mlflow server --host 0.0.0.0
    volumes:
      - ./mlruns:/mlruns
  
  voila:
    image: voila/voila
    ports:
      - "8866:8866"
    volumes:
      - ./dashboards:/dashboards
    command: voila --no-browser /dashboards
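
The Compose file mounts a jupyterhub_config.py; a minimal sketch of what it might contain, assuming dockerspawner and oauthenticator are installed and the OAuth values come from your identity provider:

# jupyterhub_config.py
c = get_config()  # injected by JupyterHub at startup

# Spawn each user's server as a container on the shared network
c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"
c.DockerSpawner.image = "jupyter/scipy-notebook"
c.DockerSpawner.network_name = "ds-platform"

# Log users in with the company OAuth provider
c.JupyterHub.authenticator_class = "oauthenticator.generic.GenericOAuthenticator"
c.GenericOAuthenticator.client_id = "..."      # placeholder
c.GenericOAuthenticator.client_secret = "..."  # placeholder
c.GenericOAuthenticator.oauth_callback_url = "http://localhost:8000/hub/oauth_callback"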

Kubernetes (production):

# Use Zero to JupyterHub
helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
helm install jhub jupyterhub/jupyterhub --values config.yaml
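
The --values file is where most customization lives; a minimal sketch of config.yaml using the Zero to JupyterHub chart's value names (image, memory limits, and admin user are placeholders to adapt):

# config.yaml for the Zero to JupyterHub Helm chart
singleuser:
  image:
    name: jupyter/scipy-notebook
    tag: latest
  memory:
    limit: 2G
    guarantee: 1G
hub:
  config:
    Authenticator:
      admin_users:
        - alice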

Integration:

# In notebook, connect to platform services

# MLflow
import mlflow
mlflow.set_tracking_uri("http://mlflow:5000")

# Shared storage (S3 via s3fs)
import pandas as pd
import s3fs

fs = s3fs.S3FileSystem(anon=False)
df = pd.read_csv(fs.open('s3://shared-data/dataset.csv'))

# Database (add credentials as needed, e.g. postgresql://user:password@db:5432/analytics)
from sqlalchemy import create_engine
engine = create_engine('postgresql://db:5432/analytics')

Learning milestones:

  1. Deploy JupyterHub → Multi-user Jupyter
  2. Add MLflow → Experiment tracking
  3. Configure storage → Shared data access
  4. Set up auth → OAuth/LDAP
  5. Add Voilà → Dashboard deployment

Project Comparison Table

| # | Project | Difficulty | Time | Key Skill | Fun |
|---|---------|------------|------|-----------|-----|
| 1 | Interactive Data Explorer | ⭐ | Weekend | Cell Execution | ⭐⭐⭐ |
| 2 | Literate Programming Tutorial | ⭐ | 3-5 days | Markdown/LaTeX | ⭐⭐⭐ |
| 3 | Reproducible Research | ⭐⭐ | 1 week | Environment Management | ⭐⭐⭐ |
| 4 | Interactive Dashboard | ⭐⭐ | 1-2 weeks | ipywidgets | ⭐⭐⭐⭐ |
| 5 | Kernel Architecture | ⭐⭐⭐ | 2-3 weeks | ZeroMQ/IPC | ⭐⭐⭐⭐ |
| 6 | Multi-Language Notebook | ⭐⭐ | 1-2 weeks | Polyglot Programming | ⭐⭐⭐⭐ |
| 7 | ML Experiment Tracker | ⭐⭐ | 1-2 weeks | MLflow/W&B | ⭐⭐⭐⭐ |
| 8 | Automated Report Generator | ⭐⭐ | 1 week | Papermill | ⭐⭐⭐ |
| 9 | JupyterLab Extension | ⭐⭐⭐⭐ | 3-4 weeks | TypeScript/React | ⭐⭐⭐⭐ |
| 10 | Real-Time Collaboration | ⭐⭐⭐⭐ | 4-6 weeks | CRDTs/WebSocket | ⭐⭐⭐⭐⭐ |
| 11 | Notebook Testing | ⭐⭐ | 1-2 weeks | pytest/nbval | ⭐⭐⭐ |
| 12 | GPU-Accelerated DS | ⭐⭐⭐ | 2-3 weeks | RAPIDS/cuDF | ⭐⭐⭐⭐ |
| 13 | Notebook-to-Production | ⭐⭐⭐ | 2-3 weeks | nbdev | ⭐⭐⭐⭐ |
| 14 | Complete DS Platform | ⭐⭐⭐⭐⭐ | 2-3 months | Full Stack | ⭐⭐⭐⭐⭐ |

Phase 1: Fundamentals (1-2 weeks)

Understand why notebooks exist and basic usage:

  1. Project 1: Interactive Data Explorer - Core notebook skills
  2. Project 2: Literate Programming Tutorial - Documentation features

Phase 2: Professional Usage (2-3 weeks)

Learn to use notebooks professionally:

  1. Project 3: Reproducible Research - Environment management
  2. Project 4: Interactive Dashboard - Widgets and interactivity
  3. Project 8: Automated Report Generator - Parameterized notebooks

Phase 3: Data Science Workflows (2-3 weeks)

Apply notebooks to data science:

  1. Project 6: Multi-Language Notebook - Polyglot programming
  2. Project 7: ML Experiment Tracker - MLOps integration
  3. Project 11: Notebook Testing - Quality assurance

Phase 4: Advanced Architecture (3-4 weeks)

Understand how notebooks work:

  1. Project 5: Kernel Architecture - Under the hood
  2. Project 9: JupyterLab Extension - Extending Jupyter

Phase 5: Production & Scale (4-8 weeks)

Enterprise-grade notebook workflows:

  1. Project 12: GPU-Accelerated DS - High-performance computing
  2. Project 13: Notebook-to-Production - Code extraction
  3. Project 10: Real-Time Collaboration - Multi-user editing
  4. Project 14: Complete DS Platform - Full infrastructure

Final Project: End-to-End Data Science Workflow

Following the same pattern as above, this capstone applies everything:

What you’ll build: A complete end-to-end data science workflow in notebooks:

  1. Data ingestion from multiple sources (APIs, databases, files)
  2. Exploratory data analysis with rich visualizations
  3. Feature engineering with documentation
  4. Model development with experiment tracking
  5. Interactive dashboard for stakeholders
  6. Automated reporting pipeline
  7. Production package extraction
  8. CI/CD for the entire workflow

This final project demonstrates:

  • When to use notebooks vs. pure code
  • Professional notebook organization
  • Reproducibility at every step
  • From exploration to production

Summary

| # | Project | Main Language |
|---|---------|---------------|
| 1 | Interactive Data Explorer | Python |
| 2 | Literate Programming Tutorial | Python/Markdown |
| 3 | Reproducible Research Document | Python |
| 4 | Interactive Visualization Dashboard | Python |
| 5 | Kernel Architecture Deep Dive | Python |
| 6 | Multi-Language Notebook | Python/R/Julia |
| 7 | Machine Learning Experiment Tracker | Python |
| 8 | Automated Report Generator | Python |
| 9 | JupyterLab Extension | TypeScript |
| 10 | Real-Time Collaborative Notebook | Python/TypeScript |
| 11 | Notebook Testing Framework | Python |
| 12 | GPU-Accelerated Data Science | Python |
| 13 | Notebook-to-Production Pipeline | Python |
| 14 | Complete Data Science Platform | Python |

Resources

Essential Books

  • “Python for Data Analysis” by Wes McKinney - Pandas creator, covers notebooks
  • “Hands-On Machine Learning” by Aurélien Géron - Uses notebooks throughout
  • “Data Science at the Command Line” by Jeroen Janssens - Alternative perspective

Tools

  • Jupyter Notebook: https://jupyter.org/ - Classic interface
  • JupyterLab: https://jupyterlab.readthedocs.io/ - Modern interface
  • Google Colab: https://colab.research.google.com/ - Free GPU notebooks
  • VSCode Jupyter: https://code.visualstudio.com/docs/datascience/jupyter-notebooks
  • nbdev: https://nbdev.fast.ai/ - Notebooks to packages
  • Voilà: https://voila.readthedocs.io/ - Notebooks to dashboards

Documentation

Practice Platforms

  • Kaggle Notebooks: https://www.kaggle.com/code - Data science competitions
  • Observable: https://observablehq.com/ - JavaScript notebooks
  • Deepnote: https://deepnote.com/ - Collaborative notebooks

Total Estimated Time: 4-6 months of dedicated study

After completion: You’ll understand exactly why notebooks exist, when to use them (and when not to), how to build professional data science workflows, and how the entire Jupyter architecture works from browser to kernel. You’ll be able to build interactive dashboards, automate reports, track experiments, and even extend JupyterLab with custom functionality.