Project 13: Distribution Sampler and Visualizer
A tool that generates samples from various probability distributions (uniform, normal, exponential, Poisson, binomial) and visualizes them as histograms, showing how they match the theoretical PDF/PMF.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate (The Developer) |
| Main Programming Language | Python |
| Alternative Programming Languages | Julia, R, JavaScript |
| Coolness Level | Level 2: Practical but Forgettable |
| Business Potential | 2. The “Micro-SaaS / Pro Tool” (Solo-Preneur Potential) |
| Knowledge Area | Probability Distributions / Statistics |
| Software or Tool | Distribution Toolkit |
| Main Book | “Think Stats” by Allen Downey |
1. Learning Objectives
By completing this project, you will:
- Translate math definitions into deterministic implementation steps.
- Build validation checks that make correctness observable.
- Diagnose numerical, logical, and data-shape failures early.
- Explain tradeoffs in interviews using evidence from your own build.
2. All Theory Needed (Per-Concept Breakdown)
This project applies the following theory clusters:
- Symbolic-to-numeric translation (expressions, data shapes, invariants)
- Stability constraints (precision, scaling, stopping criteria)
- Optimization or inference logic (depending on project objective)
- Evaluation discipline (error analysis, test coverage, reproducibility)
Concept A: Mathematical Representation Discipline
Fundamentals: A math expression is not executable until you define representation, ordering, and domain constraints. The same equation can be represented as a token stream, tree, matrix pipeline, or probability graph. Choosing representation determines what bugs you can catch early.
Deep Dive into the concept: Most project failures begin before algorithm selection: they start with ambiguous representation. If your parser cannot distinguish unary minus from subtraction, your calculator fails. If your matrix dimensions are implicit rather than validated, your linear algebra pipeline fails silently. If your probabilistic assumptions (independence, stationarity, or class priors) are not explicit, your inference can look accurate on one split and collapse on another. The core implementation move is to treat representation as a contract. Define each object with shape, domain, and semantic intent. Then enforce invariants at boundaries: input parser, preprocessing, training loop, evaluation stage. This makes debugging local instead of global.
How this fits this project: You will encode each operation with explicit contracts and invariant checks.
Definitions & key terms
- Invariant: Property that must hold before and after each operation.
- Shape contract: Expected dimensional structure of vectors/matrices/tensors.
- Domain constraint: Allowed value range (for example log input > 0).
Mental model diagram
```
User Input -> Representation Layer -> Validated Operation -> Observable Output
              (tokens/shapes)         (invariants pass)      (tests/plots/logs)
```
How it works
- Parse/ingest data into typed structures.
- Validate shape/domain invariants.
- Execute operation.
- Compare observed output with expected behavior.
- Record failure signature if mismatch appears.
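Applied to this project's sampler, the steps above might look like the following sketch. `validate_params` and `sample_normal` are hypothetical names for illustration, not part of any required interface:

```python
import random

def validate_params(mean, std, n):
    # Domain invariants, checked at the boundary before any sampling runs.
    if std <= 0:
        raise ValueError(f"std must be > 0, got {std}")
    if n <= 0:
        raise ValueError(f"n must be > 0, got {n}")

def sample_normal(mean, std, n):
    validate_params(mean, std, n)
    samples = [random.gauss(mean, std) for _ in range(n)]
    # Shape invariant: output length matches the request.
    assert len(samples) == n
    return samples

random.seed(0)
xs = sample_normal(0.0, 1.0, 1000)
observed_mean = sum(xs) / len(xs)  # should sit near the theoretical mean 0
```

Keeping the checks at the function boundary means a bad `std` fails loudly at the call site instead of producing a silently wrong histogram later.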
Minimal concrete example (pseudocode):
```
read expression
tokenize with precedence rules
if token sequence invalid -> return syntax error
evaluate tree
if domain violation -> return bounded diagnostic
print value and confidence check
```
Common misconceptions
- “If it runs once, representation is correct.” -> false.
- “Type checks are enough without shape checks.” -> false.
Check-your-understanding questions
- Which invariant catches division-by-zero earliest?
- Why does shape validation belong at boundaries rather than only in core logic?
- Predict failure if tokenization ignores unary minus.
Check-your-understanding answers
- Domain check on denominator before operation execution.
- Boundary validation keeps errors local and diagnostic.
- Expressions like -2^2 get misinterpreted and produce wrong precedence behavior.
Real-world applications: Feature preprocessing, model-serving input validation, and experiment-tracking schema enforcement.
Where you’ll apply it: This project and every downstream project in the sprint.
References
- CSAPP (Bryant & O’Hallaron), floating-point chapter
- Math for Programmers (Paul Orland), representation-oriented chapters
Key insight: Correct representation reduces the complexity of every later decision.
Summary: Stable ML math implementations start with explicit contracts, not implicit assumptions.
Homework/Exercises
- Write five invariants for your project.
- Build a failing test input for each invariant.
Solutions
- Include at least one shape, one domain, one convergence, one reproducibility, and one output-range invariant.
- Each failing input should trigger exactly one diagnostic to keep root-cause analysis clean.
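One way to structure the exercise, with hypothetical `check_*` helper names: each probe input should trip exactly one invariant, so a failing test points at a single root cause.

```python
def check_domain(std):
    # Domain invariant: standard deviation must be strictly positive.
    if std <= 0:
        raise ValueError("std must be > 0")

def check_shape(samples, n):
    # Shape invariant: the sampler must return exactly n values.
    if len(samples) != n:
        raise ValueError("sample count mismatch")

def triggers(fn, *args):
    # Returns True when the probe input raises the expected diagnostic.
    try:
        fn(*args)
    except ValueError:
        return True
    return False

domain_fails = triggers(check_domain, -1.0)      # probe: negative std
shape_fails = triggers(check_shape, [1, 2, 3], 5)  # probe: short sample list
```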
3. Build Blueprint
- Scope the smallest end-to-end slice that produces visible output.
- Add deterministic tests and edge-case probes.
- Layer complexity only after baseline behavior is stable.
- Add metrics logging before optimization.
- Run failure drills: perturb inputs, scale values, and check stability.
4. Real-World Outcome (Target)
```
$ python distributions.py normal --mean=0 --std=1 --n=10000
Generating 10,000 samples from Normal(μ=0, σ=1)
Sample statistics:
  Mean:     0.003  (theoretical: 0)
  Std Dev:  1.012  (theoretical: 1)
  Skewness: 0.021  (theoretical: 0)
[Histogram with overlaid theoretical normal curve]
[68% of samples within ±1σ, 95% within ±2σ, 99.7% within ±3σ]

$ python distributions.py poisson --lambda=5 --n=10000
Generating 10,000 samples from Poisson(λ=5)
[Bar chart of counts 0,1,2,3... with theoretical probabilities overlaid]
P(X=5) observed: 0.172, theoretical: 0.175 ✓
```
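The 68-95-99.7 coverage lines in the target output can be checked directly; a minimal sketch using only the standard library:

```python
import random

random.seed(42)
samples = [random.gauss(0.0, 1.0) for _ in range(10_000)]

def coverage(xs, k):
    # Fraction of samples within k standard deviations of the true mean 0.
    return sum(1 for x in xs if abs(x) <= k) / len(xs)

within_1 = coverage(samples, 1)  # expect roughly 0.683
within_2 = coverage(samples, 2)  # expect roughly 0.954
within_3 = coverage(samples, 3)  # expect roughly 0.997
```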
Implementation Hints: Box-Muller for normal: if U1, U2 are independent Uniform(0,1) draws:
```
z1 = sqrt(-2 * log(u1)) * cos(2 * pi * u2)
z2 = sqrt(-2 * log(u1)) * sin(2 * pi * u2)
```
Then z1 and z2 are independent standard normal samples.
For Poisson(λ): either accumulate the PMF term by term until the running total exceeds a uniform draw (inverse transform), or multiply uniform draws together until the product falls below e^(-λ) (Knuth’s algorithm).
Learning milestones:
- Histogram matches theoretical distribution → You understand sampling
- Sample statistics match theoretical values → You understand expected value
- Central Limit Theorem demonstrated → You understand why normal is everywhere
5. Core Design Notes from Main Guide
Core Question
Why does the same bell curve appear everywhere in nature?
From heights of people to measurement errors to stock price changes, the normal distribution emerges again and again. This project helps you understand why: the Central Limit Theorem says that averages of any distribution tend toward normal. By sampling from various distributions and watching their averages converge to normality, you’ll witness one of mathematics’ most beautiful theorems.
Concepts You Must Understand First
Stop and research these before coding:
- Probability Density Functions (PDFs) vs Probability Mass Functions (PMFs)
  - What’s the difference between continuous and discrete distributions?
  - Why does a PDF give probability density rather than probability?
  - How do you get P(a < X < b) from a PDF?
  - Book Reference: “Think Stats” Chapter 3 - Allen Downey
- The Normal (Gaussian) Distribution
  - What do the parameters mu and sigma mean geometrically?
  - What is the 68-95-99.7 rule?
  - Why is the normal distribution called “normal”?
  - What is the standard normal distribution Z ~ N(0,1)?
  - Book Reference: “All of Statistics” Chapter 3 - Larry Wasserman
- The Box-Muller Transform
  - How do you generate normally distributed numbers from uniformly distributed ones?
  - Why does this magical formula work: Z = sqrt(-2ln(U1)) * cos(2pi*U2)?
  - What is the polar form of Box-Muller, and why is it more efficient?
  - Book Reference: “Numerical Recipes” Chapter 7 - Press et al.
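As a reference point for the polar-form question, here is a minimal sketch of Marsaglia’s polar variant: it rejection-samples a point in the unit disk, trading occasionally rejected pairs for avoiding the cos/sin calls of the basic form.

```python
import math
import random

def polar_normal():
    # Rejection step: keep sampling until the point lands strictly
    # inside the unit disk (excluding the origin, to protect log).
    while True:
        u = random.uniform(-1.0, 1.0)
        v = random.uniform(-1.0, 1.0)
        s = u * u + v * v
        if 0.0 < s < 1.0:
            factor = math.sqrt(-2.0 * math.log(s) / s)
            return u * factor, v * factor  # two independent N(0,1) values

random.seed(1)
zs = [z for _ in range(5000) for z in polar_normal()]
mean = sum(zs) / len(zs)
var = sum((z - mean) ** 2 for z in zs) / len(zs)
```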
- Moments of Distributions
  - What are the first four moments (mean, variance, skewness, kurtosis)?
  - How do you estimate moments from samples?
  - What do skewness and kurtosis tell you about a distribution’s shape?
  - Book Reference: “All of Statistics” Chapter 3 - Larry Wasserman
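Estimating moments from samples can be sketched in a few lines; `sample_moments` is a hypothetical helper name, and the bias-corrected estimators are skipped for brevity:

```python
import math
import random

def sample_moments(xs):
    # First four sample moments: mean, variance, skewness, excess kurtosis.
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    std = math.sqrt(var)
    skew = sum(((x - mean) / std) ** 3 for x in xs) / n
    kurt = sum(((x - mean) / std) ** 4 for x in xs) / n - 3.0  # excess kurtosis
    return mean, var, skew, kurt

random.seed(7)
xs = [random.expovariate(1.0) for _ in range(20_000)]
mean, var, skew, kurt = sample_moments(xs)
# Exponential(1): mean 1, variance 1, skewness 2, excess kurtosis 6.
```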
- The Central Limit Theorem
  - What does the CLT actually state?
  - Why do averages of non-normal distributions become normal?
  - How fast does convergence to normality happen?
  - Book Reference: “Think Stats” Chapter 6 - Allen Downey
- The Poisson Distribution
  - When do we use Poisson? (Counts of rare events in fixed intervals)
  - What is the relationship between lambda and both mean and variance?
  - How is Poisson related to the binomial for large n, small p?
  - Book Reference: “All of Statistics” Chapter 2 - Larry Wasserman
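The binomial-to-Poisson relationship can be checked numerically; a sketch comparing exact PMFs for large n and small p:

```python
import math

def binom_pmf(k, n, p):
    # Exact binomial probability mass.
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    # Poisson probability mass.
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Binomial(n=1000, p=0.005) should sit very close to Poisson(lambda=5).
n, p = 1000, 0.005
lam = n * p
max_gap = max(abs(binom_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(20))
```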
Questions to Guide Your Design
Before implementing, think through these:
- Random Number Foundation: All your distributions will be built from uniform random numbers. How will you generate U ~ Uniform(0, 1)?
- Distribution Interface: What common interface should all your distributions have? Parameters, sample(), pdf(), mean(), variance()?
- Box-Muller Implementation: Will you use the basic form or the polar (rejection) form? Why might you choose one over the other?
- Histogram Binning: How many bins should your histogram have? How do you determine bin edges? What is the Freedman-Diaconis rule?
- PDF Overlay: How will you overlay the theoretical PDF on your histogram? Remember that the two must share a scale: either scale the PDF up to match raw counts, or plot the histogram with density=True.
- CLT Demonstration: How will you show the CLT in action? Repeatedly take means of samples from a non-normal distribution?
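For the binning question, here is a sketch of the Freedman-Diaconis rule using only the standard library. The quartile indexing is a rough approximation rather than full quantile interpolation, and `freedman_diaconis_bins` is a hypothetical helper name:

```python
import math
import random

def freedman_diaconis_bins(xs):
    # Bin width h = 2 * IQR / n^(1/3); bin count = data range / h.
    ys = sorted(xs)
    n = len(ys)
    q1 = ys[n // 4]          # approximate first quartile
    q3 = ys[(3 * n) // 4]    # approximate third quartile
    h = 2.0 * (q3 - q1) / n ** (1.0 / 3.0)
    if h == 0:
        return 1  # degenerate data: all values in one bin
    return max(1, math.ceil((ys[-1] - ys[0]) / h))

random.seed(3)
data = [random.gauss(0, 1) for _ in range(10_000)]
n_bins = freedman_diaconis_bins(data)  # typically a few dozen bins here
```

NumPy offers the same rule as `bins='fd'` in histogram functions if you prefer not to hand-roll it.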
Thinking Exercise
Verify the Box-Muller transform by hand:
The Box-Muller transform says: if U1, U2 ~ Uniform(0,1), then:
- Z1 = sqrt(-2 * ln(U1)) * cos(2 * pi * U2)
- Z2 = sqrt(-2 * ln(U1)) * sin(2 * pi * U2)
are independent standard normal random variables.
Test with specific values: Let U1 = 0.3, U2 = 0.7
- Compute R = sqrt(-2 * ln(0.3)) = sqrt(-2 * (-1.204)) = sqrt(2.408) = ?
- Compute theta = 2 * pi * 0.7 = ?
- Z1 = R * cos(theta) = ?
- Z2 = R * sin(theta) = ?
Now think: If you generate 10,000 (Z1, Z2) pairs and plot them, what shape should you see?
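A short script can check your hand arithmetic; it asserts the circle identity Z1^2 + Z2^2 = R^2 rather than printing the blanked-out answers, so the exercise stays an exercise:

```python
import math

u1, u2 = 0.3, 0.7
r = math.sqrt(-2.0 * math.log(u1))   # the R you computed by hand
theta = 2.0 * math.pi * u2           # the angle theta
z1 = r * math.cos(theta)
z2 = r * math.sin(theta)

# The pair (Z1, Z2) lies on a circle of radius R around the origin.
radius_sq = z1 * z1 + z2 * z2
```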
Central Limit Theorem exercise: Take 100 samples from an exponential distribution (highly skewed, not normal at all). Compute their mean. Repeat this 1000 times to get 1000 means. Plot a histogram of these means. What shape do you see?
Interview Questions
- “Explain the Central Limit Theorem and why it matters.”
- Expected: Sample means converge to normal distribution regardless of original distribution. It’s why normal appears everywhere and why we can do statistical inference.
- “How would you generate samples from a normal distribution using only uniform random numbers?”
- Expected: Box-Muller transform. Can also mention inverse CDF method for general distributions.
- “What’s the difference between the Normal and Standard Normal distribution?”
- Expected: Standard normal has mean 0, std 1. Any normal X can be standardized: Z = (X - mu) / sigma.
- “When would you use a Poisson distribution vs a Normal distribution?”
- Expected: Poisson for counts of rare events (discrete, non-negative). Normal for continuous measurements, or as approximation when Poisson lambda is large.
- “How do you test if data follows a specific distribution?”
- Expected: Q-Q plots, Kolmogorov-Smirnov test, chi-squared goodness of fit, Shapiro-Wilk for normality.
- “What is the variance of the sample mean?”
- Expected: Var(sample mean) = sigma^2 / n. This is why larger samples give more precise estimates.
- “Explain the relationship between the exponential and Poisson distributions.”
- Expected: If events arrive according to Poisson process, inter-arrival times are exponential.
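The variance-of-the-sample-mean answer above can be verified empirically; a sketch using Exponential(1), whose variance is 1:

```python
import random
import statistics

random.seed(11)
sigma2 = 1.0  # variance of Exponential(1)
n = 100

# Draw many independent sample means of size n and measure their spread.
means = []
for _ in range(2000):
    xs = [random.expovariate(1.0) for _ in range(n)]
    means.append(sum(xs) / n)

var_of_mean = statistics.pvariance(means)
expected = sigma2 / n  # theory predicts 0.01
```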
Hints in Layers (Treat as pseudocode guidance)
Hint 1: Start with Uniform. Python’s random.random() gives U ~ Uniform(0, 1) (more precisely, a float in [0, 1)). This is your building block for everything else.
Hint 2: Box-Muller Implementation
```python
import math
import random

def standard_normal():
    u1 = random.random()
    u2 = random.random()
    while u1 == 0.0:  # random.random() can return exactly 0.0; log(0) is undefined
        u1 = random.random()
    return math.sqrt(-2 * math.log(u1)) * math.cos(2 * math.pi * u2)
```
For a general Normal(mu, sigma): mu + sigma * standard_normal()
Hint 3: Poisson Sampling. Use Knuth’s product-of-uniforms algorithm (each uniform factor plays the role of an exponential inter-arrival step):
```python
import math
import random

def poisson(lam):
    L = math.exp(-lam)
    k = 0
    p = 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1
```
Hint 4: Histogram with PDF Overlay
```python
import random

import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

samples = [random.gauss(0, 1) for _ in range(10_000)]  # or your own sampler from Hint 2
plt.hist(samples, bins=50, density=True, alpha=0.7)    # density=True puts bars on the PDF scale
x = np.linspace(-4, 4, 100)
plt.plot(x, stats.norm.pdf(x), 'r-', linewidth=2)
plt.show()
```
Hint 5: CLT Demonstration
```python
import random

import matplotlib.pyplot as plt

# Take means of exponential samples
means = []
for _ in range(1000):
    samples = [random.expovariate(1.0) for _ in range(100)]
    means.append(sum(samples) / len(samples))
plt.hist(means, bins=50, density=True)
plt.show()
# The histogram looks normal even though the exponential is highly skewed!
```
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Probability Distributions Overview | “Think Stats” | Chapter 3 - Allen Downey |
| Normal Distribution Deep Dive | “All of Statistics” | Chapter 3 - Larry Wasserman |
| Random Variate Generation | “Numerical Recipes” | Chapter 7 - Press et al. |
| Box-Muller and Transforms | “The Art of Computer Programming Vol 2” | Section 3.4 - Donald Knuth |
| Central Limit Theorem | “All of Statistics” | Chapter 5 - Larry Wasserman |
| Poisson and Exponential | “Introduction to Probability” | Chapter 6 - Blitzstein & Hwang |
6. Validation, Pitfalls, and Completion
Common Pitfalls and Debugging
Problem 1: “Outputs drift after a few iterations”
- Why: Hidden numerical instability (unscaled features, aggressive step size, or repeated subtraction of nearly equal values).
- Fix: Normalize inputs, reduce step size, and track relative error rather than only absolute error.
- Quick test: Run the same task with two scales of input (for example x and 10x) and compare normalized error curves.
Problem 2: “Results are inconsistent across runs”
- Why: Random seeds, data split randomness, or non-deterministic ordering are uncontrolled.
- Fix: Set seeds, log configuration, and store split indices and hyperparameters with each run.
- Quick test: Re-run three times with the same seed and confirm metrics remain inside a tight tolerance band.
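The same-seed quick test can be automated; a minimal sketch using a dedicated random.Random instance per run, so the seed fully determines the sample stream:

```python
import random

def run_experiment(seed):
    # Seeding a private generator makes the samples, and hence every
    # downstream metric, exactly repeatable across runs.
    rng = random.Random(seed)
    samples = [rng.gauss(0, 1) for _ in range(1000)]
    return sum(samples) / len(samples)

runs = [run_experiment(42) for _ in range(3)]  # three runs, same seed
```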
Problem 3: “The project works on the demo case but fails on edge cases”
- Why: Tests only cover happy-path inputs.
- Fix: Add adversarial inputs (empty values, extreme ranges, near-singular matrices, rare classes).
- Quick test: Build an edge-case test matrix and ensure every scenario reports expected behavior.
Definition of Done
- Core functionality works on reference inputs
- Edge cases are tested and documented
- Results are reproducible (seeded and versioned configuration)
- Performance or convergence behavior is measured and explained
- A short retrospective explains what failed first and how you fixed it
7. Extension Ideas
- Add a stress-test mode with adversarial inputs.
- Add a short benchmark report (runtime + memory + error trend).
- Add a reproducibility bundle (seed, config, and fixed test corpus).
8. Why This Project Matters
This project is valuable because it creates observable evidence of mathematical reasoning under real implementation constraints.