Project 15: A/B Testing Framework
A statistical testing framework that analyzes A/B test results: it computes p-values and confidence intervals and reports whether the observed difference is statistically significant.
Quick Reference
| Attribute | Value |
|---|---|
| Difficulty | Level 2: Intermediate (The Developer) |
| Main Programming Language | Python |
| Alternative Programming Languages | R, JavaScript, Go |
| Coolness Level | Level 3: Genuinely Clever |
| Business Potential | 3. The “Service & Support” Model (B2B Utility) |
| Knowledge Area | Hypothesis Testing / Statistics |
| Software or Tool | A/B Testing Tool |
| Main Book | “Think Stats” by Allen Downey |
1. Learning Objectives
By completing this project, you will:
- Translate math definitions into deterministic implementation steps.
- Build validation checks that make correctness observable.
- Diagnose numerical, logical, and data-shape failures early.
- Explain tradeoffs in interviews using evidence from your own build.
2. All Theory Needed (Per-Concept Breakdown)
This project applies the following theory clusters:
- Symbolic-to-numeric translation (expressions, data shapes, invariants)
- Stability constraints (precision, scaling, stopping criteria)
- Optimization or inference logic (depending on project objective)
- Evaluation discipline (error analysis, test coverage, reproducibility)
Concept A: Mathematical Representation Discipline
Fundamentals
A math expression is not executable until you define representation, ordering, and domain constraints. The same equation can be represented as a token stream, tree, matrix pipeline, or probability graph. Choosing the representation determines what bugs you can catch early.
Deep Dive into the concept
Most project failures begin before algorithm selection: they start with ambiguous representation. If your parser cannot distinguish unary minus from subtraction, your calculator fails. If your matrix dimensions are implicit rather than validated, your linear algebra pipeline fails silently. If your probabilistic assumptions (independence, stationarity, or class priors) are not explicit, your inference can look accurate on one split and collapse on another. The core implementation move is to treat representation as a contract. Define each object with its shape, domain, and semantic intent. Then enforce invariants at the boundaries: input parser, preprocessing, training loop, evaluation stage. This makes debugging local instead of global.
How this fits this project
You will encode each operation with explicit contracts and invariant checks.
Definitions & key terms
- Invariant: Property that must hold before and after each operation.
- Shape contract: Expected dimensional structure of vectors/matrices/tensors.
- Domain constraint: Allowed value range (for example log input > 0).
Mental model diagram
User Input -> Representation Layer -> Validated Operation -> Observable Output
              (tokens/shapes)         (invariants pass)      (tests/plots/logs)
How it works
- Parse/ingest data into typed structures.
- Validate shape/domain invariants.
- Execute operation.
- Compare observed output with expected behavior.
- Record failure signature if mismatch appears.
Minimal concrete example
PSEUDOCODE
read expression
tokenize with precedence rules
if token sequence invalid -> return syntax error
evaluate tree
if domain violation -> return bounded diagnostic
print value and confidence check
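Applied to this project, the pseudocode above becomes a boundary contract on the raw counts. A minimal sketch, assuming a dataclass-based representation (the `GroupResult` name and fields are illustrative, not prescribed by the guide):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GroupResult:
    """One arm of an A/B test: a shape/domain contract, not just a tuple."""
    samples: int
    conversions: int

    def __post_init__(self):
        # Domain invariants are checked at the boundary,
        # before any statistics run on the data
        if self.samples <= 0:
            raise ValueError("samples must be positive")
        if not 0 <= self.conversions <= self.samples:
            raise ValueError("conversions must lie in [0, samples]")

    @property
    def rate(self) -> float:
        return self.conversions / self.samples

control = GroupResult(samples=10_000, conversions=312)
print(f"control rate: {control.rate:.4f}")  # 0.0312
```

Because the invariants live in the constructor, every downstream function can assume valid counts, which keeps debugging local.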
Common misconceptions
- “If it runs once, representation is correct.” -> false.
- “Type checks are enough without shape checks.” -> false.
Check-your-understanding questions
- Which invariant catches division-by-zero earliest?
- Why does shape validation belong at boundaries rather than only in core logic?
- Predict failure if tokenization ignores unary minus.
Check-your-understanding answers
- Domain check on denominator before operation execution.
- Boundary validation keeps errors local and diagnostic.
- Expressions like -2^2 get misinterpreted and produce wrong precedence behavior.
Real-world applications
Feature preprocessing, model-serving input validation, and experiment-tracking schema enforcement.
Where you’ll apply it
This project and every downstream project in the sprint.
References
- CSAPP (Bryant & O’Hallaron), floating-point chapter
- Math for Programmers (Paul Orland), representation-oriented chapters
Key insight
Correct representation reduces the complexity of every later decision.
Summary
Stable ML math implementations start with explicit contracts, not implicit assumptions.
Homework/Exercises
- Write five invariants for your project.
- Build a failing test input for each invariant.
Solutions
- Include at least one shape, one domain, one convergence, one reproducibility, and one output-range invariant.
- Each failing input should trigger exactly one diagnostic to keep root-cause analysis clean.
3. Build Blueprint
- Scope the smallest end-to-end slice that produces visible output.
- Add deterministic tests and edge-case probes.
- Layer complexity only after baseline behavior is stable.
- Add metrics logging before optimization.
- Run failure drills: perturb inputs, scale values, and check stability.
4. Real-World Outcome (Target)
$ python ab_test.py results.csv
A/B Test Analysis
=================
Control (A):
  Samples:     10,000
  Conversions: 312 (3.12%)
Treatment (B):
  Samples:     10,000
  Conversions: 378 (3.78%)
Relative improvement: +21.2%
Statistical Analysis:
  Difference: 0.66 percentage points
  95% Confidence Interval: [0.15%, 1.17%]
  p-value: 0.0106
Interpretation:
  ✓ Result is statistically significant (p < 0.05)
  ✓ Confidence interval doesn't include 0
Recommendation: Treatment B is a WINNER.
An observed difference this large would occur only about 1% of the time
if A and B truly performed the same.
Power analysis:
  To detect a 10% relative improvement from this baseline with 80% power,
  you would need ~51,000 samples per group.
Implementation Hints: For proportions (conversion rates), use a z-test:
p1 = conversions_A / samples_A
p2 = conversions_B / samples_B
p_pooled = (conversions_A + conversions_B) / (samples_A + samples_B)
se = sqrt(p_pooled * (1-p_pooled) * (1/samples_A + 1/samples_B))
z = (p2 - p1) / se
# p-value from standard normal CDF
Confidence interval: (p2 - p1) ± 1.96 * se for 95% CI.
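These formulas can be checked directly against the counts in the sample run above; a quick sketch, using the standard library's `statistics.NormalDist` in place of a scipy dependency:

```python
import math
from statistics import NormalDist

# Counts from the sample run above
conversions_A, samples_A = 312, 10_000
conversions_B, samples_B = 378, 10_000

p1 = conversions_A / samples_A
p2 = conversions_B / samples_B
# Pooled proportion under the null hypothesis
p_pooled = (conversions_A + conversions_B) / (samples_A + samples_B)
se = math.sqrt(p_pooled * (1 - p_pooled) * (1/samples_A + 1/samples_B))
z = (p2 - p1) / se

# Two-tailed p-value from the standard normal CDF
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

Running the formulas yourself like this is the fastest way to confirm that your framework's output is internally consistent.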
Learning milestones:
- p-value computed correctly → You understand hypothesis testing
- Confidence intervals are correct → You understand uncertainty
- You can explain what p-value actually means → You’ve avoided common misconceptions
5. Core Design Notes from Main Guide
Core Question
How do we distinguish real effects from random noise?
Every day, companies run experiments: does the new button color increase clicks? Does the new algorithm improve engagement? But even with no real difference, random variation will make one group look better. A/B testing gives us the mathematical machinery to answer: “Is this difference real, or could it have happened by chance?” This is the foundation of evidence-based decision making.
Concepts You Must Understand First
Stop and research these before coding:
- Null and Alternative Hypotheses
- What is the null hypothesis in an A/B test? (No difference between groups)
- What is the alternative hypothesis? (There is a difference)
- Why do we try to reject the null rather than prove the alternative?
- Book Reference: “Think Stats” Chapter 7 - Allen Downey
- The p-value
- What does a p-value actually measure?
- Why is p-value NOT the probability that the null hypothesis is true?
- What does “statistically significant at alpha = 0.05” mean?
- Why is the threshold 0.05 arbitrary?
- Book Reference: “Statistics Done Wrong” Chapter 1 - Alex Reinhart
- Type I and Type II Errors
- What is a false positive (Type I error)?
- What is a false negative (Type II error)?
- Why can’t we minimize both simultaneously?
- What is the relationship between alpha and Type I error rate?
- Book Reference: “All of Statistics” Chapter 10 - Larry Wasserman
- The t-test and z-test
- When do you use z-test vs t-test?
- What is the test statistic measuring?
- Why does sample size affect the test statistic?
- What assumptions does the t-test make?
- Book Reference: “Think Stats” Chapter 9 - Allen Downey
- Confidence Intervals
- What does a “95% confidence interval” actually mean?
- How is a confidence interval related to hypothesis testing?
- Why does the interval get narrower with more samples?
- Book Reference: “All of Statistics” Chapter 6 - Larry Wasserman
- Statistical Power and Sample Size
- What is statistical power? (Probability of detecting a real effect)
- Why is 80% power a common target?
- How do you calculate required sample size for a desired power?
- What is the relationship between effect size, sample size, and power?
- Book Reference: “Statistics Done Wrong” Chapter 4 - Alex Reinhart
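The effect-size/sample-size/power relationship can be made concrete with a small sketch. This assumes the usual normal-approximation power formula for a two-sided two-proportion test; the function name is illustrative:

```python
import math
from statistics import NormalDist

def approx_power(p1: float, p2: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided two-proportion z-test, n per group."""
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Probability the test statistic clears the critical value under H1
    return NormalDist().cdf(abs(p2 - p1) / se - z_alpha)

# 5.0% vs 6.5% conversion with 1,000 visitors per group: power is only ~30%
print(f"{approx_power(0.05, 0.065, 1000):.2f}")
```

Try varying `n` to see why underpowered tests miss real effects far more often than the 5% false-positive rate suggests.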
Questions to Guide Your Design
Before implementing, think through these:
- Input Format: How will you accept A/B test data? Two lists of outcomes? A CSV with group labels?
- Test Selection: Will you implement both z-test (for proportions) and t-test (for means)? How will you choose which to use?
- Two-tailed vs One-tailed: Will you support both? What’s the difference in p-value calculation?
- Confidence Interval Method: Will you use the normal approximation or exact methods? When might the approximation fail?
- Sample Size Calculator: How will you implement power analysis? What inputs do you need (baseline rate, minimum detectable effect, power, alpha)?
- Multiple Testing: What happens when someone runs many A/B tests? How would you handle the multiple comparison problem?
Thinking Exercise
Work through a complete A/B test by hand:
An e-commerce site runs an A/B test on a new checkout button:
- Control (A): 1000 visitors, 50 conversions (5.0% conversion rate)
- Treatment (B): 1000 visitors, 65 conversions (6.5% conversion rate)
Step 1: State the hypotheses
- H0: p_A = p_B (no difference)
- H1: p_A != p_B (there is a difference)
Step 2: Compute the pooled proportion
p_pooled = (50 + 65) / (1000 + 1000) = ?
Step 3: Compute the standard error
SE = sqrt(p_pooled * (1 - p_pooled) * (1/n_A + 1/n_B))
= sqrt(? * ? * (1/1000 + 1/1000))
= ?
Step 4: Compute the z-statistic
z = (p_B - p_A) / SE
= (0.065 - 0.050) / ?
= ?
Step 5: Find the p-value
Using a standard normal table or calculator:
- For a two-tailed test: p-value = 2 * P(Z > |z|) = ?
Step 6: Make a decision
- If p-value < 0.05, reject H0
- Is the result statistically significant?
Step 7: Compute confidence interval
CI = (p_B - p_A) +/- 1.96 * SE
= ? +/- ?
= [?, ?]
Does this interval include 0? Does that match your hypothesis test conclusion?
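Once you have filled in the blanks, a standard-library sketch like this can check your hand computation; it follows the exercise's own formulas step by step:

```python
import math
from statistics import NormalDist

n_A, conv_A = 1000, 50
n_B, conv_B = 1000, 65
p_A, p_B = conv_A / n_A, conv_B / n_B

# Step 2: pooled proportion
p_pooled = (conv_A + conv_B) / (n_A + n_B)
# Step 3: standard error
SE = math.sqrt(p_pooled * (1 - p_pooled) * (1/n_A + 1/n_B))
# Step 4: z-statistic
z = (p_B - p_A) / SE
# Step 5: two-tailed p-value
p_value = 2 * (1 - NormalDist().cdf(abs(z)))
# Step 7: confidence interval (per the exercise's formula)
ci = (p_B - p_A - 1.96 * SE, p_B - p_A + 1.96 * SE)

print(f"z = {z:.2f}, p-value = {p_value:.3f}, "
      f"CI = [{ci[0]:.4f}, {ci[1]:.4f}]")
```

Note that with these numbers the p-value comes out around 0.15, well above 0.05, and the interval includes 0, so the two conclusions agree: a 1.5-point lift on 1,000 visitors per group is not yet distinguishable from noise.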
Interview Questions
- “Explain what a p-value is.”
- Expected: The probability of observing results as extreme as ours (or more extreme), assuming the null hypothesis is true. NOT the probability that the null is true.
- “What is the difference between statistical significance and practical significance?”
- Expected: Statistical significance means unlikely due to chance. Practical significance means the effect size matters for business. A tiny effect can be statistically significant with large samples.
- “How do you determine sample size for an A/B test?”
- Expected: Power analysis. Need baseline conversion rate, minimum detectable effect, desired power (usually 80%), and significance level (usually 0.05).
- “What is p-hacking and how do you avoid it?”
- Expected: Testing multiple hypotheses until finding significance, or stopping early when significance is reached. Avoid by pre-registering hypotheses, using correction for multiple comparisons, and fixed sample sizes.
- “What happens if you run 20 A/B tests with no real effects?”
- Expected: At alpha = 0.05, expect 1 false positive on average. This is the multiple testing problem. Use Bonferroni correction or False Discovery Rate methods.
- “Can you reject the null hypothesis with a small sample?”
- Expected: Yes, if the effect is very large. But small samples give wide confidence intervals. Statistical power is low, so you might miss real effects.
- “What’s the difference between a confidence interval and a credible interval?”
- Expected: Confidence interval is frequentist: in repeated experiments, 95% of intervals contain the true value. Credible interval is Bayesian: there’s 95% probability the true value is in this interval.
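The multiple-testing answer above can be demonstrated by simulation. A sketch, seeded for reproducibility; all names are illustrative:

```python
import math
import random
from statistics import NormalDist

random.seed(42)

def null_ab_test(n=1000, rate=0.05):
    """Simulate one A/B test where both arms share the same true rate."""
    conv_a = sum(random.random() < rate for _ in range(n))
    conv_b = sum(random.random() < rate for _ in range(n))
    p_pooled = (conv_a + conv_b) / (2 * n)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (2 / n))
    if se == 0:
        return 1.0  # no variation at all: nothing to detect
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# 20 tests with NO real effect anywhere
p_values = [null_ab_test() for _ in range(20)]
naive = sum(p < 0.05 for p in p_values)
bonferroni = sum(p < 0.05 / 20 for p in p_values)
print(f"naive 'significant' results: {naive}, after Bonferroni: {bonferroni}")
```

Re-run with different seeds: on average about one of the twenty null tests will clear alpha = 0.05, while the Bonferroni threshold (0.05 / 20) almost never triggers.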
Hints in Layers (Treat as pseudocode guidance)
Hint 1: Basic Test Structure For proportion tests, you need: number of successes and total trials for each group. From these, compute sample proportions.
Hint 2: The Z-test for Proportions
import math

def z_test_proportions(successes_a, n_a, successes_b, n_b):
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under null hypothesis
    p_pooled = (successes_a + successes_b) / (n_a + n_b)
    # Standard error
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1/n_a + 1/n_b))
    # Z statistic
    z = (p_b - p_a) / se
    return z
Hint 3: P-value from Z-score
from scipy import stats
# Two-tailed p-value
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
# Or equivalently
p_value = 2 * stats.norm.sf(abs(z))
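If scipy is not available, the standard library's `statistics.NormalDist` computes the same CDF:

```python
from statistics import NormalDist

def p_value_from_z(z: float) -> float:
    """Two-tailed p-value using only the standard library."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

print(round(p_value_from_z(1.96), 4))  # ≈ 0.05, as expected at the critical value
```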
Hint 4: Confidence Interval
def confidence_interval(p_a, p_b, n_a, n_b, confidence=0.95):
    # Standard error for difference (not pooled)
    se = math.sqrt(p_a*(1-p_a)/n_a + p_b*(1-p_b)/n_b)
    # Z critical value
    z_crit = stats.norm.ppf((1 + confidence) / 2)
    diff = p_b - p_a
    margin = z_crit * se
    return (diff - margin, diff + margin)
Hint 5: Sample Size Calculation
def required_sample_size(baseline, mde, power=0.8, alpha=0.05):
    """
    baseline: current conversion rate (e.g., 0.05)
    mde: minimum detectable effect (relative, e.g., 0.1 for 10% lift)
    Returns the required sample size per group.
    """
    p1 = baseline
    p2 = baseline * (1 + mde)
    # Z values
    z_alpha = stats.norm.ppf(1 - alpha/2)
    z_beta = stats.norm.ppf(power)
    # Average proportion for the null-hypothesis variance term
    p_avg = (p1 + p2) / 2
    # Standard two-proportion formula; this n is already per group
    # (no extra leading factor of 2)
    n = ((z_alpha * math.sqrt(2*p_avg*(1-p_avg)) +
          z_beta * math.sqrt(p1*(1-p1) + p2*(1-p2))) ** 2) / (p2 - p1) ** 2
    return math.ceil(n)
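For checking your implementation without a scipy dependency, here is a stdlib-only variant; it uses the standard two-proportion formula, which yields the per-group n directly:

```python
import math
from statistics import NormalDist

def required_sample_size(baseline, mde, power=0.8, alpha=0.05):
    """Per-group sample size for a two-sided two-proportion z-test."""
    p1, p2 = baseline, baseline * (1 + mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_avg = (p1 + p2) / 2
    n = ((z_alpha * math.sqrt(2 * p_avg * (1 - p_avg)) +
          z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2) \
        / (p2 - p1) ** 2
    return math.ceil(n)

# e.g. a 5% baseline with a 30% relative lift needs roughly 3,800 per group;
# smaller effects need dramatically more samples
print(required_sample_size(0.05, 0.3))
```

Note how the required n grows roughly with the inverse square of the effect size: halving the minimum detectable effect about quadruples the sample size.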
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Hypothesis Testing Foundations | “Think Stats” | Chapter 7 - Allen Downey |
| Common Statistical Mistakes | “Statistics Done Wrong” | All Chapters - Alex Reinhart |
| Confidence Intervals | “All of Statistics” | Chapter 6 - Larry Wasserman |
| Power Analysis | “Statistics Done Wrong” | Chapter 4 - Alex Reinhart |
| The t-test in Depth | “Think Stats” | Chapter 9 - Allen Downey |
| Multiple Testing | “All of Statistics” | Chapter 10 - Larry Wasserman |
| Bayesian A/B Testing | “Think Bayes” | Chapter 8 - Allen Downey |
6. Validation, Pitfalls, and Completion
Common Pitfalls and Debugging
Problem 1: “Outputs drift after a few iterations”
- Why: Hidden numerical instability (unscaled features, aggressive step size, or repeated subtraction of nearly equal values).
- Fix: Normalize inputs, reduce step size, and track relative error rather than only absolute error.
- Quick test: Run the same task with two scales of input (for example x and 10x) and compare normalized error curves.
Problem 2: “Results are inconsistent across runs”
- Why: Random seeds, data split randomness, or non-deterministic ordering are uncontrolled.
- Fix: Set seeds, log configuration, and store split indices and hyperparameters with each run.
- Quick test: Re-run three times with the same seed and confirm metrics remain inside a tight tolerance band.
Problem 3: “The project works on the demo case but fails on edge cases”
- Why: Tests only cover happy-path inputs.
- Fix: Add adversarial inputs (empty values, extreme ranges, near-singular matrices, rare classes).
- Quick test: Build an edge-case test matrix and ensure every scenario reports expected behavior.
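A sketch of what such an edge-case matrix can look like in code. The function mirrors Hint 2; the zero-variation guard is an assumption about how you might choose to report degenerate inputs:

```python
import math

def z_test_proportions(successes_a, n_a, successes_b, n_b):
    # Same as Hint 2, with a guard for the degenerate all-or-nothing case
    p_pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("no variation: pooled rate is 0 or 1")
    return (successes_b / n_b - successes_a / n_a) / se

# Edge-case matrix: each scenario documents the expected behavior
assert z_test_proportions(50, 1000, 50, 1000) == 0.0   # identical groups
try:
    z_test_proportions(0, 1000, 0, 1000)               # zero conversions
    assert False, "expected ValueError"
except ValueError:
    pass
try:
    z_test_proportions(1000, 1000, 1000, 1000)         # everyone converts
    assert False, "expected ValueError"
except ValueError:
    pass
```

Each scenario triggers exactly one diagnostic, which keeps root-cause analysis clean, just as the invariant homework in Section 2 recommends.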
Definition of Done
- Core functionality works on reference inputs
- Edge cases are tested and documented
- Results are reproducible (seeded and versioned configuration)
- Performance or convergence behavior is measured and explained
- A short retrospective explains what failed first and how you fixed it
7. Extension Ideas
- Add a stress-test mode with adversarial inputs.
- Add a short benchmark report (runtime + memory + error trend).
- Add a reproducibility bundle (seed, config, and fixed test corpus).
8. Why This Project Matters
This project is valuable because it creates observable evidence of mathematical reasoning under real implementation constraints.