
Learn A/B Testing & Experimentation Platforms: From Zero to Platform Architect

Goal: Deeply understand the engineering and statistics behind experimentation platforms—how to consistently assign users to treatments, track metrics at scale, and automate statistical rigor to turn raw data into confident product decisions. You will build a system that moves beyond “guessing” to a scientific, automated feedback loop.


Why Experimentation Platforms Matter

In 2007, an engineer at Google wondered if a different shade of blue would increase clicks on search results. They tested 41 different shades. The result? An extra $200 million in annual revenue.

But A/B testing isn’t just about colors. It’s the engine behind Amazon’s recommendation system, Netflix’s artwork selection, and Uber’s pricing algorithms. Without a robust platform, organizations fall into the “HiPPO” trap (Highest Paid Person’s Opinion). A platform democratizes data, allowing any idea to be tested against reality.

Building an experimentation platform is a unique engineering challenge that sits at the intersection of:

  • High-performance systems: Deciding a treatment in <5ms at the edge.
  • Distributed systems: Ensuring “sticky” assignments across millions of devices.
  • Statistics: Automating complex math so developers don’t have to be PhDs.
  • Data Engineering: Processing billions of events to calculate precise metrics.

Core Concept Analysis

The Anatomy of an Experiment

At its heart, an experimentation platform is a system that maps a User to a Treatment based on a Variable within a specific Context.

    [ User ID ] + [ Experiment ID ]
           |            |
           v            v
    +--------------------------+
    |    Deterministic Hash    | (e.g., MurmurHash3)
    +--------------------------+
           |
           v
    [ Numeric Bucket (0-99) ]
                      |
        +-------------+-------------+
        |             |             |
    [ 0-44 ]      [ 45-89 ]     [ 90-99 ]
        |             |             |
        v             v             v
   Treatment A   Treatment B    (Not in
    (Control)     (Variant)    Experiment)
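A minimal sketch of this flow in Python, using the standard library's hashlib in place of MurmurHash3 (any fast, uniform, deterministic hash works); the function names, salt format, and bucket ranges below are illustrative, not a fixed API:

```python
import hashlib
from typing import Optional

def bucket(user_id: str, experiment_id: str, buckets: int = 100) -> int:
    """Deterministically map (user, experiment) to a bucket in [0, buckets)."""
    # The experiment ID acts as a salt, so different experiments assign independently.
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    value = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return value % buckets

def assign(user_id: str, experiment_id: str) -> Optional[str]:
    b = bucket(user_id, experiment_id)
    if b < 45:
        return "CONTROL"   # buckets 0-44
    if b < 90:
        return "VARIANT"   # buckets 45-89
    return None            # buckets 90-99: not in the experiment

print(assign("user_123", "new_checkout_flow"))  # same input -> same output, every time
```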

The Platform Architecture

A mature platform consists of four major layers:

  1. Assignment (The Brain): Fast, deterministic logic that tells the client which version to show.
  2. Telemetry (The Nerves): High-throughput ingestion of “exposure” events (who saw what) and “metric” events (what they did).
  3. Analytics (The Digestion): Aggregation of events into cohorts (Group A vs. Group B).
  4. Statistics (The Judge): Comparison of groups to determine if the difference is “real” or just “noise” (p-values, confidence intervals).

CLIENT SIDE                BACKEND / DATA WAREHOUSE
+-------------+            +--------------------------+
|  SDK        |---(1)----> | Assignment Engine        |
|  (Decision) |            | (Feature Flags/Toggles)  |
+-------------+            +--------------------------+
      |
      | (2) Exposure Event ("User X saw Variant B")
      v
+-------------+            +--------------------------+
| Telemetry   |---(3)----> | Data Pipeline            |
| Ingestion   |            | (Kafka/Kinesis)          |
+-------------+            +--------------------------+
                                        |
                                        v
                           +--------------------------+
                           | Stats Engine             |
                           | (P-values, CI, Power)    |
                           +--------------------------+

Concept Summary Table

| Concept Cluster | What You Need to Internalize |
|---|---|
| Deterministic Hashing | Assignments must be “sticky” and uniform. Hashing (not random numbers) ensures the same user gets the same variant every time without storing state. |
| Exposure vs. Event | You can only analyze users who actually saw the experiment. Distinguishing “Exposures” from general “Events” is the foundation of valid analysis. |
| Statistical Significance | A measure of how unlikely the observed difference would be if there were no real effect (the null hypothesis). p < 0.05 is the industry standard but often misunderstood. |
| Feature Toggling | The ability to decouple deployment from release. Every experiment is a toggle, but not every toggle is an experiment. |
| SRM (Sample Ratio Mismatch) | A critical bug detector. If you expect a 50/50 split but get 48/52, your data is likely tainted. |

Deep Dive Reading by Concept

This section maps each concept to specific book chapters. Read these alongside the projects to build strong mental models.

Statistical Foundations

| Concept | Book & Chapter |
|---|---|
| The P-Value & Null Hypothesis | “Trustworthy Online Controlled Experiments” by Kohavi et al. — Ch. 17: “Statistics Behind Online Controlled Experiments” |
| Confidence Intervals | “Statistical Methods in Online A/B Testing” by Georgi Georgiev — Ch. 3: “Statistical Significance and Confidence” |
| Bayesian vs. Frequentist | “Experimentation for Engineers” by David Sweet — Ch. 8: “Bayesian Optimization” |

Engineering & Platform Design

| Concept | Book & Chapter |
|---|---|
| Assignment Mechanisms | “Trustworthy Online Controlled Experiments” by Kohavi et al. — Ch. 4: “Experimentation Platform and Analysis” |
| Feature Flags & Toggles | “Practical A/B Testing” by Leemay Nassery — Ch. 2: “The Engineering of Feature Flags” |
| Telemetry & Data Quality | “Trustworthy Online Controlled Experiments” by Kohavi et al. — Ch. 13: “Data Quality” |

Essential Reading Order

  1. The Fundamentals (Week 1):
    • Trustworthy Online Controlled Experiments Ch. 1 & 2 (Introduction and Strategy)
    • Experimentation for Engineers Ch. 1 & 2 (Basics of Experiments)
  2. The Math (Week 2):
    • Statistical Methods in Online A/B Testing Ch. 1-3 (Significance and Power)
  3. The Architecture (Week 3):
    • Practical A/B Testing Ch. 4 (Building vs. Buying)

Project List

Projects are ordered from fundamental understanding to advanced implementations.


Project 1: The Deterministic Allocator (The Heart of Sticky Assignments)

  • File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
  • Main Programming Language: Go
  • Alternative Programming Languages: Python, Rust, C++
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 1: Beginner
  • Knowledge Area: Distributed Systems / Hashing
  • Software or Tool: MurmurHash3
  • Main Book: “Trustworthy Online Controlled Experiments” by Kohavi et al.

What you’ll build: A library that takes a UserID and an ExperimentID and deterministically returns a bucket (0-99). You will then run a simulation to verify that 100,000 unique IDs are distributed uniformly across treatments.

Why it teaches A/B Testing: Most beginners think A/B testing uses rand(). This project teaches why rand() is forbidden. You’ll learn how to ensure a user stays in the same group across sessions without a database, and how to “salt” hashes to avoid correlation between different experiments.

Core challenges you’ll face:

  • Ensuring Uniformity → Maps to preventing biased splits (SRM).
  • Independence of Experiments → Maps to using Salts to ensure a user in Group A for Experiment 1 isn’t always in Group A for Experiment 2.
  • Performance → Maps to deciding assignments in microseconds.

Key Concepts:

  • Deterministic Hashing: [MurmurHash3 Algorithm - Wikipedia]
  • Bucketing Strategies: “Trustworthy Online Controlled Experiments” Ch. 4

Difficulty: Beginner. Time estimate: Weekend. Prerequisites: Basic programming, understanding of strings and integers.


Real World Outcome

You’ll have a CLI tool that can split users into groups and a validation script that proves the split is fair.

Example Output:

$ ./allocator --user "user_123" --experiment "new_checkout_flow"
User: user_123 | Experiment: new_checkout_flow | Bucket: 42 | Group: VARIANT_B

$ ./allocator --validate --users 100000 --split 50/50
Running simulation for 100,000 users...
[Results]
Group A (Control): 50,042 (50.04%)
Group B (Variant): 49,958 (49.96%)
Standard Deviation: 0.08%
Verdict: UNIFORM DISTRIBUTION VERIFIED

The Core Question You’re Answering

“How can I guarantee a user sees the same version of my app every time without saving their preference in a database?”

Before you write any code, realize that every database lookup adds latency. In high-scale systems, we “calculate” the truth rather than “looking it up.”


Concepts You Must Understand First

Stop and research these before coding:

  1. Hash Functions (Non-Cryptographic)
    • Why use MurmurHash or CityHash instead of SHA-256?
    • What is the “Avalanche Effect”?
    • Book Reference: “Data Structures and Algorithms” (any) - Hashing section.
  2. The Modulo Operator
    • How do you turn a massive 64-bit integer into a number between 0 and 99?
    • What happens if the hash isn’t perfectly uniform?

Questions to Guide Your Design

Before implementing, think through these:

  1. Salting
    • If I use the same hash for two experiments, will the same users always be in the same groups?
    • How should I combine UserID and ExperimentID before hashing? (Concatenation? Delimiters?)
  2. Scaling
    • If I change the split from 50/50 to 10/90, how does the bucket logic change?

Thinking Exercise

The “Correlation Trap”

Imagine two experiments:

  1. bg_color: (Red vs Blue)
  2. font_size: (Small vs Large)

If you just hash the UserID, the users who see the Red background will always see the Small font.

Questions while analyzing:

  • How do you “decorrelate” these two experiments using only a hash function?
  • What happens if you hash UserID + ExperimentID? Try to trace the bits of two similar strings.

The Interview Questions They’ll Ask

  1. “Why wouldn’t you use Math.random() to assign users to groups?”
  2. “What is a ‘sticky assignment’ and why is it important for user experience?”
  3. “How do you handle assignments for users who aren’t logged in?”
  4. “If you see a 45/55 split on a 50/50 experiment, what’s your first troubleshooting step?”
  5. “Explain the tradeoffs between server-side and client-side assignment.”

Hints in Layers

Hint 1: The Input Combine your User ID and Experiment ID into a single string. This is your “unique key” for this specific assignment.

Hint 2: The Hash Use a library for MurmurHash3 (32-bit or 128-bit). It’s fast and has great distribution properties.

Hint 3: The Bucket Convert the hash output (a big number) to a positive integer, then use % 100. This gives you a bucket from 0 to 99.

Hint 4: The Logic If bucket < 50, it’s Group A. Otherwise, it’s Group B. For a 10/90 split, if bucket < 10, it’s Group A.
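Putting the hints together, here is a rough Python sketch of the validation pass; hashlib stands in for MurmurHash3, and the names and the 1% tolerance are illustrative choices:

```python
import hashlib

def bucket(user_id: str, experiment_id: str) -> int:
    # Salt with the experiment ID, hash, then reduce modulo 100 (Hints 1-3).
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % 100

def validate_split(n_users: int = 100_000, experiment_id: str = "new_checkout_flow") -> None:
    counts = {"A": 0, "B": 0}
    for i in range(n_users):
        b = bucket(f"user_{i}", experiment_id)
        counts["A" if b < 50 else "B"] += 1     # Hint 4: bucket < 50 means Group A
    for group, count in counts.items():
        print(f"Group {group}: {count} ({count / n_users:.2%})")
    # A fair 50/50 split should deviate from 50% by well under one percentage point here.
    skew = abs(counts["A"] / n_users - 0.5)
    print("UNIFORM DISTRIBUTION VERIFIED" if skew < 0.01 else "WARNING: POSSIBLE BIAS")

validate_split()
```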


Books That Will Help

| Topic | Book | Chapter |
|---|---|---|
| Hashing for Experiments | “Trustworthy Online Controlled Experiments” | Ch. 4.2 |
| Implementation Patterns | “Experimentation for Engineers” | Ch. 3 |

Project 2: Feature Toggle Engine with Targeting (Dynamic Control)

  • File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
  • Main Programming Language: TypeScript/Node.js
  • Alternative Programming Languages: Go, Python
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Web Rendering / Rule Engines
  • Software or Tool: JSON-based Rules
  • Main Book: “Practical A/B Testing” by Leemay Nassery

What you’ll build: A rule engine that evaluates whether a feature should be “ON” or “OFF” based on user attributes (e.g., country == 'US', app_version >= '2.0', subscription == 'premium').

Why it teaches A/B Testing: Experiments are just a special type of Feature Toggle. Before you can split traffic, you need to be able to target specific “cohorts” (e.g., “only test this on 10% of users in Canada”).

Core challenges you’ll face:

  • Rule Evaluation → Building a parser for logical conditions (AND/OR).
  • Attribute Matching → Handling different data types (SemVer, lists, booleans).
  • Fallbacks → Ensuring the app doesn’t crash if the config is missing.

Difficulty: Intermediate. Time estimate: 1 week. Prerequisites: Understanding of JSON, basic Boolean logic.


Real World Outcome

A “Config Service” that returns a set of enabled features for a specific user.

Example Usage (Client):

const user = { id: "u1", country: "US", version: "1.5.0" };
const flags = engine.getFlags(user);

if (flags.isEnabled("new_dashboard")) {
  renderNewDashboard();
}

Example Output (Server Logs):

Evaluating 'new_dashboard' for user u1...
- Rule 1 (Country is US): MATCH
- Rule 2 (Version >= 2.0.0): FAIL
- Final Result: OFF
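One possible shape for the evaluator, sketched in Python for consistency with the other examples; the rule schema, operator names, and AND-only semantics here are invented for illustration:

```python
# Hypothetical rule schema: each rule has an attribute, an operator, and a value.
FLAG_CONFIG = {
    "new_dashboard": {
        "default": False,
        "rules": [
            {"attribute": "country", "op": "in", "value": ["US", "CA"]},
            {"attribute": "version", "op": "gte", "value": "2.0.0"},
        ],
    }
}

def _matches(user: dict, rule: dict) -> bool:
    actual = user.get(rule["attribute"])
    if actual is None:
        return False                      # missing attribute never matches
    if rule["op"] == "in":
        return actual in rule["value"]
    if rule["op"] == "gte":
        # Naive SemVer comparison: compare numeric components left to right.
        return [int(x) for x in str(actual).split(".")] >= [int(x) for x in rule["value"].split(".")]
    return False

def is_enabled(flag: str, user: dict) -> bool:
    config = FLAG_CONFIG.get(flag)
    if config is None:
        return False                      # fallback: unknown flags stay off
    # All rules must match (AND semantics); OR groups would nest another list.
    if all(_matches(user, rule) for rule in config["rules"]):
        return True
    return config["default"]

user = {"id": "u1", "country": "US", "version": "1.5.0"}
print(is_enabled("new_dashboard", user))  # False: country matches, version does not
```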

The Core Question You’re Answering

“How can I change my app’s behavior for specific users without redeploying code?”


Concepts You Must Understand First

Stop and research these before coding:

  1. Decoupling Deployment from Release
    • What is a “Dark Launch”?
    • What is a “Canary Deployment”?
  2. JSON Schemas for Rules
    • How do you represent “If user is in [A, B, C]” in JSON?

Thinking Exercise

The Override Problem

Imagine a developer needs to test a feature that is currently “OFF” for everyone.

Questions:

  • How do you design your rules so a “Developer ID” always sees the feature regardless of the general rules?
  • Does the “Override” rule go at the top or the bottom of your JSON config? Why?

The Interview Questions They’ll Ask

  1. “What’s the difference between a Feature Toggle and an A/B Test?”
  2. “How do you prevent ‘Technical Debt’ when using feature flags?”
  3. “Explain how you would handle ‘Flag Dependencies’ (e.g., Feature B requires Feature A to be ON).”
  4. “What are the performance implications of evaluating complex rules on every request?”

Project 3: The Telemetry Ingestor (Tracking the “Truth”)

  • File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
  • Main Programming Language: Python (with FastAPI)
  • Alternative Programming Languages: Go, Java
  • Coolness Level: Level 1: Pure Corporate Snoozefest
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Data Engineering / Telemetry
  • Software or Tool: Redis or SQLite
  • Main Book: “Trustworthy Online Controlled Experiments” by Kohavi et al.

What you’ll build: A high-speed API that accepts two types of events:

  1. Exposure: “User X saw Experiment Y, Variant B”
  2. Action/Conversion: “User X clicked ‘Purchase’”

You will store these in a way that allows you to calculate the “Conversion Rate” for each variant later.

Why it teaches A/B Testing: Statistics are “garbage in, garbage out.” You’ll learn why tracking who saw the experiment is more important than tracking everyone. You’ll also grapple with the “At-Least-Once” vs. “Exactly-Once” delivery problems.

Core challenges you’ll face:

  • The Attribution Window → A purchase today might be because of an experiment seen 3 days ago. How do you link them?
  • Idempotency → What if a client sends the same exposure event twice?
  • Data Volume → Experiments generate 10x more logs than normal features. How do you keep it efficient?

Real World Outcome

A database (or CSV) containing a clean audit log of user behavior mapped to experiment groups.

Example Log Record:

| timestamp | user_id | event_type | experiment_id | variant | value |
|-----------|---------|------------|---------------|---------|-------|
| 10:00:01  | u1      | EXPOSURE   | checkout_btn  | B       | null  |
| 10:05:22  | u1      | CONVERSION | null          | null    | 49.99 |
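A minimal ingestion sketch, assuming FastAPI and SQLite as suggested above; the endpoint path, event schema, and the client-supplied event_id used for idempotency are illustrative choices:

```python
import sqlite3
import time
from typing import Optional

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
db = sqlite3.connect("telemetry.db", check_same_thread=False)
db.execute("""CREATE TABLE IF NOT EXISTS events (
    event_id TEXT PRIMARY KEY,   -- client-supplied ID makes retries idempotent
    ts REAL, user_id TEXT, event_type TEXT,
    experiment_id TEXT, variant TEXT, value REAL)""")

class Event(BaseModel):
    event_id: str                          # deduplication key sent by the client
    user_id: str
    event_type: str                        # "EXPOSURE" or "CONVERSION"
    experiment_id: Optional[str] = None
    variant: Optional[str] = None
    value: Optional[float] = None

@app.post("/events")
def ingest(event: Event):
    # INSERT OR IGNORE drops duplicates, so replayed events do not double-count.
    db.execute(
        "INSERT OR IGNORE INTO events VALUES (?, ?, ?, ?, ?, ?, ?)",
        (event.event_id, time.time(), event.user_id, event.event_type,
         event.experiment_id, event.variant, event.value),
    )
    db.commit()
    return {"status": "accepted"}
```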


Project 4: The Stats Engine (Calculating the Winner)

  • File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
  • Main Programming Language: Python (with NumPy/SciPy)
  • Alternative Programming Languages: R, Julia
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 1. The “Resume Gold”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Statistics / Data Analysis
  • Software or Tool: SciPy Stats
  • Main Book: “Statistical Methods in Online A/B Testing” by Georgi Georgiev

What you’ll build: A script that reads your Telemetry data (from Project 3) and performs a Two-Sample Z-Test. It will output the Conversion Rate for A, the Conversion Rate for B, the p-value, and the 95% Confidence Interval.

Why it teaches A/B Testing: This is where the science happens. You’ll learn why “B has 5% more clicks” isn’t enough to make a decision. You’ll understand the “Null Hypothesis” and why a large sample size is required to “see” small changes.

Core challenges you’ll face:

  • Understanding Variance → Why does the “spread” of the data matter as much as the average?
  • The P-Value → Calculating it from scratch (or using a library) and explaining it in plain English.
  • Statistical Power → Calculating how many users you would have needed to see a specific effect.

Learning milestones:

  1. You can calculate a simple conversion rate (Success / Total).
  2. You can calculate the Standard Error for each group.
  3. You produce a P-Value and correctly interpret it (“Is it significant?”).
  4. You generate a Confidence Interval (e.g., “The lift is between 2% and 8%”).
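A sketch of the two-proportion z-test behind these milestones, written from scratch on top of scipy.stats.norm; the counts below are made up and the helper name is illustrative:

```python
import math
from scipy.stats import norm

def two_proportion_ztest(conv_a: int, n_a: int, conv_b: int, n_b: int, alpha: float = 0.05):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled proportion under the null hypothesis (no difference between groups).
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pooled = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se_pooled
    p_value = 2 * (1 - norm.cdf(abs(z)))            # two-sided test
    # Confidence interval for the difference uses the unpooled standard error.
    se_diff = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = norm.ppf(1 - alpha / 2) * se_diff
    return p_b - p_a, p_value, (p_b - p_a - margin, p_b - p_a + margin)

lift, p, ci = two_proportion_ztest(conv_a=1200, n_a=10000, conv_b=1320, n_b=10000)
print(f"Absolute lift: {lift:.4f}, p-value: {p:.4f}, 95% CI: ({ci[0]:.4f}, {ci[1]:.4f})")
```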

Project 5: Multi-Armed Bandits (Dynamic Traffic Optimization)

  • File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: Julia, Rust
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Machine Learning / Reinforcement Learning
  • Software or Tool: Thompson Sampling
  • Main Book: “Experimentation for Engineers” by David Sweet

What you’ll build: An “Epsilon-Greedy” or “Thompson Sampling” agent that automatically shifts traffic towards the winning variant in real-time. Instead of a fixed 50/50 split, the system “learns” which one is better and minimizes “regret” (lost revenue from showing the bad version).

Why it teaches A/B Testing: Fixed A/B tests are expensive—you waste 50% of your traffic on a potentially inferior version for weeks. Bandits teach you about the Exploration vs. Exploitation tradeoff.

Core challenges you’ll face:

  • Balancing Exploration → How do you ensure you don’t pick a winner too early based on noise?
  • Thompson Sampling → Implementing a Bayesian approach where you sample from a Beta distribution.
  • Delayed Feedback → What if the conversion happens hours after the assignment?

Real World Outcome

A simulation where you see the traffic split evolve from 50/50 to 90/10 as the system discovers the better version.

Example Simulation Output:

Round 100: Group A (10% conv) | Group B (15% conv) | Split: 50/50
Round 500: Group A (9% conv)  | Group B (16% conv) | Split: 30/70
Round 1000: Group A (10% conv)| Group B (15% conv) | Split: 5/95
Total Regret Saved: $4,200 compared to fixed A/B test.
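A rough Thompson Sampling simulation, assuming Bernoulli conversions with made-up true rates; it shows traffic drifting toward the better arm as evidence accumulates:

```python
import numpy as np

rng = np.random.default_rng(42)
TRUE_RATES = {"A": 0.10, "B": 0.15}          # hidden ground truth for the simulation
successes = {"A": 0, "B": 0}
failures = {"A": 0, "B": 0}

for round_ in range(1, 5001):
    # Thompson Sampling: draw one sample from each arm's Beta posterior,
    # then show the variant whose sample is highest.
    samples = {arm: rng.beta(successes[arm] + 1, failures[arm] + 1) for arm in TRUE_RATES}
    chosen = max(samples, key=samples.get)
    converted = rng.random() < TRUE_RATES[chosen]
    successes[chosen] += converted
    failures[chosen] += not converted
    if round_ % 1000 == 0:
        traffic_to_b = (successes["B"] + failures["B"]) / round_
        print(f"Round {round_}: traffic share to B = {traffic_to_b:.0%}")
```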

The Core Question You’re Answering

“If I already suspect Version B is better, why am I still showing Version A to 50% of my users?”


Thinking Exercise

The Epsilon Dilemma

If you set $\epsilon = 0.1$, you spend 10% of your time exploring and 90% exploiting the current winner.

Questions:

  • What happens if the world changes (e.g., a holiday season makes Version A better)?
  • Does a fixed A/B test handle a changing world better or worse than a Bandit?

Project 6: The Power & Sample Size Calculator (Planning for Success)

  • File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: TypeScript, R
  • Coolness Level: Level 3: Genuinely Clever
  • Business Potential: 2. The “Micro-SaaS / Pro Tool”
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Statistics / Product Management
  • Software or Tool: Power Analysis Math
  • Main Book: “Trustworthy Online Controlled Experiments” by Kohavi et al.

What you’ll build: A tool that tells a Product Manager: “To detect a 2% lift in conversion, you need 450,000 users. Based on your current traffic, this test will take 12 days.”

Why it teaches A/B Testing: Most experiments fail because they are “underpowered.” You’ll learn the relationship between Sample Size, Minimum Detectable Effect (MDE), and Statistical Power (1 - $\beta$).

Core challenges you’ll face:

  • The Inverse Math → Calculating N based on $\alpha$, $\beta$, and $\delta$.
  • Baseline Conversion → Why does it take more users to detect a change if your baseline is 1% vs 10%?
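A sketch of the inverse math using the standard normal-approximation formula for two proportions; the baseline, absolute MDE, and traffic numbers are placeholders:

```python
from scipy.stats import norm

def sample_size_per_group(baseline: float, mde_abs: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Users needed per group to detect an absolute lift of `mde_abs` over `baseline`."""
    p1, p2 = baseline, baseline + mde_abs
    z_alpha = norm.ppf(1 - alpha / 2)        # two-sided significance threshold
    z_beta = norm.ppf(power)                 # desired statistical power (1 - beta)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = ((z_alpha + z_beta) ** 2 * variance) / (mde_abs ** 2)
    return int(n) + 1

n = sample_size_per_group(baseline=0.10, mde_abs=0.02)
daily_traffic_per_group = 5000               # assumed traffic; plug in your own numbers
print(f"Need ~{n:,} users per group; about {n / daily_traffic_per_group:.1f} days per group.")
```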

Project 7: The SRM (Sample Ratio Mismatch) Guardrail

  • File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: SQL, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 3: Advanced
  • Knowledge Area: Data Quality / Debugging
  • Software or Tool: Chi-Squared Test
  • Main Book: “Trustworthy Online Controlled Experiments” by Kohavi et al.

What you’ll build: An automated alert system that runs a Chi-Squared Goodness of Fit test on your exposure counts. If you asked for a 50/50 split but got 49,000 vs 51,000, the tool calculates if this deviation is statistically “impossible” (indicating a bug in assignment or logging).

Why it teaches A/B Testing: SRM is the #1 reason for “Trust” issues in experimentation. It forces you to think like a detective. Is the variant slower? Is it crashing for Safari users? SRM reveals the hidden bugs.

Core Question:

“If my split isn’t exactly what I asked for, can I still trust the results?” (Hint: Usually no).
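A minimal guardrail sketch using scipy.stats.chisquare; the alert threshold (p < 0.001 is a common convention) and the example counts are illustrative:

```python
from scipy.stats import chisquare

def check_srm(observed_a: int, observed_b: int,
              expected_ratio: float = 0.5, threshold: float = 0.001) -> None:
    """Flag a Sample Ratio Mismatch using a chi-squared goodness-of-fit test."""
    total = observed_a + observed_b
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    stat, p_value = chisquare(f_obs=[observed_a, observed_b], f_exp=expected)
    # A tiny p-value means this imbalance is essentially impossible by chance:
    # treat it as a bug in assignment or logging, not as an experiment result.
    if p_value < threshold:
        print(f"SRM DETECTED (p = {p_value:.2e}): do not trust this experiment.")
    else:
        print(f"Split looks healthy (p = {p_value:.3f}).")

check_srm(49_000, 51_000)   # 49k vs 51k on a 50/50 split -> p on the order of 1e-10
```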


Project 8: The Experiment Results Dashboard (Visualizing Confidence)

  • File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
  • Main Programming Language: React / Python (Streamlit)
  • Alternative Programming Languages: Vue, Svelte
  • Coolness Level: Level 2: Practical but Forgettable
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 2: Intermediate
  • Knowledge Area: Data Visualization / Frontend
  • Software or Tool: Plotly / D3.js
  • Main Book: “Practical A/B Testing” by Leemay Nassery

What you’ll build: A dashboard that visualizes the results of an experiment. It shouldn’t just show a bar chart; it must show Confidence Intervals and a “Probability of Being Better” chart.

Why it teaches A/B Testing: You’ll learn how to communicate complex stats to non-technical stakeholders. You’ll realize that a “statistically significant” result can still be “practically insignificant” if the lift is tiny.

Key Design Question:

  • How do you visually represent “uncertainty”? (Error bars? Overlapping distributions?)
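One low-tech way to show uncertainty, sketched with matplotlib error bars rather than the Plotly/D3 stack named above; all numbers are placeholders:

```python
import matplotlib.pyplot as plt

# Illustrative numbers: conversion rates with 95% confidence interval half-widths.
groups = ["Control", "Variant"]
rates = [0.120, 0.134]
ci_half_widths = [0.006, 0.007]

fig, ax = plt.subplots(figsize=(5, 3))
# Error bars make the overlap (or lack of it) between groups visible at a glance.
ax.errorbar(groups, rates, yerr=ci_half_widths, fmt="o", capsize=8)
ax.set_ylabel("Conversion rate")
ax.set_title("Conversion rate with 95% confidence intervals")
plt.tight_layout()
plt.savefig("experiment_results.png")
```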

Project 9: Bayesian Experimentation Engine (Probability B > A)

  • File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
  • Main Programming Language: Python (with PyMC or simple Beta-Binomial math)
  • Alternative Programming Languages: Julia, R
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 4: Expert
  • Knowledge Area: Bayesian Statistics
  • Software or Tool: Conjugate Priors (Beta distribution)
  • Main Book: “Experimentation for Engineers” by David Sweet

What you’ll build: A stats engine that doesn’t output p-values, but instead answers: “What is the probability that Version B is at least 2% better than Version A?” and “How much am I expected to lose if I pick the wrong one?”

Why it teaches A/B Testing: Frequentist stats (p-values) are often unintuitive. Bayesian stats provide direct answers to business questions. You’ll learn about Priors, Posteriors, and the Beta Distribution.

Core challenges you’ll face:

  • Conjugate Priors → Using the Beta-Binomial model to update beliefs as data arrives.
  • Monte Carlo Simulation → Sampling from distributions to calculate probabilities.

Real World Outcome

A report that says: “There is a 94% probability that Variant B is better than Control.”

Example Output:

[Bayesian Analysis]
Control: Mean=0.12, Interval=[0.11, 0.13]
Variant: Mean=0.14, Interval=[0.12, 0.16]

Probability Variant > Control: 98.4%
Expected Lift: 16.6%
Risk of choosing Variant (Error): 0.02%
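A minimal Beta-Binomial sketch using Monte Carlo samples from the two posteriors; uniform Beta(1, 1) priors are assumed and the counts are made up:

```python
import numpy as np

rng = np.random.default_rng(7)

# Observed data (illustrative): conversions / visitors per group.
control = {"conversions": 1200, "visitors": 10_000}
variant = {"conversions": 1400, "visitors": 10_000}

def posterior_samples(group: dict, n_samples: int = 200_000) -> np.ndarray:
    # Beta(1, 1) prior updated with observed successes and failures (conjugate update).
    a = 1 + group["conversions"]
    b = 1 + group["visitors"] - group["conversions"]
    return rng.beta(a, b, size=n_samples)

control_samples = posterior_samples(control)
variant_samples = posterior_samples(variant)

prob_variant_better = (variant_samples > control_samples).mean()
expected_lift = (variant_samples / control_samples - 1).mean()
# Expected loss if we ship the Variant but Control was actually better.
expected_loss = np.maximum(control_samples - variant_samples, 0).mean()

print(f"P(Variant > Control): {prob_variant_better:.1%}")
print(f"Expected relative lift: {expected_lift:.1%}")
print(f"Expected loss of choosing Variant: {expected_loss:.5f}")
```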

Project 10: The High-Performance Edge SDK (Speed is a Metric)

  • File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
  • Main Programming Language: Rust / WebAssembly
  • Alternative Programming Languages: C++, Go
  • Coolness Level: Level 4: Hardcore Tech Flex
  • Business Potential: 4. The “Open Core” Infrastructure
  • Difficulty: Level 4: Expert
  • Knowledge Area: Low-Level Systems / Edge Computing
  • Software or Tool: Shared Memory / FlatBuffers
  • Main Book: “Trustworthy Online Controlled Experiments” by Kohavi et al.

What you’ll build: A compiled SDK that can perform 1,000,000 assignments per second. It must parse a JSON config into a highly efficient binary structure (like a Radix Tree or a Bitmask) and perform hashing in zero-allocation memory.

Why it teaches A/B Testing: Experiments themselves can slow down your app, causing a “negative lift” just because of latency. You’ll learn how to build an “Invisible” infrastructure.

Core Question:

“How do I make the ‘Decision Engine’ so fast that the user never knows it happened?”


Project 11: Sequential Testing (Solving the “Peeking” Problem)

  • File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
  • Main Programming Language: Python
  • Alternative Programming Languages: R
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 3. The “Service & Support” Model
  • Difficulty: Level 5: Master
  • Knowledge Area: Advanced Statistics / Sequential Analysis
  • Software or Tool: SPRT (Sequential Probability Ratio Test)
  • Main Book: “Statistical Methods in Online A/B Testing” by Georgi Georgiev

What you’ll build: A stats engine that allows you to check the results every hour and stop the test early if a result is found—without increasing your “False Positive Rate.”

Why it teaches A/B Testing: In standard A/B testing, “Peeking” at the results before the sample size is reached is a cardinal sin that leads to false winners. This project teaches you how to use Alpha-Spending functions or mSPRT to enable “Always Valid” p-values.

Thinking Exercise:

  • If you flip a coin 10 times and get 7 heads, is it biased? What if you keep flipping until you reach a point where heads are “significantly” more frequent, then stop? Why is the second approach dishonest?
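Before implementing mSPRT or alpha-spending, it helps to quantify the sin: below is a rough Monte Carlo of A/A tests (no real difference) in which a naive fixed-alpha z-test is re-checked after every batch of traffic; all parameters are illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def peeking_false_positive_rate(n_experiments=2000, checks=20, users_per_check=500, p=0.1):
    false_positives = 0
    for _ in range(n_experiments):
        conv_a = conv_b = n_a = n_b = 0
        for _ in range(checks):                          # "peek" after every batch of traffic
            conv_a += rng.binomial(users_per_check, p)
            conv_b += rng.binomial(users_per_check, p)   # A/A test: no real difference exists
            n_a += users_per_check
            n_b += users_per_check
            p_pool = (conv_a + conv_b) / (n_a + n_b)
            se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
            z = (conv_b / n_b - conv_a / n_a) / se
            if 2 * (1 - norm.cdf(abs(z))) < 0.05:        # naive fixed-alpha check at every peek
                false_positives += 1
                break
    return false_positives / n_experiments

# With 20 peeks, the "5%" test fires far more often than 5% of the time.
print(f"False positive rate with 20 peeks: {peeking_false_positive_rate():.1%}")
```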

Project 12: Cluster/Network Randomization (Handling Interference)

  • File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
  • Main Programming Language: Python / SQL
  • Alternative Programming Languages: Java
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Graph Theory / Advanced Experiment Design
  • Software or Tool: Network Clustering Algorithms
  • Main Book: “Trustworthy Online Controlled Experiments” by Kohavi et al. (Ch. 15)

What you’ll build: An assignment engine for a social network or marketplace where users interact. If User A is in the “New Chat” experiment but User B is not, their interaction is tainted. You will build a system that randomizes by City or Cluster rather than by individual ID.

Why it teaches A/B Testing: You’ll learn about SUTVA (Stable Unit Treatment Value Assumption) and what happens when it’s violated. This is the “final boss” of experimentation design.
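A minimal sketch of cluster-level assignment: hash the cluster key (here, the city) instead of the user ID, so connected users share a treatment; the helper names are illustrative:

```python
import hashlib

def bucket(key: str, salt: str) -> int:
    value = int.from_bytes(hashlib.sha256(f"{salt}:{key}".encode()).digest()[:8], "big")
    return value % 100

def assign_user(user_id: str, city: str, experiment_id: str) -> str:
    # Randomize by cluster (city), not by individual: every user in the same city
    # lands in the same group, so connected users cannot contaminate each other.
    return "VARIANT" if bucket(city, experiment_id) < 50 else "CONTROL"

# Two users in the same city always share a treatment; a different city may differ.
print(assign_user("u1", "Toronto", "new_chat"))
print(assign_user("u2", "Toronto", "new_chat"))   # same as u1, by construction
print(assign_user("u3", "Lisbon", "new_chat"))
```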


Project Comparison Table

| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Deterministic Allocator | Level 1 | Weekend | Medium | ★★★★☆ |
| 2. Feature Toggle Engine | Level 2 | 1 Week | Medium | ★★★☆☆ |
| 3. Telemetry Ingestor | Level 2 | 1 Week | High | ★★☆☆☆ |
| 4. Stats Engine (Freq) | Level 3 | 2 Weeks | Very High | ★★★★☆ |
| 5. Multi-Armed Bandits | Level 3 | 2 Weeks | Very High | ★★★★★ |
| 6. Power Calculator | Level 2 | Weekend | High | ★★★☆☆ |
| 7. SRM Detector | Level 3 | 1 Week | High | ★★★★☆ |
| 8. Results Dashboard | Level 2 | 1 Week | Medium | ★★★☆☆ |
| 9. Bayesian Engine | Level 4 | 2 Weeks | Very High | ★★★★☆ |
| 10. Edge SDK | Level 4 | 2 Weeks | High | ★★★★★ |
| 11. Sequential Testing | Level 5 | 1 Month | Master | ★★★★☆ |
| 12. Cluster Randomizer | Level 5 | 1 Month | Master | ★★★☆☆ |

Recommendation

Where should you start?

  1. If you are an Engineer: Start with Project 1 (Allocator) and Project 2 (Toggle Engine). These give you the “infrastructure” mindset. Then move to Project 10 (Edge SDK) to see how to make it production-grade.
  2. If you are a Data Scientist: Start with Project 4 (Stats Engine) and Project 6 (Power Calculator). This is where you’ll find the most “aha!” moments about the math.
  3. If you want to build a product: Focus on the Full-Stack Project below.

Final Overall Project: “Omni-Test” – The End-to-End Experimentation SaaS

Following the same pattern above, here is your capstone challenge.

  • File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
  • Main Programming Language: Go (Backend) + React (Frontend) + Python (Stats)
  • Coolness Level: Level 5: Pure Magic (Super Cool)
  • Business Potential: 5. The “Industry Disruptor”
  • Difficulty: Level 5: Master
  • Knowledge Area: Full Stack / DevOps / Statistics
  • Main Book: “Trustworthy Online Controlled Experiments” by Kohavi et al.

What you’ll build: A complete, self-hosted A/B testing platform that includes:

  1. A Web UI to create experiments, define splits, and set targeting rules.
  2. A Go-based Proxy or Sidecar that intercepts requests and injects feature flags.
  3. A Data Pipeline (Kafka + ClickHouse) that ingests millions of events.
  4. An Automated Analyst that runs daily stats, checks for SRM, and sends Slack alerts when a winner is found.

Why it teaches A/B Testing: This project forces you to integrate every concept. You’ll deal with real-world problems like “How do I handle experiment collisions?” and “How do I make the UI intuitive for a Product Manager?”

Success Criteria:

  • You can create an experiment in the UI.
  • A separate app uses your SDK/Proxy and changes behavior.
  • You can simulate 10,000 users and see a “Significant” result appear in your dashboard.

Summary

This learning path covers A/B Testing & Experimentation Platforms through 13 hands-on projects. Here’s the complete list:

| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | Deterministic Allocator | Go | Level 1 | Weekend |
| 2 | Feature Toggle Engine | TypeScript | Level 2 | 1 Week |
| 3 | Telemetry Ingestor | Python | Level 2 | 1 Week |
| 4 | Stats Engine (Frequentist) | Python | Level 3 | 2 Weeks |
| 5 | Multi-Armed Bandits | Python | Level 3 | 2 Weeks |
| 6 | Power Calculator | Python | Level 2 | Weekend |
| 7 | SRM Detector | Python | Level 3 | 1 Week |
| 8 | Results Dashboard | React | Level 2 | 1 Week |
| 9 | Bayesian Engine | Python | Level 4 | 2 Weeks |
| 10 | Edge SDK | Rust/Wasm | Level 4 | 2 Weeks |
| 11 | Sequential Testing | Python | Level 5 | 1 Month |
| 12 | Cluster Randomization | Python | Level 5 | 1 Month |
| 13 | “Omni-Test” SaaS | Polyglot | Level 5 | 3 Months+ |

For beginners: Start with projects #1, #2, #3, and #6. For intermediate: Focus on #4, #7, #8, and #10. For advanced: Master #5, #9, #11, and #12.

Expected Outcomes

After completing these projects, you will:

  • Understand the low-level mechanics of deterministic user assignment (hashing and bucketing).
  • Be able to build scalable data pipelines for telemetry ingestion.
  • Master the statistical rigor (p-values, CI, power) required for trustworthy results.
  • Understand the business tradeoffs between fixed experiments and dynamic bandits.
  • Be capable of architecting a production-grade experimentation platform used by elite engineering teams.

You’ll have built a suite of working tools that demonstrate deep understanding of Experimentation from first principles.