LEARN AB TESTING EXPERIMENTATION PLATFORMS
Learn A/B Testing & Experimentation Platforms: From Zero to Platform Architect
Goal: Deeply understand the engineering and statistics behind experimentation platforms: how to consistently assign users to treatments, track metrics at scale, and automate statistical rigor to turn raw data into confident product decisions. You will build a system that moves beyond "guessing" to a scientific, automated feedback loop.
Why Experimentation Platforms Matter
In 2007, an engineer at Google wondered if a different shade of blue would increase clicks on search results. They tested 41 different shades. The result? An extra $200 million in annual revenue.
But A/B testing isn't just about colors. It's the engine behind Amazon's recommendation system, Netflix's artwork selection, and Uber's pricing algorithms. Without a robust platform, organizations fall into the "HiPPO" trap (Highest Paid Person's Opinion). A platform democratizes data, allowing any idea to be tested against reality.
Building an experimentation platform is a unique engineering challenge that sits at the intersection of:
- High-performance systems: Deciding a treatment in <5ms at the edge.
- Distributed systems: Ensuring "sticky" assignments across millions of devices.
- Statistics: Automating complex math so developers don't have to be PhDs.
- Data Engineering: Processing billions of events to calculate precise metrics.
Core Concept Analysis
The Anatomy of an Experiment
At its heart, an experimentation platform is a system that maps a User to a Treatment based on a Variable within a specific Context.
[ User ID ] + [ Experiment ID ]
| |
v v
+--------------------------+
| Deterministic Hash | (e.g., MurmurHash3)
+--------------------------+
|
v
[ Numeric Bucket (0-99) ]
|
+------+------+------+
| | |
[ 0-49 ] [ 50-99 ] |
| | |
v v v
Treatment A Treatment B (Not in Experiment)
(Control) (Variant)
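A minimal sketch of this mapping in Python. It uses the standard library's SHA-256 purely so the snippet runs without extra dependencies; a real platform would typically use a fast non-cryptographic hash such as MurmurHash3.
```python
import hashlib

def bucket(user_id: str, experiment_id: str) -> int:
    """Map a (user, experiment) pair to a stable bucket in [0, 99].

    Sketch only: SHA-256 stands in for MurmurHash3 so this runs
    with no third-party dependencies.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an unsigned integer, then reduce to 0-99.
    value = int.from_bytes(digest[:8], "big")
    return value % 100

# Same inputs always yield the same bucket -- no state stored anywhere.
assert bucket("user_123", "new_checkout_flow") == bucket("user_123", "new_checkout_flow")
```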
The Platform Architecture
A mature platform consists of four major layers:
- Assignment (The Brain): Fast, deterministic logic that tells the client which version to show.
- Telemetry (The Nerves): High-throughput ingestion of "exposure" events (who saw what) and "metric" events (what they did).
- Analytics (The Digestion): Aggregation of events into cohorts (Group A vs. Group B).
- Statistics (The Judge): Comparison of groups to determine if the difference is "real" or just "noise" (p-values, confidence intervals).
CLIENT SIDE BACKEND / DATA WAREHOUSE
+-------------+ +--------------------------+
| SDK |---(1)---> | Assignment Engine |
| (Decision) | | (Feature Flags/Toggles) |
+-------------+ +--------------------------+
|
| (2) Exposure Event ("User X saw Variant B")
v
+-------------+ +--------------------------+
| Telemetry |---(3)---> | Data Pipeline |
| Ingestion | | (Kafka/Kinesis) |
+-------------+ +--------------------------+
|
v
+--------------------------+
| Stats Engine |
| (P-values, CI, Power) |
+--------------------------+
Concept Summary Table
| Concept Cluster | What You Need to Internalize |
|---|---|
| Deterministic Hashing | Assignments must be "sticky" and uniform. Hashing (not random numbers) ensures the same user gets the same variant every time without storing state. |
| Exposure vs. Event | You can only analyze users who actually saw the experiment. Distinguishing "Exposures" from general "Events" is the foundation of valid analysis. |
| Statistical Significance | How unlikely the observed difference would be if there were truly no effect (the null hypothesis). p < 0.05 is the industry-standard threshold, but it is often misunderstood. |
| Feature Toggling | The ability to decouple deployment from release. Every experiment is a toggle, but not every toggle is an experiment. |
| SRM (Sample Ratio Mismatch) | A critical bug-detector. If you expect a 50/50 split but get 48/52, your data is likely tainted. |
Deep Dive Reading by Concept
This section maps each concept to specific book chapters. Read these alongside the projects to build strong mental models.
Statistical Foundations
| Concept | Book & Chapter |
|---|---|
| The P-Value & Null Hypothesis | "Trustworthy Online Controlled Experiments" by Kohavi et al. - Ch. 17: "Statistics Behind Online Controlled Experiments" |
| Confidence Intervals | "Statistical Methods in Online A/B Testing" by Georgi Georgiev - Ch. 3: "Statistical Significance and Confidence" |
| Bayesian vs. Frequentist | "Experimentation for Engineers" by David Sweet - Ch. 8: "Bayesian Optimization" |
Engineering & Platform Design
| Concept | Book & Chapter |
|---|---|
| Assignment Mechanisms | "Trustworthy Online Controlled Experiments" by Kohavi et al. - Ch. 4: "Experimentation Platform and Analysis" |
| Feature Flags & Toggles | "Practical A/B Testing" by Leemay Nassery - Ch. 2: "The Engineering of Feature Flags" |
| Telemetry & Data Quality | "Trustworthy Online Controlled Experiments" by Kohavi et al. - Ch. 13: "Data Quality" |
Essential Reading Order
- The Fundamentals (Week 1):
- Trustworthy Online Controlled Experiments Ch. 1 & 2 (Introduction and Strategy)
- Experimentation for Engineers Ch. 1 & 2 (Basics of Experiments)
- The Math (Week 2):
- Statistical Methods in Online A/B Testing Ch. 1-3 (Significance and Power)
- The Architecture (Week 3):
- Practical A/B Testing Ch. 4 (Building vs. Buying)
Project List
Projects are ordered from fundamental understanding to advanced implementations.
Project 1: The Deterministic Allocator (The Heart of Sticky Assignments)
- File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
- Main Programming Language: Go
- Alternative Programming Languages: Python, Rust, C++
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 1. The "Resume Gold"
- Difficulty: Level 1: Beginner
- Knowledge Area: Distributed Systems / Hashing
- Software or Tool: MurmurHash3
- Main Book: "Trustworthy Online Controlled Experiments" by Kohavi et al.
What you'll build: A library that takes a UserID and an ExperimentID and deterministically returns a bucket (0-99). You will then run a simulation to verify that 100,000 unique IDs are distributed uniformly across treatments.
Why it teaches A/B Testing: Most beginners think A/B testing uses rand(). This project teaches why rand() is forbidden. You'll learn how to ensure a user stays in the same group across sessions without a database, and how to "salt" hashes to avoid correlation between different experiments.
Core challenges you'll face:
- Ensuring Uniformity → Maps to preventing biased splits (SRM).
- Independence of Experiments → Maps to using salts to ensure a user in Group A for Experiment 1 isn't always in Group A for Experiment 2.
- Performance → Maps to deciding assignments in microseconds.
Key Concepts:
- Deterministic Hashing: [MurmurHash3 Algorithm - Wikipedia]
- Bucketing Strategies: âTrustworthy Online Controlled Experimentsâ Ch. 4
Difficulty: Beginner. Time estimate: Weekend. Prerequisites: Basic programming, understanding of strings and integers.
Real World Outcome
You'll have a CLI tool that can split users into groups and a validation script that proves the split is fair.
Example Output:
$ ./allocator --user "user_123" --experiment "new_checkout_flow"
User: user_123 | Experiment: new_checkout_flow | Bucket: 42 | Group: VARIANT_B
$ ./allocator --validate --users 100000 --split 50/50
Running simulation for 100,000 users...
[Results]
Group A (Control): 50,042 (50.04%)
Group B (Variant): 49,958 (49.96%)
Standard Deviation: 0.08%
Verdict: UNIFORM DISTRIBUTION VERIFIED
The Core Question You're Answering
"How can I guarantee a user sees the same version of my app every time without saving their preference in a database?"
Before you write any code, realize that every database lookup adds latency. In high-scale systems, we "calculate" the truth rather than "looking it up."
Concepts You Must Understand First
Stop and research these before coding:
- Hash Functions (Non-Cryptographic)
- Why use MurmurHash or CityHash instead of SHA-256?
- What is the "Avalanche Effect"?
- Book Reference: "Data Structures and Algorithms" (any) - Hashing section.
- The Modulo Operator
- How do you turn a massive 64-bit integer into a number between 0 and 99?
- What happens if the hash isnât perfectly uniform?
Questions to Guide Your Design
Before implementing, think through these:
- Salting
- If I use the same hash for two experiments, will the same users always be in the same groups?
- How should I combine UserID and ExperimentID before hashing? (Concatenation? Delimiters?)
- Scaling
- If I change the split from 50/50 to 10/90, how does the bucket logic change?
Thinking Exercise
The "Correlation Trap"
Imagine two experiments:
- bg_color: (Red vs Blue)
- font_size: (Small vs Large)
If you just hash the UserID, the users who see the Red background will always see the Small font.
Questions while analyzing:
- How do you "decorrelate" these two experiments using only a hash function?
- What happens if you hash UserID + ExperimentID? Try to trace the bits of two similar strings.
The Interview Questions They'll Ask
- "Why wouldn't you use Math.random() to assign users to groups?"
- "What is a 'sticky assignment' and why is it important for user experience?"
- "How do you handle assignments for users who aren't logged in?"
- "If you see a 45/55 split on a 50/50 experiment, what's your first troubleshooting step?"
- "Explain the tradeoffs between server-side and client-side assignment."
Hints in Layers
Hint 1: The Input Combine your User ID and Experiment ID into a single string. This is your "unique key" for this specific assignment.
Hint 2: The Hash Use a library for MurmurHash3 (32-bit or 128-bit). It's fast and has great distribution properties.
Hint 3: The Bucket
Convert the hash output (a big number) to a positive integer, then use % 100. This gives you a bucket from 0 to 99.
Hint 4: The Logic If bucket < 50, it's Group A. Otherwise, it's Group B. For a 10/90 split, if bucket < 10, it's Group A.
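Putting the four hints together, here is a hedged sketch in Python. It assumes the third-party mmh3 package for MurmurHash3; the experiment name and the synthetic user IDs are made up for the simulation.
```python
# Sketch of the hints above, assuming the third-party `mmh3` package
# (pip install mmh3) provides MurmurHash3. Any hash with good avalanche
# behavior would work the same way.
from collections import Counter
import mmh3

def assign(user_id: str, experiment_id: str, split: int = 50) -> str:
    # Hint 1: salt the key with the experiment ID so experiments stay independent.
    key = f"{experiment_id}:{user_id}"
    # Hints 2 + 3: hash, force non-negative, reduce to a bucket in [0, 99].
    bucket = (mmh3.hash(key) & 0xFFFFFFFF) % 100
    # Hint 4: buckets below the split go to control, the rest to the variant.
    return "CONTROL" if bucket < split else "VARIANT"

# Validation: 100,000 synthetic users should land very close to 50/50.
counts = Counter(assign(f"user_{i}", "new_checkout_flow") for i in range(100_000))
for group, n in sorted(counts.items()):
    print(f"{group}: {n} ({n / 1000:.2f}%)")
```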
Books That Will Help
| Topic | Book | Chapter |
|---|---|---|
| Hashing for Experiments | "Trustworthy Online Controlled Experiments" | Ch. 4.2 |
| Implementation Patterns | "Experimentation for Engineers" | Ch. 3 |
Project 2: Feature Toggle Engine with Targeting (Dynamic Control)
- File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
- Main Programming Language: TypeScript/Node.js
- Alternative Programming Languages: Go, Python
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 2. The "Micro-SaaS / Pro Tool"
- Difficulty: Level 2: Intermediate
- Knowledge Area: Web Rendering / Rule Engines
- Software or Tool: JSON-based Rules
- Main Book: "Practical A/B Testing" by Leemay Nassery
What you'll build: A rule engine that evaluates whether a feature should be "ON" or "OFF" based on user attributes (e.g., country == 'US', app_version >= '2.0', subscription == 'premium').
Why it teaches A/B Testing: Experiments are just a special type of Feature Toggle. Before you can split traffic, you need to be able to target specific "cohorts" (e.g., "only test this on 10% of users in Canada").
Core challenges you'll face:
- Rule Evaluation → Building a parser for logical conditions (AND/OR).
- Attribute Matching → Handling different data types (SemVer, lists, booleans).
- Fallbacks → Ensuring the app doesn't crash if the config is missing.
Difficulty: Intermediate. Time estimate: 1 week. Prerequisites: Understanding of JSON, basic Boolean logic.
Real World Outcome
A "Config Service" that returns a set of enabled features for a specific user.
Example Usage (Client):
const user = { id: "u1", country: "US", version: "1.5.0" };
const flags = engine.getFlags(user);
if (flags.isEnabled("new_dashboard")) {
renderNewDashboard();
}
Example Output (Server Logs):
Evaluating 'new_dashboard' for user u1...
- Rule 1 (Country is US): MATCH
- Rule 2 (Version >= 2.0.0): FAIL
- Final Result: OFF
The Core Question You're Answering
"How can I change my app's behavior for specific users without redeploying code?"
Concepts You Must Understand First
Stop and research these before coding:
- Decoupling Deployment from Release
- What is a "Dark Launch"?
- What is a "Canary Deployment"?
- JSON Schemas for Rules
- How do you represent "If user is in [A, B, C]" in JSON?
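One possible (illustrative, not standardized) answer to that last question, sketched in Python for consistency with the other examples even though this project targets TypeScript. The rule shape, attribute names, and operators here are assumptions, not a fixed schema.
```python
# A hedged sketch of one possible JSON rule shape and its evaluation.
import json
import operator

rule_config = json.loads("""
{
  "flag": "new_dashboard",
  "default": false,
  "rules": [
    {"attribute": "country", "op": "in", "value": ["US", "CA"]},
    {"attribute": "version", "op": ">=", "value": "2.0.0"}
  ]
}
""")

OPS = {"in": lambda a, b: a in b,
       ">=": lambda a, b: tuple(map(int, a.split("."))) >= tuple(map(int, b.split("."))),
       "==": operator.eq}

def is_enabled(config: dict, user: dict) -> bool:
    # All rules must match (implicit AND); a failed or missing attribute
    # falls back to the flag's default value.
    for rule in config["rules"]:
        attr = user.get(rule["attribute"])
        if attr is None or not OPS[rule["op"]](attr, rule["value"]):
            return config["default"]
    return True

print(is_enabled(rule_config, {"id": "u1", "country": "US", "version": "1.5.0"}))  # False: version rule fails
```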
Thinking Exercise
The Override Problem
Imagine a developer needs to test a feature that is currently "OFF" for everyone.
Questions:
- How do you design your rules so a "Developer ID" always sees the feature regardless of the general rules?
- Does the "Override" rule go at the top or the bottom of your JSON config? Why?
The Interview Questions They'll Ask
- "What's the difference between a Feature Toggle and an A/B Test?"
- "How do you prevent 'Technical Debt' when using feature flags?"
- "Explain how you would handle 'Flag Dependencies' (e.g., Feature B requires Feature A to be ON)."
- "What are the performance implications of evaluating complex rules on every request?"
Project 3: The Telemetry Ingestor (Tracking the "Truth")
- File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
- Main Programming Language: Python (with FastAPI)
- Alternative Programming Languages: Go, Java
- Coolness Level: Level 1: Pure Corporate Snoozefest
- Business Potential: 3. The "Service & Support" Model
- Difficulty: Level 2: Intermediate
- Knowledge Area: Data Engineering / Telemetry
- Software or Tool: Redis or SQLite
- Main Book: "Trustworthy Online Controlled Experiments" by Kohavi et al.
What you'll build: A high-speed API that accepts two types of events:
- Exposure: "User X saw Experiment Y, Variant B"
- Action/Conversion: "User X clicked 'Purchase'"
You will store these in a way that allows you to calculate the "Conversion Rate" for each variant later.
Why it teaches A/B Testing: Statistics are "garbage in, garbage out." You'll learn why tracking who saw the experiment is more important than tracking everyone. You'll also grapple with the "At-Least-Once" vs. "Exactly-Once" delivery problems.
Core challenges you'll face:
- The Attribution Window → A purchase today might be because of an experiment seen 3 days ago. How do you link them?
- Idempotency → What if a client sends the same exposure event twice?
- Data Volume → Experiments generate 10x more logs than normal features. How do you keep it efficient?
Real World Outcome
A database (or CSV) containing a clean audit log of user behavior mapped to experiment groups.
Example Log Record:
| timestamp | user_id | event_type | experiment_id | variant | value |
|---|---|---|---|---|---|
| 10:00:01 | u1 | EXPOSURE | checkout_btn | B | null |
| 10:05:22 | u1 | CONVERSION | null | null | 49.99 |
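A minimal sketch of such an ingestor, assuming FastAPI and Pydantic, with an in-memory store standing in for the real pipeline. The field names, the /events route, and the client-generated event_id used for de-duplication are illustrative choices, not a prescribed schema.
```python
# Minimal ingestion sketch; a real service would write to Kafka/Redis,
# enforce auth, and persist the de-duplication set.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Event(BaseModel):
    event_id: str                  # client-generated, used for de-duplication
    user_id: str
    event_type: str                # "EXPOSURE" or "CONVERSION"
    experiment_id: str | None = None
    variant: str | None = None
    value: float | None = None

seen_event_ids: set[str] = set()   # naive idempotency guard
events: list[Event] = []           # stand-in for the real pipeline

@app.post("/events")
def ingest(event: Event):
    # Drop duplicates so retried requests don't double-count exposures.
    if event.event_id in seen_event_ids:
        return {"status": "duplicate_ignored"}
    seen_event_ids.add(event.event_id)
    events.append(event)
    return {"status": "accepted"}
```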
Project 4: The Stats Engine (Calculating the Winner)
- File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
- Main Programming Language: Python (with NumPy/SciPy)
- Alternative Programming Languages: R, Julia
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 1. The "Resume Gold"
- Difficulty: Level 3: Advanced
- Knowledge Area: Statistics / Data Analysis
- Software or Tool: SciPy Stats
- Main Book: "Statistical Methods in Online A/B Testing" by Georgi Georgiev
What you'll build: A script that reads your Telemetry data (from Project 3) and performs a Two-Sample Z-Test. It will output the Conversion Rate for A, the Conversion Rate for B, the p-value, and the 95% Confidence Interval.
Why it teaches A/B Testing: This is where the science happens. You'll learn why "B has 5% more clicks" isn't enough to make a decision. You'll understand the "Null Hypothesis" and why a large sample size is required to "see" small changes.
Core challenges you'll face:
- Understanding Variance → Why does the "spread" of the data matter as much as the average?
- The P-Value → Calculating it from scratch (or using a library) and explaining it in plain English.
- Statistical Power → Calculating how many users you would have needed to see a specific effect.
Learning milestones:
- You can calculate a simple conversion rate (Success / Total).
- You can calculate the Standard Error for each group.
- You produce a P-Value and correctly interpret it ("Is it significant?").
- You generate a Confidence Interval (e.g., "The lift is between 2% and 8%").
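A hedged sketch of the core math: a two-proportion z-test with a 95% confidence interval, using SciPy only for the normal distribution. The conversion counts are made up.
```python
import math
from scipy.stats import norm

def z_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled standard error under the null hypothesis (no difference).
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se_pool
    p_value = 2 * (1 - norm.cdf(abs(z)))          # two-sided
    # Unpooled standard error for the confidence interval on the difference.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    margin = norm.ppf(1 - alpha / 2) * se
    return p_a, p_b, p_value, (p_b - p_a - margin, p_b - p_a + margin)

p_a, p_b, p, ci = z_test(conv_a=1_200, n_a=10_000, conv_b=1_310, n_b=10_000)
print(f"A={p_a:.3f}  B={p_b:.3f}  p-value={p:.4f}  95% CI for lift={ci[0]:.4f}..{ci[1]:.4f}")
```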
Project 5: Multi-Armed Bandits (Dynamic Traffic Optimization)
- File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
- Main Programming Language: Python
- Alternative Programming Languages: Julia, Rust
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The "Industry Disruptor"
- Difficulty: Level 3: Advanced
- Knowledge Area: Machine Learning / Reinforcement Learning
- Software or Tool: Thompson Sampling
- Main Book: "Experimentation for Engineers" by David Sweet
What you'll build: An "Epsilon-Greedy" or "Thompson Sampling" agent that automatically shifts traffic towards the winning variant in real-time. Instead of a fixed 50/50 split, the system "learns" which one is better and minimizes "regret" (lost revenue from showing the bad version).
Why it teaches A/B Testing: Fixed A/B tests are expensive: you waste 50% of your traffic on a potentially inferior version for weeks. Bandits teach you about the Exploration vs. Exploitation tradeoff.
Core challenges you'll face:
- Balancing Exploration → How do you ensure you don't pick a winner too early based on noise?
- Thompson Sampling → Implementing a Bayesian approach where you sample from a Beta distribution.
- Delayed Feedback → What if the conversion happens hours after the assignment?
Real World Outcome
A simulation where you see the traffic split evolve from 50/50 to 90/10 as the system discovers the better version.
Example Simulation Output:
Round 100: Group A (10% conv) | Group B (15% conv) | Split: 50/50
Round 500: Group A (9% conv) | Group B (16% conv) | Split: 30/70
Round 1000: Group A (10% conv)| Group B (15% conv) | Split: 5/95
Total Regret Saved: $4,200 compared to fixed A/B test.
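A minimal Thompson Sampling sketch with Beta-Bernoulli arms. The "true" conversion rates (10% vs. 15%) are invented for the simulation; the prior Beta(1, 1) is a common uninformative choice.
```python
import random

true_rates = {"A": 0.10, "B": 0.15}   # hidden from the agent
alpha = {"A": 1, "B": 1}              # successes + 1 (Beta prior)
beta = {"A": 1, "B": 1}               # failures + 1

for round_ in range(1, 1001):
    # Sample a plausible conversion rate for each arm from its posterior,
    # then show the arm whose sample is highest.
    samples = {arm: random.betavariate(alpha[arm], beta[arm]) for arm in true_rates}
    chosen = max(samples, key=samples.get)
    converted = random.random() < true_rates[chosen]
    alpha[chosen] += converted
    beta[chosen] += not converted
    if round_ % 250 == 0:
        total_pulls = sum(alpha.values()) + sum(beta.values()) - 4
        share_b = (alpha["B"] + beta["B"] - 2) / total_pulls
        print(f"Round {round_}: share of traffic sent to B so far = {share_b:.0%}")
```
As the posteriors sharpen, the share of traffic sent to B drifts toward 100%, which is the "regret minimization" behavior described above.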
The Core Question You're Answering
"If I already suspect Version B is better, why am I still showing Version A to 50% of my users?"
Thinking Exercise
The Epsilon Dilemma
If you set $\epsilon = 0.1$, you spend 10% of your time exploring and 90% exploiting the current winner.
Questions:
- What happens if the world changes (e.g., a holiday season makes Version A better)?
- Does a fixed A/B test handle a changing world better or worse than a Bandit?
Project 6: The Power & Sample Size Calculator (Planning for Success)
- File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
- Main Programming Language: Python
- Alternative Programming Languages: TypeScript, R
- Coolness Level: Level 3: Genuinely Clever
- Business Potential: 2. The "Micro-SaaS / Pro Tool"
- Difficulty: Level 2: Intermediate
- Knowledge Area: Statistics / Product Management
- Software or Tool: Power Analysis Math
- Main Book: "Trustworthy Online Controlled Experiments" by Kohavi et al.
What you'll build: A tool that tells a Product Manager: "To detect a 2% lift in conversion, you need 450,000 users. Based on your current traffic, this test will take 12 days."
Why it teaches A/B Testing: Most experiments fail because they are "underpowered." You'll learn the relationship between Sample Size, Minimum Detectable Effect (MDE), and Statistical Power (1 - $\beta$).
Core challenges you'll face:
- The Inverse Math → Calculating N based on $\alpha$, $\beta$, and $\delta$.
- Baseline Conversion → Why does it take more users to detect a change if your baseline is 1% vs 10%?
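A sketch of the underlying calculation using the normal approximation for two proportions. The baseline, relative MDE, and daily traffic figures are illustrative.
```python
import math
from scipy.stats import norm

def sample_size(baseline: float, mde_relative: float, alpha=0.05, power=0.80) -> int:
    """Approximate users needed PER GROUP to detect a relative lift."""
    p1 = baseline
    p2 = baseline * (1 + mde_relative)          # e.g. a 2% relative lift
    z_alpha = norm.ppf(1 - alpha / 2)           # two-sided significance
    z_beta = norm.ppf(power)                    # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2
    return math.ceil(n)

n = sample_size(baseline=0.05, mde_relative=0.02)
print(f"Need ~{n:,} users per group; at 40,000 users/day the test runs {2 * n / 40_000:.1f} days.")
```
Plugging in a 1% baseline instead of 5% makes n explode, which is exactly the "Baseline Conversion" challenge above.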
Project 7: The SRM (Sample Ratio Mismatch) Guardrail
- File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
- Main Programming Language: Python
- Alternative Programming Languages: SQL, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 3. The "Service & Support" Model
- Difficulty: Level 3: Advanced
- Knowledge Area: Data Quality / Debugging
- Software or Tool: Chi-Squared Test
- Main Book: "Trustworthy Online Controlled Experiments" by Kohavi et al.
What you'll build: An automated alert system that runs a Chi-Squared Goodness of Fit test on your exposure counts. If you asked for a 50/50 split but got 49,000 vs 51,000, the tool calculates whether this deviation is statistically "impossible" (indicating a bug in assignment or logging).
Why it teaches A/B Testing: SRM is the #1 reason for "trust" issues in experimentation. It forces you to think like a detective. Is the variant slower? Is it crashing for Safari users? SRM reveals the hidden bugs.
Core Question:
"If my split isn't exactly what I asked for, can I still trust the results?" (Hint: Usually no).
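A minimal SRM check, assuming SciPy. The counts match the example above; the 0.001 alert threshold is a deliberately strict, illustrative choice.
```python
from scipy.stats import chisquare

observed = [49_000, 51_000]            # exposure counts actually logged
total = sum(observed)
expected = [total * 0.5, total * 0.5]  # what a 50/50 split should produce

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi2={stat:.1f}, p={p_value:.2e}")
if p_value < 0.001:                    # strict threshold to avoid noisy alerts
    print("SAMPLE RATIO MISMATCH: do not trust this experiment's results.")
```
For these counts the p-value is astronomically small, so the 49,000 vs 51,000 split really does indicate a bug rather than bad luck.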
Project 8: The Experiment Results Dashboard (Visualizing Confidence)
- File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
- Main Programming Language: React / Python (Streamlit)
- Alternative Programming Languages: Vue, Svelte
- Coolness Level: Level 2: Practical but Forgettable
- Business Potential: 4. The "Open Core" Infrastructure
- Difficulty: Level 2: Intermediate
- Knowledge Area: Data Visualization / Frontend
- Software or Tool: Plotly / D3.js
- Main Book: "Practical A/B Testing" by Leemay Nassery
What you'll build: A dashboard that visualizes the results of an experiment. It shouldn't just show a bar chart; it must show Confidence Intervals and a "Probability of Being Better" chart.
Why it teaches A/B Testing: You'll learn how to communicate complex stats to non-technical stakeholders. You'll realize that a "statistically significant" result can still be "practically insignificant" if the lift is tiny.
Key Design Question:
- How do you visually represent "uncertainty"? (Error bars? Overlapping distributions?)
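One simple option, sketched with matplotlib and made-up numbers: plot each variant's conversion rate with its confidence interval as error bars, so overlap is immediately visible.
```python
import matplotlib.pyplot as plt

variants = ["Control", "Variant B"]
rates = [0.120, 0.132]
ci_half_widths = [0.006, 0.006]   # illustrative 95% CI half-widths

fig, ax = plt.subplots()
ax.errorbar(variants, rates, yerr=ci_half_widths, fmt="o", capsize=6)
ax.set_ylabel("Conversion rate")
ax.set_title("Overlapping error bars = be careful declaring a winner")
plt.show()
```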
Project 9: Bayesian Experimentation Engine (Probability B > A)
- File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
- Main Programming Language: Python (with PyMC or simple Beta-Binomial math)
- Alternative Programming Languages: Julia, R
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 5. The "Industry Disruptor"
- Difficulty: Level 4: Expert
- Knowledge Area: Bayesian Statistics
- Software or Tool: Conjugate Priors (Beta distribution)
- Main Book: "Experimentation for Engineers" by David Sweet
What you'll build: A stats engine that doesn't output p-values, but instead answers: "What is the probability that Version B is at least 2% better than Version A?" and "How much am I expected to lose if I pick the wrong one?"
Why it teaches A/B Testing: Frequentist stats (p-values) are often unintuitive. Bayesian stats provide direct answers to business questions. You'll learn about Priors, Posteriors, and the Beta Distribution.
Core challenges you'll face:
- Conjugate Priors → Using the Beta-Binomial model to update beliefs as data arrives.
- Monte Carlo Simulation → Sampling from distributions to calculate probabilities.
Real World Outcome
A report that says: "There is a 94% probability that Variant B is better than Control."
Example Output:
[Bayesian Analysis]
Control: Mean=0.12, Interval=[0.11, 0.13]
Variant: Mean=0.14, Interval=[0.12, 0.16]
Probability Variant > Control: 98.4%
Expected Lift: 16.6%
Risk of choosing Variant (Error): 0.02%
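A hedged sketch of the Beta-Binomial update plus the Monte Carlo comparison, using NumPy. The priors, conversion counts, and the definition of "risk" as expected loss are illustrative modeling choices.
```python
import numpy as np

rng = np.random.default_rng(42)
samples = 100_000

# Posterior = Beta(prior_a + conversions, prior_b + non-conversions); Beta(1, 1) is a flat prior.
control = rng.beta(1 + 1_200, 1 + 8_800, samples)   # 1,200 / 10,000 converted
variant = rng.beta(1 + 1_400, 1 + 8_600, samples)   # 1,400 / 10,000 converted

prob_b_better = (variant > control).mean()
expected_lift = ((variant - control) / control).mean()
# Expected loss if we ship the variant but control was actually better.
risk = np.maximum(control - variant, 0).mean()

print(f"P(Variant > Control) = {prob_b_better:.1%}")
print(f"Expected relative lift = {expected_lift:.1%}")
print(f"Risk of choosing Variant = {risk:.4%}")
```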
Project 10: The High-Performance Edge SDK (Speed is a Metric)
- File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
- Main Programming Language: Rust / WebAssembly
- Alternative Programming Languages: C++, Go
- Coolness Level: Level 4: Hardcore Tech Flex
- Business Potential: 4. The "Open Core" Infrastructure
- Difficulty: Level 4: Expert
- Knowledge Area: Low-Level Systems / Edge Computing
- Software or Tool: Shared Memory / FlatBuffers
- Main Book: "Trustworthy Online Controlled Experiments" by Kohavi et al.
What you'll build: A compiled SDK that can perform 1,000,000 assignments per second. It must parse a JSON config into a highly efficient binary structure (like a Radix Tree or a Bitmask) and perform hashing with zero memory allocations.
Why it teaches A/B Testing: Experiments themselves can slow down your app, causing a "negative lift" just because of latency. You'll learn how to build an "invisible" infrastructure.
Core Question:
"How do I make the 'Decision Engine' so fast that the user never knows it happened?"
Project 11: Sequential Testing (Solving the "Peeking" Problem)
- File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
- Main Programming Language: Python
- Alternative Programming Languages: R
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 3. The "Service & Support" Model
- Difficulty: Level 5: Master
- Knowledge Area: Advanced Statistics / Sequential Analysis
- Software or Tool: SPRT (Sequential Probability Ratio Test)
- Main Book: "Statistical Methods in Online A/B Testing" by Georgi Georgiev
What you'll build: A stats engine that allows you to check the results every hour and stop the test early if a result is found, without increasing your "False Positive Rate."
Why it teaches A/B Testing: In standard A/B testing, "peeking" at the results before the sample size is reached is a cardinal sin that leads to false winners. This project teaches you how to use Alpha-Spending functions or mSPRT to enable "Always Valid" p-values.
Thinking Exercise:
- If you flip a coin 10 times and get 7 heads, is it biased? What if you keep flipping until you reach a point where heads are "significantly" more frequent, then stop? Why is the second approach dishonest?
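Before building the engine, it helps to see the problem. This sketch simulates A/A tests (no real difference) and "peeks" after every batch; under these assumed settings, counting any dip below 0.05 shows the peeking false-positive rate landing well above the nominal 5%.
```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
false_positives_peeking, false_positives_fixed = 0, 0

for _ in range(500):                      # 500 simulated A/A experiments
    a = rng.random(10_000) < 0.10         # both groups share the same 10% rate
    b = rng.random(10_000) < 0.10
    significant_at_any_peek = False
    for n in range(1_000, 10_001, 1_000): # peek after every 1,000 users
        p1, p2 = a[:n].mean(), b[:n].mean()
        pooled = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(pooled * (1 - pooled) * 2 / n)
        if se > 0:
            p_value = 2 * (1 - norm.cdf(abs(p2 - p1) / se))
            if p_value < 0.05:
                significant_at_any_peek = True
                if n == 10_000:           # only the final look counts for the fixed-horizon test
                    false_positives_fixed += 1
    false_positives_peeking += significant_at_any_peek

print(f"False positive rate, fixed horizon: {false_positives_fixed / 500:.1%}")
print(f"False positive rate, peeking:       {false_positives_peeking / 500:.1%}")
```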
Project 12: Cluster/Network Randomization (Handling Interference)
- File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
- Main Programming Language: Python / SQL
- Alternative Programming Languages: Java
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The "Industry Disruptor"
- Difficulty: Level 5: Master
- Knowledge Area: Graph Theory / Advanced Experiment Design
- Software or Tool: Network Clustering Algorithms
- Main Book: "Trustworthy Online Controlled Experiments" by Kohavi et al. (Ch. 15)
What you'll build: An assignment engine for a social network or marketplace where users interact. If User A is in the "New Chat" experiment but User B is not, their interaction is tainted. You will build a system that randomizes by City or Cluster rather than by individual ID.
Why it teaches A/B Testing: You'll learn about SUTVA (Stable Unit Treatment Value Assumption) and what happens when it's violated. This is the "final boss" of experimentation design.
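A minimal sketch of the key change: hash the cluster key instead of the user ID, so connected users share a treatment. The city field and experiment name are made up, and SHA-256 again stands in for a faster production hash.
```python
import hashlib

def cluster_bucket(cluster_id: str, experiment_id: str) -> str:
    key = f"{experiment_id}:{cluster_id}".encode("utf-8")
    value = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return "CONTROL" if value % 100 < 50 else "VARIANT"

def assign_user(user: dict, experiment_id: str) -> str:
    # Every user in the same city shares an assignment, limiting interference
    # to the (hopefully rare) interactions that cross cluster boundaries.
    return cluster_bucket(user["city"], experiment_id)

print(assign_user({"id": "u1", "city": "Lisbon"}, "new_chat"))
print(assign_user({"id": "u2", "city": "Lisbon"}, "new_chat"))  # same group as u1
```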
Project Comparison Table
| Project | Difficulty | Time | Depth of Understanding | Fun Factor |
|---|---|---|---|---|
| 1. Deterministic Allocator | Level 1 | Weekend | Medium | ★★★★★ |
| 2. Feature Toggle Engine | Level 2 | 1 Week | Medium | ★★★☆☆ |
| 3. Telemetry Ingestor | Level 2 | 1 Week | High | ★★☆☆☆ |
| 4. Stats Engine (Freq) | Level 3 | 2 Weeks | Very High | ★★★★★ |
| 5. Multi-Armed Bandits | Level 3 | 2 Weeks | Very High | ★★★★★ |
| 6. Power Calculator | Level 2 | Weekend | High | ★★★☆☆ |
| 7. SRM Detector | Level 3 | 1 Week | High | ★★★★★ |
| 8. Results Dashboard | Level 2 | 1 Week | Medium | ★★★☆☆ |
| 9. Bayesian Engine | Level 4 | 2 Weeks | Very High | ★★★★★ |
| 10. Edge SDK | Level 4 | 2 Weeks | High | ★★★★★ |
| 11. Sequential Testing | Level 5 | 1 Month | Master | ★★★★★ |
| 12. Cluster Randomizer | Level 5 | 1 Month | Master | ★★★☆☆ |
Recommendation
Where should you start?
- If you are an Engineer: Start with Project 1 (Allocator) and Project 2 (Toggle Engine). These give you the "infrastructure" mindset. Then move to Project 10 (Edge SDK) to see how to make it production-grade.
- If you are a Data Scientist: Start with Project 4 (Stats Engine) and Project 6 (Power Calculator). This is where you'll find the most "aha!" moments about the math.
- If you want to build a product: Focus on the Full-Stack Project below.
Final Overall Project: "Omni-Test" - The End-to-End Experimentation SaaS
Following the same pattern above, here is your capstone challenge.
- File: LEARN_AB_TESTING_EXPERIMENTATION_PLATFORMS.md
- Main Programming Language: Go (Backend) + React (Frontend) + Python (Stats)
- Coolness Level: Level 5: Pure Magic (Super Cool)
- Business Potential: 5. The "Industry Disruptor"
- Difficulty: Level 5: Master
- Knowledge Area: Full Stack / DevOps / Statistics
- Main Book: "Trustworthy Online Controlled Experiments" by Kohavi et al.
What you'll build: A complete, self-hosted A/B testing platform that includes:
- A Web UI to create experiments, define splits, and set targeting rules.
- A Go-based Proxy or Sidecar that intercepts requests and injects feature flags.
- A Data Pipeline (Kafka + ClickHouse) that ingests millions of events.
- An Automated Analyst that runs daily stats, checks for SRM, and sends Slack alerts when a winner is found.
Why it teaches A/B Testing: This project forces you to integrate every concept. You'll deal with real-world problems like "How do I handle experiment collisions?" and "How do I make the UI intuitive for a Product Manager?"
Success Criteria:
- You can create an experiment in the UI.
- A separate app uses your SDK/Proxy and changes behavior.
- You can simulate 10,000 users and see a "Significant" result appear in your dashboard.
Summary
This learning path covers A/B Testing & Experimentation Platforms through 13 hands-on projects. Here's the complete list:
| # | Project Name | Main Language | Difficulty | Time Estimate |
|---|---|---|---|---|
| 1 | Deterministic Allocator | Go | Level 1 | Weekend |
| 2 | Feature Toggle Engine | TypeScript | Level 2 | 1 Week |
| 3 | Telemetry Ingestor | Python | Level 2 | 1 Week |
| 4 | Stats Engine (Frequentist) | Python | Level 3 | 2 Weeks |
| 5 | Multi-Armed Bandits | Python | Level 3 | 2 Weeks |
| 6 | Power Calculator | Python | Level 2 | Weekend |
| 7 | SRM Detector | Python | Level 3 | 1 Week |
| 8 | Results Dashboard | React | Level 2 | 1 Week |
| 9 | Bayesian Engine | Python | Level 4 | 2 Weeks |
| 10 | Edge SDK | Rust/Wasm | Level 4 | 2 Weeks |
| 11 | Sequential Testing | Python | Level 5 | 1 Month |
| 12 | Cluster Randomization | Python | Level 5 | 1 Month |
| 13 | "Omni-Test" SaaS | Polyglot | Level 5 | 3 Months+ |
Recommended Learning Path
For beginners: Start with projects #1, #2, #3, and #6. For intermediate: Focus on #4, #7, #8, and #10. For advanced: Master #5, #9, #11, and #12.
Expected Outcomes
After completing these projects, you will:
- Understand the binary mechanics of deterministic user assignment.
- Be able to build scalable data pipelines for telemetry ingestion.
- Master the statistical rigor (p-values, CI, power) required for trustworthy results.
- Understand the business tradeoffs between fixed experiments and dynamic bandits.
- Be capable of architecting a production-grade experimentation platform used by elite engineering teams.
You'll have built a suite of working tools that demonstrate deep understanding of Experimentation from first principles.