A/B testing, also known as split testing, is a statistical method used in data science to compare two versions of a product, webpage, or marketing campaign to determine which one performs better based on a specific metric. This approach allows data scientists and marketers to make data-driven decisions rather than relying on intuition or guesswork.
A/B Testing General Procedure
- Problem Statement
- Hypothesis Testing
- Design the Experiment
- Run the Experiment
- Validity Checks
- Interpret the Results
- Launch Decision
Tips for Designing a Good Experiment
- Focus on the business goal first (user journey).
- Use the user funnel to identify the success metric.
- A success metric must be: measurable, attributable, sensitive and timely.
Example
Experiment Setup
A web store wants to change the product ranking recommendation system.
- Success Metric: revenue per day per user.
- Null Hypothesis ($H_0$): the average revenue per day per user is the same for the baseline and variant ranking algorithms.
- Alternative Hypothesis ($H_1$): the average revenue per day per user differs between the baseline and variant ranking algorithms.
- Significance Level ($\alpha = 0.05$): if the p-value is less than $\alpha$, then reject $H_0$ and conclude $H_1$.
- Statistical Power ($1-\beta$): the probability of detecting an effect when the alternative hypothesis is true.
- Minimum Detectable Effect ($\delta$ = 1% lift): if the change yields at least 1% higher revenue per day per user, it is practically significant.
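A minimal sketch of how the significance level, power target, and minimum detectable effect translate into a required sample size per group, using statsmodels' `TTestIndPower`. The 80% power target and the baseline revenue mean and standard deviation are assumptions for illustration, not values from the experiment.

```python
# Sketch: required sample size per group from alpha, power, and the MDE.
# The power target, baseline mean, and standard deviation are hypothetical.
from statsmodels.stats.power import TTestIndPower

alpha = 0.05          # significance level
power = 0.80          # statistical power (1 - beta), assumed
baseline_mean = 5.0   # assumed revenue per day per user
baseline_std = 5.0    # assumed standard deviation of revenue per day per user

mde = 0.01 * baseline_mean         # minimum detectable effect: 1% lift
effect_size = mde / baseline_std   # standardized (Cohen's d) effect size

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size, alpha=alpha, power=power, alternative="two-sided"
)
print(f"Required sample size per group: {n_per_group:,.0f}")
```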
Running the Experiment
- Set the randomization unit: user
- Target population in the experiment: visitors who search for a product
- Determine the sample size: $n \approx \frac{16\sigma^2}{\delta^2}$ per group, where $\sigma$ is the sample standard deviation and $\delta$ is the difference between the control and treatment (based on $\alpha$ and $\beta$)
- Define the experiment duration
- Running
- Set up instruments and data pipelines to collect data
- Avoid peeking at p-values
- Validity checks (search for bias)
- Check for instrumentation effects
- External factors
- Selection bias
- Sample ratio mismatch (see the SRM check sketch after this list)
- Novelty effect (e.g. segment by new and old visitors)
- Interpret the results
- Launch decision
- Metric trade-offs
- Cost of launching
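The sample ratio mismatch (SRM) check mentioned above is commonly run as a chi-squared goodness-of-fit test on the assignment counts. A minimal sketch with hypothetical counts and an assumed 50/50 target split:

```python
# Sketch: sample ratio mismatch (SRM) check via a chi-squared goodness-of-fit test.
# The counts are hypothetical and the expected split is assumed to be 50/50.
from scipy.stats import chisquare

observed = [10_128, 9_872]              # users assigned to control / treatment
expected = [sum(observed) / 2] * 2      # expected counts under a 50/50 split

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"SRM chi-squared p-value: {p_value:.4f}")
if p_value < 0.001:                     # a strict threshold is common for SRM alerts
    print("Possible sample ratio mismatch - investigate before trusting the results.")
```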
Statistical Tests
Discrete Metrics
- Fisher’s Exact Test
- Pearson’s Chi-Squared Test
Continuous Metrics
- Z-Test
- Student’s T-Test
- Welch’s T-Test
- Mann-Whitney U Test
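All of the tests above are available in `scipy.stats` (the z-test comes from statsmodels). A minimal sketch with simulated data, so the numbers are only illustrative:

```python
# Sketch: the tests listed above, applied to simulated data.
import numpy as np
from scipy import stats
from statsmodels.stats.weightstats import ztest

rng = np.random.default_rng(42)

# Discrete metric: a 2x2 contingency table of conversions vs. non-conversions.
table = np.array([[120, 880],    # control: converted / not converted
                  [150, 850]])   # treatment: converted / not converted
_, p_fisher = stats.fisher_exact(table)
chi2, p_chi2, dof, _ = stats.chi2_contingency(table)
print("Fisher's exact p-value:", p_fisher)
print("Pearson's chi-squared p-value:", p_chi2)

# Continuous metric: revenue per user in two groups (simulated).
control = rng.normal(loc=5.0, scale=2.0, size=1_000)
treatment = rng.normal(loc=5.2, scale=2.0, size=1_000)
print("Z-test p-value:", ztest(control, treatment)[1])
print("Student's t-test p-value:", stats.ttest_ind(control, treatment, equal_var=True).pvalue)
print("Welch's t-test p-value:", stats.ttest_ind(control, treatment, equal_var=False).pvalue)
print("Mann-Whitney U p-value:", stats.mannwhitneyu(control, treatment, alternative="two-sided").pvalue)
```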
Choosing the Right Test
```mermaid
flowchart TD
    A{Discrete metric ?} -->|Yes| B{Large sample size ?}
    B -->|Yes| C[Pearson's X2 Test]
    B -->|No| D[Fisher's Exact Test]
    A -->|No| E{Large sample size ?}
    E -->|Yes| F{Variances known ?}
    E -->|No| G{Normal distributions ?}
    G -->|No| H[Mann-Whitney U Test]
    G -->|Yes| F
    F -->|Yes| J[Z-Test]
    F -->|No| K{Similar variances ?}
    K -->|Yes| L[Student's T-Test]
    K -->|No| M[Welch's T-Test]
```
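The flowchart's decision logic can also be written out directly. A sketch in which the function name and boolean flags are illustrative, not part of any library:

```python
# Sketch: the flowchart above as a plain function (names and flags are illustrative).
def choose_test(discrete_metric: bool,
                large_sample: bool,
                variances_known: bool = False,
                normal_distributions: bool = False,
                similar_variances: bool = False) -> str:
    if discrete_metric:
        return "Pearson's chi-squared test" if large_sample else "Fisher's exact test"
    if not large_sample and not normal_distributions:
        return "Mann-Whitney U test"
    if variances_known:
        return "Z-test"
    return "Student's t-test" if similar_variances else "Welch's t-test"

print(choose_test(discrete_metric=True, large_sample=True))   # Pearson's chi-squared test
print(choose_test(discrete_metric=False, large_sample=True,
                  variances_known=False, similar_variances=False))  # Welch's t-test
```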
A/B Testing Example - Simulating Click Data
|       | Click | Group Label |
|-------|-------|-------------|
| 0     | 1     | Experiment  |
| 1     | 0     | Experiment  |
| 2     | 1     | Experiment  |
| 3     | 0     | Experiment  |
| 4     | 1     | Experiment  |
| ...   | ...   | ...         |
| 19995 | 1     | Control     |
| 19996 | 0     | Control     |
| 19997 | 0     | Control     |
| 19998 | 0     | Control     |
| 19999 | 1     | Control     |

20000 rows × 2 columns
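A short simulation can generate a table like this one. A sketch assuming 10,000 users per group and hypothetical true click probabilities chosen to roughly match the observed rates:

```python
# Sketch: simulate the click data shown above.
# Group sizes and true click probabilities are assumptions chosen to roughly
# match the observed rates (~0.42 control, ~0.50 experiment).
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_per_group = 10_000

clicks_exp = rng.binomial(n=1, p=0.50, size=n_per_group)   # experiment group
clicks_ctrl = rng.binomial(n=1, p=0.42, size=n_per_group)  # control group

df = pd.DataFrame({
    "Click": np.concatenate([clicks_exp, clicks_ctrl]),
    "Group Label": ["Experiment"] * n_per_group + ["Control"] * n_per_group,
})
print(df)  # 20000 rows x 2 columns
```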
# Clicks in 'Control' group: 4158
# Clicks in 'Experiment' group: 5025
Click probability in 'Control' group: 0.4158
Click probability in 'Experiment' group: 0.5025
Standard Error: 0.007047428999287613
Test statistic (z): -12.302358776337297
Critical z value (alpha = 0.05, two-sided): 1.959963984540054
p-value: 8.796717238230464e-35
Reject Ho !
95% confidence interval for the difference in click probability: [np.float64(0.073), np.float64(0.101)]
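These outputs are consistent with a two-proportion z-test. A sketch of the computation, reusing the `df` DataFrame from the simulation sketch above:

```python
# Sketch: two-proportion z-test that reproduces the style of output above.
# `df` is the simulated DataFrame from the previous sketch.
import numpy as np
from scipy.stats import norm

alpha = 0.05
ctrl = df.loc[df["Group Label"] == "Control", "Click"]
exp = df.loc[df["Group Label"] == "Experiment", "Click"]

n_ctrl, n_exp = len(ctrl), len(exp)
p_ctrl, p_exp = ctrl.mean(), exp.mean()
print(f"# Clicks in 'Control' group: {ctrl.sum()}")
print(f"# Clicks in 'Experiment' group: {exp.sum()}")
print(f"Click probability in 'Control' group: {p_ctrl}")
print(f"Click probability in 'Experiment' group: {p_exp}")

# Pooled standard error, used for the test statistic.
p_pooled = (ctrl.sum() + exp.sum()) / (n_ctrl + n_exp)
se_pooled = np.sqrt(p_pooled * (1 - p_pooled) * (1 / n_ctrl + 1 / n_exp))
print(f"Standard Error: {se_pooled}")

z = (p_ctrl - p_exp) / se_pooled
z_crit = norm.ppf(1 - alpha / 2)
p_value = 2 * norm.sf(abs(z))
print(f"Test statistic (z): {z}")
print(f"Critical z value: {z_crit}")
print(f"p-value: {p_value}")
print("Reject Ho !" if p_value < alpha else "Fail to reject Ho")

# Unpooled standard error, used for the confidence interval of the difference.
se_unpooled = np.sqrt(p_ctrl * (1 - p_ctrl) / n_ctrl + p_exp * (1 - p_exp) / n_exp)
diff = p_exp - p_ctrl
ci = [round(diff - z_crit * se_unpooled, 3), round(diff + z_crit * se_unpooled, 3)]
print(f"95% confidence interval for the difference in click probability: {ci}")
```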