A/B testing, also known as split testing, is a statistical method used in data science to compare two versions of a product, webpage, or marketing campaign to determine which one performs better based on a specific metric. This approach allows data scientists and marketers to make data-driven decisions rather than relying on intuition or guesswork.
A/B Testing General Procedure
- Problem Statement
- Hypothesis Testing
- Design the Experiment
- Run the Experiment
- Validity Checks
- Interpret the Results
- Launch Decision
Tips for Designing a Good Experiment
- Focus on the business goal first (user journey).
- Use the user funnel to identify the success metric.
- A success metric must be: measurable, attributable, sensitive and timely.
Example
Experiment Setup
A web store wants to change the product ranking recommendation system.
- Success Metric: revenue per day per user.
- Null Hypothesis (H₀): the average revenue per day per user is the same for the baseline and variant ranking algorithms.
- Alternative Hypothesis (H₁): the average revenue per day per user differs between the baseline and variant ranking algorithms.
- Significance Level (α = 0.05): if the p-value is less than α, then reject H₀ and conclude H₁.
- Statistical Power (1 − β): the probability of detecting an effect if the alternative hypothesis is true.
- Minimum Detectable Effect (δ = 1% lift): if the change is at least 1% higher in revenue per day per user, then it is practically significant.
Running the Experiment
- Set the randomization unit: user
- Target population in the experiment: visitors who search for a product
- Determine the sample size: n ≈ 16σ² / δ² per group, where σ is the sample standard deviation and δ is the difference between the control and treatment (based on α and 1 − β); see the first sketch after this list
- Define the experiment duration
- Running
- Set up instruments and data pipelines to collect data
- Avoid peeking at p-values
- Validity checks (look for sources of bias)
- Check for instrumentation effects
- External factors
- Selection bias
- Sample ratio mismatch (see the SRM check sketched after this list)
- Novelty effect (e.g. segment by new and old visitors)
- Interpret the results
- Launch decision
- Metric trade-offs
- Cost of launching
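A minimal sketch of the sample-size step above, using the rule-of-thumb formula and a statsmodels power analysis as a cross-check. The values of σ and δ are illustrative assumptions, not measurements from a real experiment:

import numpy as np
from statsmodels.stats.power import NormalIndPower

# Illustrative assumptions (not measured values)
sigma = 20.0  # sample standard deviation of revenue per day per user
delta = 2.0   # minimum detectable difference between control and treatment
alpha = 0.05  # significance level
power = 0.80  # 1 - beta

# Rule of thumb for alpha = 0.05 and power = 0.8: n ≈ 16 * sigma^2 / delta^2 per group
n_rule_of_thumb = 16 * sigma**2 / delta**2

# Cross-check with a normal-approximation power analysis
n_power = NormalIndPower().solve_power(
    effect_size=delta / sigma,  # standardized effect size
    alpha=alpha,
    power=power,
    ratio=1.0,                  # equal group sizes
    alternative="two-sided",
)

print(f"Rule of thumb: {n_rule_of_thumb:.0f} users per group")
print(f"Power analysis: {np.ceil(n_power):.0f} users per group")

And a quick sample ratio mismatch (SRM) check, assuming a 50/50 split was intended (the observed counts are made up):

from scipy.stats import chisquare

observed = [50_912, 49_088]         # hypothetical users assigned to control / treatment
expected = [sum(observed) / 2] * 2  # 50/50 split expected by design

_, p_srm = chisquare(f_obs=observed, f_exp=expected)
print(f"SRM check p-value: {p_srm:.4f}")  # a very small p-value flags a mismatch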
Statistical Tests
Discrete Metrics
- Fisher’s Exact Test
- Pearson’s Chi-Squared Test
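Both tests are available in SciPy; a minimal sketch on a hypothetical 2×2 contingency table (the counts are made up):

from scipy.stats import fisher_exact, chi2_contingency

# Hypothetical 2x2 contingency table: rows = control / experiment,
# columns = clicked / did not click
table = [
    [420, 580],  # control
    [510, 490],  # experiment
]

# Fisher's exact test: exact p-value, preferred for small samples
_, p_fisher = fisher_exact(table)

# Pearson's chi-squared test: large-sample approximation
chi2_stat, p_chi2, dof, _ = chi2_contingency(table)

print(f"Fisher's exact test p-value: {p_fisher:.4f}")
print(f"Pearson's chi-squared p-value: {p_chi2:.4f}")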
Continuous Metrics
- Z-Test
- Student’s T-Test
- Welch’s T-Test
- Mann-Whitney U Test
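For a continuous metric such as revenue per user, a minimal SciPy sketch on simulated, right-skewed data (illustrative only):

import numpy as np
from scipy.stats import ttest_ind, mannwhitneyu

rng = np.random.default_rng(42)

# Simulated revenue per user: right-skewed, so normality is questionable
revenue_control = rng.exponential(scale=30.0, size=5000)
revenue_variant = rng.exponential(scale=31.5, size=5000)

# Welch's t-test: does not assume equal variances
_, p_welch = ttest_ind(revenue_variant, revenue_control, equal_var=False)

# Mann-Whitney U test: non-parametric, no normality assumption
_, p_mwu = mannwhitneyu(revenue_variant, revenue_control, alternative="two-sided")

print(f"Welch's t-test p-value: {p_welch:.4f}")
print(f"Mann-Whitney U p-value: {p_mwu:.4f}")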
Choosing the Right Test
flowchart TD
A{Discrete metric?} -->|Yes| B{Large sample size?}
B -->|Yes| C[Pearson's Chi-Squared Test]
B -->|No| D[Fisher's Exact Test]
A -->|No| E{Large sample size?}
E -->|Yes| F{Variances known?}
E -->|No| G{Normal distributions?}
G -->|No| H[Mann-Whitney U Test]
G -->|Yes| F
F -->|Yes| J[Z-Test]
F -->|No| K{Similar variances?}
K -->|Yes| L[Student's T-Test]
K -->|No| M[Welch's T-Test]
A/B Testing Example - Simulating Click Data
import pandas as pd
import numpy as np
from scipy.stats import norm
N_experiment = 10000
N_control = 10000
alpha = 0.05
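# simulate clicks: ~50% click probability in the experiment group vs ~42% in the control group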
click_experiment = pd.Series(np.random.binomial(1, 0.5, size=N_experiment))
click_control = pd.Series(np.random.binomial(1, 0.42, size=N_control))
df = pd.concat(
[
pd.DataFrame(
{
"Click": click_experiment,
"Group Label": "Experiment",
}
),
pd.DataFrame(
{
"Click": click_control,
"Group Label": "Control",
}
),
]
).reset_index(drop=True)
df
| | Click | Group Label |
|---|---|---|
| 0 | 1 | Experiment |
| 1 | 0 | Experiment |
| 2 | 1 | Experiment |
| 3 | 0 | Experiment |
| 4 | 1 | Experiment |
| ... | ... | ... |
| 19995 | 1 | Control |
| 19996 | 0 | Control |
| 19997 | 0 | Control |
| 19998 | 0 | Control |
| 19999 | 1 | Control |

20000 rows × 2 columns
X_experiment = df.groupby("Group Label")["Click"].sum().loc["Experiment"]
X_control = df.groupby("Group Label")["Click"].sum().loc["Control"]
print(
f"# Clicks in 'Control' group: {X_control}\n# Clicks in 'Experiment' group: {X_experiment}"
)
# Clicks in 'Control' group: 4158
# Clicks in 'Experiment' group: 5025
# calculating probabilities
p_experiment_hat = X_experiment / N_experiment
p_control_hat = X_control / N_control
print(
f"Click probability in 'Control' group: {p_control_hat}\nClick probability in 'Experiment' group: {p_experiment_hat}"
)
Click probability in 'Control' group: 0.4158
Click probability in 'Experiment' group: 0.5025
p_pooled_hat = (X_control + X_experiment) / (N_control + N_experiment)
pooled_variance = (
p_pooled_hat * (1 - p_pooled_hat) * (1 / N_control + 1 / N_experiment)
)
SE = np.sqrt(pooled_variance)
print(f"Standard Error: {SE}")
Standard Error: 0.007047428999287613
# Z-Test
test_stat = (p_control_hat - p_experiment_hat) / SE
print(test_stat)
-12.302358776337297
z_crit = norm.ppf(1 - alpha / 2)
print(z_crit)
1.959963984540054
p_val = 2 * norm.sf(abs(test_stat))
print(p_val)
8.796717238230464e-35
if p_val < alpha:
    print("Reject H0!")
else:
    print("Fail to reject H0!")
Reject H0!
# confidence interval
CI = [
round((p_experiment_hat - p_control_hat) - SE * z_crit, 3),
round((p_experiment_hat - p_control_hat) + SE * z_crit, 3),
]
CI
[np.float64(0.073), np.float64(0.101)]
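As a sanity check, the same two-proportion z-test can be reproduced with statsmodels, reusing the counts computed above; the statistic should match the manual calculation up to sign:

from statsmodels.stats.proportion import proportions_ztest

# Two-proportion z-test on the aggregated click counts
stat, p_value = proportions_ztest(
    count=[X_experiment, X_control],
    nobs=[N_experiment, N_control],
)
print(f"z statistic: {stat:.3f}, p-value: {p_value:.3e}")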