A/B testing, also known as split testing, is a statistical method used in data science to compare two versions of a product, webpage, or marketing campaign to determine which one performs better based on a specific metric. This approach allows data scientists and marketers to make data-driven decisions rather than relying on intuition or guesswork.

A/B Testing General Procedure

  1. Problem Statement
  2. Hypothesis Testing
  3. Design the Experiment
  4. Run the Experiment
  5. Validity Checks
  6. Interpret the Results
  7. Launch Decision

Tips for Designing a Good Experiment

  • Focus on the business goal first (user journey).
  • Use the user funnel to identify the success metric.
  • A success metric must be: measurable, attributable, sensitive and timely.

Example

Experiment Setup

A web store wants to change the product ranking recommendation system.

  • Success Metric: revenue per day per user.
  • Null Hypothesis (H₀): the average revenue per day per user is the same for the baseline and variant ranking algorithms.
  • Alternative Hypothesis (H₁): the average revenue per day per user differs between the baseline and variant ranking algorithms.
  • Significance Level (α): If the p-value is less than α, then reject H₀ and conclude H₁.
  • Statistical Power (1 − β): The probability of detecting an effect if the alternative hypothesis is true.
  • Minimum Detectable Effect (δ = 1% lift): If the change is at least 1% higher in revenue per day per user, then it is practically significant.
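
As a quick sketch (not part of the original setup), the design parameters above can be turned into a required sample size per group with the standard two-sample formula n = 2σ²(z₁₋α/₂ + z₁₋β)²/δ², assuming the conventional α = 0.05 and 80% power; the standard deviation and baseline revenue below are made-up numbers for illustration.

import numpy as np
from scipy.stats import norm

alpha = 0.05             # significance level
power = 0.80             # statistical power (1 - beta)
sigma = 20.0             # assumed std. dev. of revenue per day per user (hypothetical)
baseline = 5.0           # assumed baseline revenue per day per user (hypothetical)
delta = 0.01 * baseline  # 1% minimum detectable lift, in absolute terms

z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value
z_beta = norm.ppf(power)

# users required per group for a two-sample comparison of means
n_per_group = 2 * sigma**2 * (z_alpha + z_beta) ** 2 / delta**2
print(f"Required sample size per group: {int(np.ceil(n_per_group))}")

With α = 0.05 and 80% power, 2(z₁₋α/₂ + z₁₋β)² ≈ 15.7, which is where the n ≈ 16σ²/δ² rule of thumb used in the next section comes from.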

Running the Experiment

  1. Set the randomization unit: user
  2. Target population in the experiment: visitors who search for a product
  3. Determine the sample size: n ≈ 16σ²/δ², where σ is the sample standard deviation and δ is the difference between the control and treatment (the constant is based on α and β)
  4. Define the experiment duration
  5. Running
    • Set up instruments and data pipelines to collect data
    • Avoid peeking at p-values
  6. Validity checks (search for bias)
    • Check for instrumentation effects
    • External factors
    • Selection bias
    • Sample ratio mismatch (see the sketch after this list)
    • Novelty effect (e.g. segment by new and old visitors)
  7. Interpret the results
  8. Launch decision
    • Metric trade-offs
    • Cost of launching
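
The sample ratio mismatch check mentioned in step 6 can be sketched as a chi-squared goodness-of-fit test on the observed group sizes; the counts below are hypothetical and assume an intended 50/50 split.

from scipy.stats import chisquare

# observed number of users assigned to each group (hypothetical counts)
observed = [10120, 9880]  # control, treatment

# under an intended 50/50 split the expected counts are equal,
# which is chisquare's default when f_exp is not given
stat, p_value = chisquare(observed)
print(f"SRM check p-value: {p_value:.4f}")

if p_value < 0.01:  # a stricter threshold than 0.05 is common for SRM alarms
    print("Possible sample ratio mismatch - investigate before trusting the results")
else:
    print("No evidence of sample ratio mismatch")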

Statistical Tests

Discrete Metrics

  • Fisher’s Exact Test
  • Pearson’s Chi-Squared Test
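
As a minimal sketch, both tests are available in scipy and can be run on a 2×2 contingency table of clicks vs. no-clicks per group; the counts here are made up.

from scipy.stats import fisher_exact, chi2_contingency

# rows: control / experiment, columns: clicked / did not click (hypothetical counts)
table = [[412, 588],
         [501, 499]]

odds_ratio, p_fisher = fisher_exact(table)             # exact, suited to small samples
chi2, p_chi2, dof, expected = chi2_contingency(table)  # large-sample approximation

print(f"Fisher's exact test p-value:   {p_fisher:.4f}")
print(f"Pearson's chi-squared p-value: {p_chi2:.4f}")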

Continuous Metrics

  • Z-Test
  • Student’s T-Test
  • Welch’s T-Test
  • Mann-Whitney U Test
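
For continuous metrics, a minimal sketch of how each test can be run in Python is below; scipy has no dedicated two-sample z-test, so it is written out with scipy.stats.norm, and the simulated revenue numbers are only for illustration.

import numpy as np
from scipy.stats import norm, ttest_ind, mannwhitneyu

rng = np.random.default_rng(42)
control = rng.normal(loc=5.0, scale=2.0, size=5000)    # simulated revenue per user
treatment = rng.normal(loc=5.1, scale=2.0, size=5000)

# Z-Test (large samples; variances estimated from the data)
se = np.sqrt(control.var(ddof=1) / control.size + treatment.var(ddof=1) / treatment.size)
z = (treatment.mean() - control.mean()) / se
print(f"Z-Test:           p = {2 * norm.sf(abs(z)):.4f}")

# Student's T-Test (assumes similar variances)
print(f"Student's T-Test: p = {ttest_ind(treatment, control, equal_var=True).pvalue:.4f}")

# Welch's T-Test (does not assume equal variances)
print(f"Welch's T-Test:   p = {ttest_ind(treatment, control, equal_var=False).pvalue:.4f}")

# Mann-Whitney U Test (non-parametric, no normality assumption)
print(f"Mann-Whitney U:   p = {mannwhitneyu(treatment, control, alternative='two-sided').pvalue:.4f}")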

Choosing the Right Test

flowchart TD

A{Discrete metric ?} -->|Yes| B{Large sample size ?}
B -->|Yes| C[Pearson's X2 Test]
B -->|No| D[Fisher's Exact Test]
A -->|No| E{Large sample size ?}
E -->|Yes| F{Variances known ?}
E -->|No| G{Normal distributions ?}
G -->|No| H[Mann-Whitney U Test]
G -->|Yes| F
F -->|Yes| J[Z-Test]
F -->|No| K{Similar variances ?}
K -->|Yes| L[Student's T-Test]
K -->|No| M[Welch's T-Test]
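
For reference, the decision logic in the flowchart can be written down as a small helper; choose_test is just an illustrative name, not an established API.

def choose_test(discrete, large_sample, variances_known=False,
                normal=False, similar_variances=False):
    """Return the test suggested by the flowchart above."""
    if discrete:
        return "Pearson's Chi-Squared Test" if large_sample else "Fisher's Exact Test"
    # continuous metric
    if not large_sample and not normal:
        return "Mann-Whitney U Test"
    if variances_known:
        return "Z-Test"
    return "Student's T-Test" if similar_variances else "Welch's T-Test"

print(choose_test(discrete=True, large_sample=True))                          # Pearson's Chi-Squared Test
print(choose_test(discrete=False, large_sample=True, variances_known=True))   # Z-Test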

A/B Testing Example - Simulating Click Data

import pandas as pd
import numpy as np
from scipy.stats import norm

# number of users in each group
N_experiment = 10000
N_control = 10000

# significance level
alpha = 0.05

# simulate clicks as Bernoulli draws: 50% click probability in the
# experiment group vs. 42% in the control group
click_experiment = pd.Series(np.random.binomial(1, 0.5, size=N_experiment))
click_control = pd.Series(np.random.binomial(1, 0.42, size=N_control))
df = pd.concat(
    [
        pd.DataFrame(
            {
                "Click": click_experiment,
                "Group Label": "Experiment",
            }
        ),
        pd.DataFrame(
            {
                "Click": click_control,
                "Group Label": "Control",
            }
        ),
    ]
).reset_index(drop=True)
df
       Click Group Label
0          1  Experiment
1          0  Experiment
2          1  Experiment
3          0  Experiment
4          1  Experiment
...      ...         ...
19995      1     Control
19996      0     Control
19997      0     Control
19998      0     Control
19999      1     Control

20000 rows × 2 columns

# total number of clicks in each group
X_experiment = df.groupby("Group Label")["Click"].sum().loc["Experiment"]
X_control = df.groupby("Group Label")["Click"].sum().loc["Control"]
print(
    f"# Clicks in 'Control' group: {X_control}\n# Clicks in 'Experiment' group: {X_experiment}"
)
# Clicks in 'Control' group: 4158
# Clicks in 'Experiment' group: 5025
# calculating probabilities
p_experiment_hat = X_experiment / N_experiment
p_control_hat = X_control / N_control
print(
    f"Click probability in 'Control' group: {p_control_hat}\nClick probability in 'Experiment' group: {p_experiment_hat}"
)
Click probability in 'Control' group: 0.4158
Click probability in 'Experiment' group: 0.5025
# pooled click probability under the null hypothesis
p_pooled_hat = (X_control + X_experiment) / (N_control + N_experiment)
# variance and standard error of the difference in proportions
pooled_variance = (
    p_pooled_hat * (1 - p_pooled_hat) * (1 / N_control + 1 / N_experiment)
)
SE = np.sqrt(pooled_variance)
print(f"Standard Error: {SE}")
Standard Error: 0.007047428999287613
# Z-Test
test_stat = (p_control_hat - p_experiment_hat) / SE
print(test_stat)
-12.302358776337297
z_crit = norm.ppf(1 - alpha / 2)
print(z_crit)
1.959963984540054
p_val = 2 * norm.sf(abs(test_stat))
print(p_val)
8.796717238230464e-35
if p_val < alpha:
    print("Reject Ho !")
else:
    print("Fail to reject Ho !")
Reject Ho !
# confidence interval
CI = [
    round((p_experiment_hat - p_control_hat) - SE * z_crit, 3),
    round((p_experiment_hat - p_control_hat) + SE * z_crit, 3),
]
CI
[np.float64(0.073), np.float64(0.101)]
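
To tie the interval back to a launch decision, one option is to compare it with a minimum detectable effect; the 0.01 threshold below is a hypothetical MDE for this simulated click example, not part of the original setup.

# hypothetical minimum detectable effect: 1 percentage point absolute lift in click probability
mde = 0.01

lower, upper = CI  # confidence interval on (p_experiment_hat - p_control_hat) from above
if lower >= mde:
    print("Lift is practically significant - lean towards launching")
elif upper < mde:
    print("Lift is statistically detectable but below the practical threshold")
else:
    print("Practical significance is inconclusive - consider collecting more data")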