## Common problems and pitfalls of A/B tests

### Confounding Effects

### Selection Bias

### Systematic Bias

### Early Stopping or P-hacking

### Spillover or Network Effects

### Change Aversion and Novelty Effects

### Sample Ratio Mismatch

### Inadequate Choice of Test Period

### Running too many tests at the same time

This is a straightforward guide to an A/B test case study created and utilized by Udacity, as part of my Data Scientist Nanodegree program. Further information about their program can be found here.

Audacity is an online learning platform that offers courses to its users. The user journey on Audacity’s website can be summarized as follows:

Home page viewing > Course exploration > Course overview page viewing > Course enrollment > Course completion.

However, Audacity lost users as they progressed through these stages, with only a small number reaching the final stage. To increase student engagement, Audacity ran A/B tests on changes intended to improve conversion rates from one stage to the next. The first experiment replaced the homepage with a more appealing design, with the goal of increasing the number of users who move on to the second stage of the funnel, course exploration.

The metric used was the **click-through rate (CTR)** for the Explore Courses button on the home page. So the null and alternative hypotheses were:

*H0: CTR_new - CTR_old <= 0*

*H1: CTR_new - CTR_old > 0*

In this experiment, we defined CTR as **the number of unique visitors who click at least once divided by the number of unique visitors who view the page.**

They collected data for 4 months with a control group size of 3332, and an experiment group size of 2996.

We calculated the **control group CTR** and **experiment group CTR**:

```
# df is assumed to be the experiment's action log, already loaded as a pandas DataFrame
control_df = df.query('group == "control"')  # select control group
control_ctr = control_df.query('action == "click"').id.nunique() / control_df.query('action == "view"').id.nunique()  # calculate CTR
experiment_df = df.query('group == "experiment"')  # select experiment group
experiment_ctr = experiment_df.query('action == "click"').id.nunique() / experiment_df.query('action == "view"').id.nunique()  # calculate CTR
```

Here we have a control CTR of 0.27971 and an experiment CTR of 0.30975.

1. We computed the observed difference in the metric, i.e. the click-through rate, between the control and experiment groups.

```
obs_diff = experiment_ctr - control_ctr #calculate obs difference
```

Here we have the observed difference of 0.03003.

2. We simulated the sampling distribution using bootstrapping for the difference in proportions (or difference in click-through rates).

```
import numpy as np
import matplotlib.pyplot as plt

diffs = []
for _ in range(10000):  # simulate 10000 times
    b_sample = df.sample(df.shape[0], replace=True)  # bootstrap: resample with replacement
    control_df = b_sample.query('group == "control"')
    experiment_df = b_sample.query('group == "experiment"')
    control_ctr = control_df.query('action == "click"').id.nunique() / control_df.query('action == "view"').id.nunique()
    experiment_ctr = experiment_df.query('action == "click"').id.nunique() / experiment_df.query('action == "view"').id.nunique()
    diffs.append(experiment_ctr - control_ctr)
diffs = np.array(diffs)  # convert simulated differences to a NumPy array
plt.hist(diffs, bins=50);
```

3. We used this sampling distribution to simulate the distribution under the null hypothesis, by creating a random normal distribution centered at 0 with the same spread and size.

```
np.random.seed(42) #make result reproducible
null_vals = np.random.normal(0, diffs.std(), diffs.size)
ax = plt.hist(null_vals, bins=50)
plt.axvline(x=obs_diff, color='red', linestyle='--')
plt.text(x = obs_diff*1.05, y = ax[0].mean(), s=' observed\ndifference');
```

As the plot suggests, the chance of seeing our observed difference under the null hypothesis is small.

4. We computed the p-value as the proportion of values in the null distribution that were greater than our observed difference.

```
p_value = (null_vals > obs_diff).mean()
```

Here the p-value comes out to 0.00660.

5. We used this p-value to determine the **statistical significance** of our observed difference.

With a type I error rate tolerance of 0.05, the p-value of 0.0066 is below the threshold, so we reject the null hypothesis and decide to implement the new (experiment) homepage.
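As a rough cross-check on the bootstrap result, the same one-sided hypothesis can also be tested analytically with a two-proportion z-test. This is a different technique from the one used above, shown only as a sketch. The group sizes come from the text; the click counts are back-computed from the reported CTRs, which assumes that the group sizes are the CTR denominators.

```python
from math import sqrt
from statistics import NormalDist

# Group sizes are from the text; click counts are back-computed from the
# reported CTRs (assumption: the group sizes are the CTR denominators).
n_control, n_experiment = 3332, 2996
clicks_control = round(0.27971 * n_control)
clicks_experiment = round(0.30975 * n_experiment)

p_control = clicks_control / n_control
p_experiment = clicks_experiment / n_experiment

# Pooled proportion under H0 (no difference between the groups)
p_pooled = (clicks_control + clicks_experiment) / (n_control + n_experiment)
std_err = sqrt(p_pooled * (1 - p_pooled) * (1 / n_control + 1 / n_experiment))

z_stat = (p_experiment - p_control) / std_err
p_value_z = 1 - NormalDist().cdf(z_stat)  # one-sided: H1 is CTR_new > CTR_old

print(f"z = {z_stat:.3f}, one-sided p-value = {p_value_z:.4f}")
```

If the analytic p-value and the bootstrap p-value lead to the same decision at the chosen alpha, that adds some confidence that the conclusion does not depend on the simulation details.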

In the next experiment, we changed the course overview page to a more career-focused description, hoping that this change would encourage more users to enroll in and complete the course.

Four metrics were defined and used:

- **Enrollment Rate**: Click-through rate for the Enroll button on the course overview page.
- **Average Reading Duration**: Average number of seconds spent on the course overview page.
- **Average Classroom Time**: Average number of days spent in the classroom for students enrolled in the course.
- **Completion Rate**: Course completion rate for students enrolled in the course.

We can use the same techniques as the parts above: calculate metric values for both control and experiment groups, calculate the observed difference, bootstrap the sample and simulate a distribution, simulate the distribution under the null hypothesis and calculate the p-value, and make decisions.

After analysis, we got the statistics below:

```
# p-values:
Enrollment Rate: 0.0223
Average Reading Duration: 0.0000
Average Classroom Time: 0.0374
Completion Rate: 0.0861
```

However, the more metrics you evaluate, the more likely you are to observe significant differences just by chance. One approach that addresses this problem is the **Bonferroni Correction**. If our original alpha value was 0.05, then with four tests our Bonferroni-corrected alpha value is `0.05/4=0.0125`. Given our p-values, we can conclude that only the difference in **Average Reading Duration** is statistically significant.
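The correction can be applied to the four p-values listed above in a few lines. This is a minimal sketch; the dictionary layout and variable names are chosen for illustration only.

```python
alpha = 0.05
p_values = {
    "Enrollment Rate": 0.0223,
    "Average Reading Duration": 0.0000,
    "Average Classroom Time": 0.0374,
    "Completion Rate": 0.0861,
}

# Bonferroni: divide the original alpha by the number of tests
bonferroni_alpha = alpha / len(p_values)

# A metric is significant only if its p-value is below the corrected alpha
significant = {name: p < bonferroni_alpha for name, p in p_values.items()}
for name, is_sig in significant.items():
    print(f"{name}: p = {p_values[name]:.4f}, significant = {is_sig}")
```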

**Difficulties in A/B testing**

There are many factors to consider when designing an A/B test and drawing conclusions based on its results. Some common ones are:

- Novelty effect and change aversion when existing users first experience a change.
- Sufficient traffic and conversions to have significant and repeatable results.
- Consistency among test subjects in the control and treatment groups.
- The best metric choice for making the ultimate decision, e.g. measuring revenue vs. clicks.
- The practical significance of a change in conversion rate, i.e. the cost of launching a new feature vs. the gain from the increase in conversion.
- Long enough run time for the experiment to account for changes in behavior based on time of day/week or seasonal events.
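To get a feel for what "sufficient traffic" means, the sketch below uses the standard normal-approximation formula for the sample size of a one-sided two-proportion test. The function name, the 28% baseline rate, the 3-point minimum lift, and the 80% power are all hypothetical inputs chosen for illustration, not values from the case study.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_base, min_lift, alpha=0.05, power=0.8):
    """Approximate users needed per group for a one-sided two-proportion test."""
    p_new = p_base + min_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)      # quantile for the desired power
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    return ceil((z_alpha + z_power) ** 2 * variance / min_lift ** 2)

# Hypothetical inputs: 28% baseline CTR, smallest lift worth detecting is 3 points
n_required = sample_size_per_group(p_base=0.28, min_lift=0.03)
print(n_required)
```

The required sample size grows quickly as the minimum lift shrinks, which is why small expected effects need much more traffic.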

To keep your online experiment from failing, it’s important to follow the specified guidelines and to patiently work through every step needed for a well-prepared and well-executed A/B experiment. The rest of this article presents common problems and pitfalls of A/B testing, together with their corresponding solutions.

Confounding effects arise when factors other than the treatment also influence the outcome. It’s important to ensure that all other known factors that have an impact on the dependent variable are held constant, so you need to control for as many unwanted or unequal factors (also called *extraneous variables*) as possible. Extraneous variables matter when they are associated with both the independent and the dependent variables. One special and extreme case occurs when the relationship between the independent and dependent variable completely changes or inverts once certain additional variables are taken into account; this is often referred to as **Simpson’s Paradox**.

Controlling for these effects matters because confounding makes effects due to factors other than the treatment appear to result from the treatment, and so threatens the Internal Validity of your A/B experiment. Assigning units to treatments at random tends to mitigate confounding. The following solutions might help you to avoid this problem.

- Control of confounding variables
- Reliable instruments (IV or 2SLS estimation)
- Appropriate choice of independent and dependent variables
- Generation of a random sample
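To make Simpson’s Paradox concrete, here is a minimal sketch with made-up numbers. The device segments and all counts are hypothetical; they are constructed so that the experiment group wins inside every segment but loses once the segments are pooled, because the two groups have very different device mixes.

```python
# Hypothetical (users, conversions) per group and device segment
data = {
    "control":    {"desktop": (800, 240), "mobile": (200, 20)},
    "experiment": {"desktop": (200, 70),  "mobile": (800, 120)},
}

# Conversion rate within each segment
segment_rates = {
    group: {seg: conv / users for seg, (users, conv) in segments.items()}
    for group, segments in data.items()
}

# Conversion rate after pooling the segments together
overall_rates = {
    group: sum(conv for _, conv in segments.values())
    / sum(users for users, _ in segments.values())
    for group, segments in data.items()
}

print(segment_rates)
print(overall_rates)
```

With random assignment the device mix would be roughly the same in both groups, and the reversal would not appear.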

Selection bias is the next pitfall. One of the fundamental assumptions of A/B testing is that your sample is unbiased, meaning every type of user has an equal probability of being included in it. If by some error you have excluded a specific part of the population (e.g. estimating the average weight in the USA by sampling only one state), then we call this Selection Bias.

To check whether your sample is biased when you know the true population distribution, create B bootstrapped samples from your sample and plot the distribution of their means. If this distribution is not centered around the true population mean, then your sample is biased and you should use a more robust sampling technique to draw an unbiased random sample.
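The bootstrap check described above can be sketched as follows. The population, its two subgroups, and the sample sizes are hypothetical and simulated; the sketch compares a bootstrap percentile interval of the sample mean with the known population mean.

```python
import random
from statistics import fmean

random.seed(0)

# Hypothetical population made of two subgroups with different means
group_a = [random.gauss(70, 10) for _ in range(5000)]
group_b = [random.gauss(90, 10) for _ in range(5000)]
population = group_a + group_b
true_mean = fmean(population)

def bootstrap_interval(sample, n_boot=1000, level=0.95):
    """Percentile interval of the bootstrapped sample means."""
    means = sorted(
        fmean(random.choices(sample, k=len(sample))) for _ in range(n_boot)
    )
    lower = means[int((1 - level) / 2 * n_boot)]
    upper = means[int((1 + level) / 2 * n_boot) - 1]
    return lower, upper

random_sample = random.sample(population, 500)  # every unit equally likely
biased_sample = random.sample(group_a, 500)     # one subgroup excluded by mistake

for name, sample in [("random", random_sample), ("biased", biased_sample)]:
    low, high = bootstrap_interval(sample)
    print(name, round(low, 1), round(high, 1), low <= true_mean <= high)
```

When the interval of bootstrapped means sits far away from the true mean, as it does for the sample that excludes a subgroup, the sample is likely biased.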

Systematic bias relates to the way one measures the impact of the treatment (a new version of the product or a feature). Are you systematically making errors when measuring it? This type of error always affects measurements by the same amount or by the same proportion, given that a reading is taken the same way each time, and hence it is predictable. Unlike random error, which mainly affects the precision of the estimates, systematic error affects the *accuracy* of the results.

A common mistake in an A/B experiment is to stop the experiment early, as soon as you observe a statistically significant result (e.g., a small p-value). The significance level and all other test parameters are fixed in advance during the Power Analysis stage, and they assume that the experiment will run until the minimum sample size is reached.

P-hacking or early stopping affects the Internal Validity of the results, biases them, and inflates the rate of false positives.
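A small simulation shows why peeking inflates false positives. The sketch below runs simulated A/A tests, where both groups share the same true click rate, so any "significant" result is a false positive. The number of simulations, the group size, the peeking interval, and the 30% rate are hypothetical settings, and a two-sided two-proportion z-test stands in for whatever test the experiment would actually use.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(1)
Z_CRIT = NormalDist().inv_cdf(0.975)  # two-sided test at alpha = 0.05

def is_significant(clicks_a, clicks_b, n):
    """Two-proportion z-test for two groups of n users each."""
    pooled = (clicks_a + clicks_b) / (2 * n)
    if pooled in (0, 1):
        return False
    std_err = sqrt(pooled * (1 - pooled) * 2 / n)
    return abs(clicks_a - clicks_b) / n / std_err > Z_CRIT

def simulate(n_sims=1000, n_users=1000, peek_every=100, rate=0.3):
    stopped_early = fixed_horizon = 0
    for _ in range(n_sims):
        clicks_a = clicks_b = 0
        significant_at_any_peek = False
        for i in range(1, n_users + 1):
            clicks_a += random.random() < rate  # both groups share the same
            clicks_b += random.random() < rate  # true rate, so H0 holds
            if i % peek_every == 0 and is_significant(clicks_a, clicks_b, i):
                significant_at_any_peek = True  # an impatient analyst stops here
        stopped_early += significant_at_any_peek
        fixed_horizon += is_significant(clicks_a, clicks_b, n_users)
    return stopped_early / n_sims, fixed_horizon / n_sims

peeking_rate, fixed_rate = simulate()
print(f"False positive rate with peeking: {peeking_rate:.3f}")
print(f"False positive rate at fixed sample size: {fixed_rate:.3f}")
```

Because the final look is one of the peeks, the peeking rate can never be lower than the fixed-horizon rate, and in practice it is noticeably higher than the nominal alpha.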

Spillover or network effects usually occur when an A/B test is performed on social media platforms such as Facebook, Instagram, and TikTok, but also in other products where users in the experimental and control groups are connected, for example because they are in the same group or community, and influence each other’s response to the experimental and control product versions. This leads to biased results and wrong conclusions, since the two groups are no longer independent of each other.

To detect Network Effects, you can perform *Stratified Sampling* and divide the result into two groups. Then run one A/B test that takes the clustered samples into account and another that does not. If the treatment effects differ, then there is a Network Effect problem.

When you test a significant change to the product, users might at first try it out just out of curiosity, even if the feature is not actually better than the control/current version. This is called the *Novelty Effect*, and it affects the internal validity of your results. Conversely, a new feature (the experimental product version) might hurt the overall user experience and cause some users to churn because they don’t like the new version. This phenomenon is often referred to as *Change Aversion*.

One of the most popular ways to check for the Novelty Effect is to segment users into *new vs. returning*. If the feature is liked by returning users but not by new users, then you are most likely dealing with a Novelty Effect.
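The segmentation check can be sketched by computing the lift separately for each user segment. All counts below are hypothetical and chosen so that the lift is concentrated among returning users.

```python
# Hypothetical (users, conversions) per user segment and group
results = {
    "new":       {"control": (2000, 200), "experiment": (2000, 204)},
    "returning": {"control": (3000, 300), "experiment": (3000, 390)},
}

def lift(segment):
    """Experiment minus control conversion rate for one user segment."""
    (n_c, conv_c), (n_e, conv_e) = segment["control"], segment["experiment"]
    return conv_e / n_e - conv_c / n_c

lifts = {name: lift(segment) for name, segment in results.items()}
for name, value in lifts.items():
    print(f"{name}: lift = {value:.3f}")
```

A lift that shows up almost only among returning users, as in these made-up numbers, is the pattern that points to a Novelty Effect.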

If the split between the control and experimental groups looks suspicious, that is, noticeably more users were assigned to one group than the design intended, then you can perform a Chi-square test. This test will help you to formally check for Sample Ratio Mismatch. You can read more about this test here.
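As an illustration, the check can be run on the group sizes reported for the homepage experiment. The intended split is not stated in this article, so the 50/50 design below is an assumption. With only two groups the test has one degree of freedom, which lets the p-value be computed with the standard library; `scipy.stats.chisquare` is the more general option.

```python
from math import erfc, sqrt

observed = {"control": 3332, "experiment": 2996}      # group sizes from the text
expected_share = {"control": 0.5, "experiment": 0.5}  # assumed design split

total = sum(observed.values())
chi2 = sum(
    (observed[g] - expected_share[g] * total) ** 2 / (expected_share[g] * total)
    for g in observed
)

# With two groups there is 1 degree of freedom, where the chi-square
# survival function reduces to erfc(sqrt(chi2 / 2)).
srm_p_value = erfc(sqrt(chi2 / 2))

print(f"chi2 = {chi2:.2f}, p-value = {srm_p_value:.6f}")
```

A very small p-value means the observed split is unlikely under the assumed design, which would call for a closer look at the assignment process before trusting the test results.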

Another common mistake in A/B testing is an inadequate choice of test period. As mentioned earlier, one of the fundamental assumptions of A/B testing is that every type of user has an equal probability of being included in the sample. However, if your test period doesn’t take into account holidays, seasonality, weekends, and other relevant events, then the different types of users (for example weekend shoppers or holiday shoppers) no longer have the same probability of being selected. For example, running a test on Sunday morning is different from running the same test on Tuesday at 11 pm.

Running too many tests at the same time is a problem as well. When you test more than one experimental variant of your product at once, so that more than two versions are compared, you can no longer use the same significance level to test for statistical significance. The significance level that the p-values are compared against needs to be adjusted.

To conduct a “successful” A/B test, there’s so much more to consider.