PM 101: Pitfalls of A/B Testing

A/B testing is a powerful tool—if used correctly

Jens-Fabian Goetzmann

9 June 2019

In the “PM 101” series I am sharing some PM basics that I wish I had known when I first started as a product manager. In this article, I am discussing some of the challenges of correctly running A/B tests.

A/B testing is one of the most powerful methods to validate that a product change improves the KPIs that measure customer or business value. Based on a randomized allocation of users to the control or test/“treatment” experience, it is able to determine with a high degree of certainty whether KPIs improved. Since as product managers we want to ensure that the changes we make to the product actually improve customer and business outcomes, it is nowadays used extremely frequently by product teams of all sizes and maturities.

Running A/B tests correctly requires a bit more diligence than just hooking up the A/B testing tool of your choice. In this article, I will discuss some pitfalls ranging from pretty obvious to rather subtle. Some of these can be prevented by the right tooling, but some are a matter of mindset and process rather than just understanding the numbers correctly.

The pitfalls I will dig into are the following:

Not having a real hypothesis
Looking at too many metrics
Using feature level metrics
Not having enough sample size
Peeking before reaching sample size
Changing allocation during the test
Not learning from failed tests
Using A/B testing as the only validation method

No real hypothesis

The worst way of running an A/B test is to implement a change, roll it out as an A/B test, and then just “see what happens”. This is bad for all sorts of reasons; most notably, if you look at enough potential success metrics, some of them are likely to show statistically significant improvements purely by chance. What is even more concerning, though, is that you will not learn anything beyond the very limited scope of the product change itself from an A/B test that is run this way.

A good A/B test has a clear hypothesis. A hypothesis can take different forms, and you can find many templates. A good hypothesis should include at least the following:

A clear problem statement that you are trying to solve
The change you are making
The hypothesized impact on user behavior
The way to measure this impact (the key metric that is predicted to improve)

An example hypothesis containing all the above might be (for 8fit, a workout and nutrition app):

We are trying to solve the problem that many users never even try their first workout, and therefore do not realize the value that working out with 8fit might deliver. By automatically enrolling new users in a workout program matching their fitness level and suggesting they start their first workout right after sign-up, new users will realize the value of 8fit workouts and continue working out with 8fit, leading to an increase in second week workout retention.

Too many metrics

One point that is already included in the elements above but is worth calling out separately: An A/B test should have a single metric that you are basing the success of the test on. Looking at multiple metrics makes everything more complex: Firstly, the probability of false positives increases when looking at multiple metrics. Secondly, if some of the multiple metrics move in different directions, the trade-offs are unclear and A/B testing alone cannot decide those trade-offs.

Even if you have a single success metric, it is often tempting to look at secondary metrics to ensure that the experiment doesn't have unintentional side effects. Looking at secondary metrics in and of itself is not an issue, but you shouldn't start calculating statistical significance for them (unless you lower the significance threshold, i.e. the p-value below which a result is called significant, across the board). Again, calculating significance for multiple metrics increases the risk of false positives (meaning some metric movements are labelled statistically significant despite only being due to random fluctuations).
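To get a feeling for how quickly the false positive risk grows with the number of metrics, here is a minimal sketch. It assumes the metrics are statistically independent (real product metrics usually are not, so treat the numbers as a rough guide) and uses the Bonferroni correction as one common way of lowering the threshold across the board. The function names are made up for this illustration.

```python
def family_wise_error_rate(alpha: float, num_metrics: int) -> float:
    """Chance of at least one false positive when testing num_metrics
    independent metrics, each at significance threshold alpha."""
    return 1 - (1 - alpha) ** num_metrics


def bonferroni_threshold(alpha: float, num_metrics: int) -> float:
    """Per-metric threshold that keeps the overall false positive
    risk at or below alpha (Bonferroni correction)."""
    return alpha / num_metrics


for num_metrics in (1, 5, 10, 20):
    uncorrected = family_wise_error_rate(0.05, num_metrics)
    threshold = bonferroni_threshold(0.05, num_metrics)
    corrected = family_wise_error_rate(threshold, num_metrics)
    print(f"{num_metrics:>2} metrics: "
          f"risk at p<0.05 = {uncorrected:.1%}, "
          f"corrected threshold = {threshold:.4f}, "
          f"risk after correction = {corrected:.1%}")
```

Under these assumptions, ten metrics tested at the usual 5% threshold already give roughly a 40% chance that at least one of them looks significant by accident.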

Using feature level metrics

Even if your hypothesis contains all the elements above, and you are only using a single success metric, it might still not be a good hypothesis. The pitfall that many inexperienced PMs fall for is using feature level metrics in A/B tests. Feature level metrics are typically metrics that measure the usage of a feature. For example, in a communication app, the following hypothesis contains a feature level metric: “If we include a floating button to compose a new message on the home screen, more new users will compose a message”. This hypothesis seems valid, but it has big problems: It is guaranteed that some users in the test group will tap this button, and if even a tiny fraction of them ends up sending a message, the test experience will win the A/B test. The hypothesis above is a glorified version of the very bad (and almost always true) hypothesis “If we create a new button, some people will click it” (or equally bad “If we make the button bigger, more people will click it”).

What you really want to measure are product level metrics, like user retention or engagement. Yes, more users sent their first message, but did it change anything about their behavior afterwards? Are they more likely to do other actions in the app? Are they more likely to come back?

The slightly more subtle variant of this pitfall is that some product level metrics might be so directly impacted by the change that they need to be tweaked in order to measure sustainable longer-term impact. In the communications tool example above, “total number of messages sent” might be a valid product level engagement metric, but to measure whether the floating button has any sustainable impact on new users' behavior, we might have to filter out, for example, the messages they sent on their first day. Similarly, in the 8fit example hypothesis given above, the success metric was not “workout completion rate” or something similar, because that is bound to go up when we are pushing people into their first workout. Instead, we are measuring whether the change has a longer term impact by measuring second week workout retention (i.e., the proportion of users that come back in their second week to do a workout).
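As an illustration of such a longer-term metric, here is a rough sketch of how second week workout retention could be computed. The data layout (a dict of sign-up dates and a list of workout events) and the definition of “second week” as days 7 to 13 after sign-up are assumptions made for this example, not how 8fit actually defines or stores it.

```python
from datetime import date


def second_week_workout_retention(
    signup_dates: dict[str, date],
    workouts: list[tuple[str, date]],
) -> float:
    """Share of signed-up users with at least one workout
    on days 7..13 after their sign-up (assumed definition)."""
    retained_users = {
        user
        for user, workout_day in workouts
        if user in signup_dates
        and 7 <= (workout_day - signup_dates[user]).days <= 13
    }
    return len(retained_users) / len(signup_dates)


# Hypothetical data: four new users and their workouts.
signups = {
    "a": date(2019, 6, 1),
    "b": date(2019, 6, 1),
    "c": date(2019, 6, 2),
    "d": date(2019, 6, 3),
}
workout_events = [
    ("a", date(2019, 6, 1)),   # first-day workout, does not count
    ("a", date(2019, 6, 9)),   # day 8, counts
    ("b", date(2019, 6, 2)),   # first week only
    ("c", date(2019, 6, 15)),  # day 13, counts
    ("d", date(2019, 6, 20)),  # day 17, too late
]
retention = second_week_workout_retention(signups, workout_events)
print(f"Second week workout retention: {retention:.0%}")
```

Note that the first workout right after sign-up, which the change pushes users into, does not influence this metric at all.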

Not enough sample size

When you have set up a proper hypothesis, you can start thinking about the required sample size, i.e. the number of users that will have to be exposed to your experience to be able to measure a statistically significant improvement.

That required sample size is not something you make up. It is a function of the type of metric, the baseline level, and the improvement that you want to be able to detect, and you can find calculators online (or your data scientist or analyst can help you calculate it). In general, the smaller the improvement you want to detect with confidence, the larger your required sample size will need to be. Small tweaks will often not yield very large improvements to the target metric, so the sample size required to detect those small changes might end up being quite large.
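For a conversion-type metric, a back-of-the-envelope version of what those calculators do might look like the sketch below. It uses the common normal-approximation formula for comparing two proportions; the 5% significance level and 80% power are conventional defaults assumed here, and the function name is made up for this example. Your tool or analyst may use a slightly different formula.

```python
import math
from statistics import NormalDist


def required_sample_size_per_group(
    baseline_rate: float,
    relative_lift: float,
    alpha: float = 0.05,   # assumed two-sided significance level
    power: float = 0.80,   # assumed chance of detecting a real lift
) -> int:
    """Approximate users needed in EACH group to detect the given
    relative lift in a conversion rate (normal approximation)."""
    p_control = baseline_rate
    p_test = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_test * (1 - p_test)
    effect = p_test - p_control
    return math.ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)


# Baseline conversion of 10%, different relative lifts to detect.
for lift in (0.20, 0.10, 0.05):
    n = required_sample_size_per_group(0.10, lift)
    print(f"Detecting a {lift:.0%} relative lift needs about {n:,} users per group")
```

Halving the lift you want to detect roughly quadruples the required sample size, which is why small tweaks are so expensive to test.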

If you do not wait until you have reached the required sample size, you will often fail to see statistically significant results, even if there was indeed a lift of the expected magnitude. Sometimes, however, waiting is impossible since the feature does not get enough traffic to reach sample size in a reasonable amount of time. In those cases, the simple takeaway is: you can't run a proper A/B test here.

Peeking

Once you have determined the required sample size, you need to wait until that sample size is reached to analyze the results and determine statistical significance. If you “peek” earlier and call the experiment whenever a statistically significant result shows, you are severely increasing the risk of false positives, because the target metric fluctuates over time, and those fluctuations are noise and not signal.
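The effect of peeking can be made visible with a small simulation. The sketch below runs many “A/A tests” in which both groups have exactly the same conversion rate, so every significant result is a false positive. It compares calling the test at the first significant peek with only checking once at the end. All parameters (10% conversion rate, 2000 users per group, 20 peeks, a simple two-proportion z-test) are assumptions chosen for illustration.

```python
import math
import random

Z_CRITICAL = 1.96  # two-sided threshold for p < 0.05


def z_statistic(conversions_a: int, conversions_b: int, users_per_group: int) -> float:
    """Two-proportion z-statistic for equally sized groups."""
    pooled = (conversions_a + conversions_b) / (2 * users_per_group)
    if pooled in (0.0, 1.0):
        return 0.0
    std_error = math.sqrt(2 * pooled * (1 - pooled) / users_per_group)
    return (conversions_b - conversions_a) / users_per_group / std_error


def simulate_aa_tests(num_experiments=500, users_per_group=2000,
                      num_peeks=20, rate=0.10, seed=42):
    """Return (false positive rate with peeking, rate with one final look)."""
    rng = random.Random(seed)
    batch = users_per_group // num_peeks
    peeking_hits = 0
    final_hits = 0
    for _ in range(num_experiments):
        conv_a = conv_b = 0
        flagged_at_any_peek = False
        for peek in range(1, num_peeks + 1):
            # Both groups draw from the same rate: there is no real effect.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            significant = abs(z_statistic(conv_a, conv_b, peek * batch)) > Z_CRITICAL
            flagged_at_any_peek = flagged_at_any_peek or significant
        peeking_hits += flagged_at_any_peek
        final_hits += significant  # result of the last look only
    return peeking_hits / num_experiments, final_hits / num_experiments


peeking_rate, final_rate = simulate_aa_tests()
print(f"False positives when stopping at the first significant peek: {peeking_rate:.1%}")
print(f"False positives when only checking at the end: {final_rate:.1%}")
```

With a single look at the end, the false positive rate should land near the nominal 5%; with repeated peeking it comes out several times higher, even though nothing about the product differs between the groups.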

Of course, it's entirely fine to look at the results before the sample size is reached to see where the data is trending, or to limit the downside risk if the test experience is completely tanking metrics, but you still shouldn't calculate statistical significance in those cases.

Changing allocation during the test

An A/B test in which 50% of the population is allocated to the control group and 50% to the test group is fine. An A/B test in which 90% of the population is allocated to the control group and 10% to the test group is also fine. What is not fine is starting a test at 90/10 (for example, to rule out any extremely negative impact) and then later changing to 50/50.

Changing allocation during the test may lead to wrong results due to Simpson's paradox, which states that a trend appearing in several different groups of data may disappear or reverse when these groups are combined. To make that a bit more tangible, consider the following case: In the first period, 1000 users were eligible for the experiment, and 90% of them (900 users) were enrolled in the control group and 10% (100 users) in the test group. The conversion rate, which was the metric being measured, was 15% in the control group and 16% in the test group. For the second period, the enrollment was changed to 50/50, and another 1000 users were eligible for the experiment (500 per group). In the second period, the control group saw a conversion rate of 10% (for example, due to seasonality, or some other changes to the product), and the test group was still higher than control at 11%. When the two periods are combined, we see an aggregated conversion rate of 13.2% in control (185 of 1400 users) and 11.8% in test (71 of 600 users). Despite having a higher conversion rate in both periods, the test group seems to convert worse in aggregate, because most control users come from the high-converting first period while most test users come from the low-converting second period.
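The arithmetic of this example is easy to reproduce. The following sketch simply recomputes the aggregated conversion rates from the per-period numbers given above; the data structure and function name are made up for this illustration.

```python
# Per-period enrollment and conversion rates from the example above.
periods = [
    {"control_users": 900, "control_rate": 0.15, "test_users": 100, "test_rate": 0.16},
    {"control_users": 500, "control_rate": 0.10, "test_users": 500, "test_rate": 0.11},
]


def aggregate_rate(periods: list[dict], group: str) -> float:
    """Overall conversion rate of a group across all periods."""
    users = sum(p[f"{group}_users"] for p in periods)
    conversions = sum(p[f"{group}_users"] * p[f"{group}_rate"] for p in periods)
    return conversions / users


for number, period in enumerate(periods, start=1):
    print(f"Period {number}: control {period['control_rate']:.1%}, "
          f"test {period['test_rate']:.1%}")

control_total = aggregate_rate(periods, "control")
test_total = aggregate_rate(periods, "test")
print(f"Combined: control {control_total:.1%}, test {test_total:.1%}")
```

The test group wins in each period but loses in the combined view, purely because the two groups are weighted differently across the periods.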

Therefore, in cases in which you want a slow roll-out at first and later increase the percentage, you should only start the proper A/B test once you have increased the percentage.

No learning in case of failure

Product management guru Marty Cagan says that there are two “inconvenient truths about product”:

The first such truth is that at least half of our ideas are just not going to work. [...] If that’s not bad enough, the second inconvenient truth is that even with the ideas that do prove to be valuable, usable and feasible, it typically takes several iterations to get the implementation of this idea to the point where it actually delivers the expected business value.

These truths have many implications, but specifically for A/B testing, they mean that we should expect most of our A/B tests to have negative results (i.e., the hypothesis is not validated). Since every A/B test consumes valuable time and resources, it is paramount that we learn from every single A/B test, especially the ones that come up unsuccessful. And you don't just want to learn anything (you will always learn “this specific UX didn't perform measurably better than control”), you want useful information that helps future improvement efforts.

How can we ensure that? Firstly, it is a question of mindset: if you go into the A/B test design with the expectation that there's a high chance it will fail, you are more likely to formulate the hypothesis and design the experiment in a way that will yield useful learnings. Secondly, as mentioned in the section on hypotheses, if you base your hypothesis on assumptions about user behavior and preferences, invalidating the hypothesis allows you to learn about these behaviors and preferences as well. Lastly, it is about ensuring that all A/B tests are based on as few assumptions as possible—ideally, only testing one assumption at a time. If an A/B test is based on multiple assumptions, then invalidating it can mean that one or more of those assumptions was wrong, but you don't know which. If it's just one, it's clear that that particular assumption was wrong.

A/B testing as the only validation method

As discussed above, most of our improvement ideas end up not working or needing multiple iterations to work out. If we are only using A/B testing to validate, it means we always have to build out the full solution before we learn whether the idea has any merit. We should therefore employ cheaper means of testing (qualitatively) before the A/B test, for example, by showing users prototypes of our solution and watching them interact with it.

Moreover, particularly innovative solutions are very hard to test using A/B testing. If the innovative solution solves a new problem, it may require new and different success metrics, and even if the existing metrics still apply, recall the inconvenient truth that “it typically takes several iterations” to deliver the expected business value—if the first iteration fails the A/B test, how do you know that this is an idea that is going to work out eventually after several iterations vs. an idea that was bad to begin with?

An extreme situation arises when product teams proclaim “we A/B test everything”—this is a trap since it means the team is prioritizing optimization over innovation, which is much harder to A/B test.

In summary, A/B testing is an extremely powerful tool. If wielded correctly, it can enable learning with a much higher degree of confidence than many other validation methods. However, the pitfalls above have real risks and should be taken into account by any product team running A/B tests.
