Are You Misinterpreting Your A/B Test Results?

Once you have a process, know how to formulate a hypothesis, set up your A/B tests and know when to press stop, you’re all good, right? You’re on top of your A/B Testing!

Nope.

Making sure you’re correctly interpreting your results is just as important. A/B testing is about learning and making informed decisions based on your test results.

So let’s make sure we don’t mess that up!

In this article, the fourth in our series on A/B Testing Mistakes, we’ll look into the ways you might be misinterpreting your test results.

If you want all 10,000 words at once NOW, you can download our ebook with all the articles.

You don’t know about false positives

Are you aware that there are actually 4 outcomes to an A/B Test?

What do you mean, it’s either a win or a loss, no?

Nope, it can be:

  • False positive (you detect a winner when there isn’t one)
  • False negative (you don’t detect a winner when there is one)
  • No difference between A & B (inconclusive)
  • Win (either A or B converts more)

(If you’re a bit hardcore and want to know more about this, check out hypothesis testing. It’s the actual mathematical method used for (frequentist) A/B Testing.)

Why should you care?

[Image: Willy Wonka sarcastically saying he wants to hear more about false positives]

Because you could have been interpreting false positives as genuine wins. And invested money in them.
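To see how easily that happens, here’s a minimal simulation in Python (a sketch, not what any testing tool actually runs): it uses a made-up 5% conversion rate and a simple two-proportion z-test at 95% significance, and runs A/A tests where both “variations” are identical. A winner still gets declared in roughly 5% of them.

```python
import math
import random

def significant_at_95(conv_a, n_a, conv_b, n_b):
    """Simple two-proportion z-test at 95% (two-sided) significance."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return se > 0 and abs(p_a - p_b) / se > 1.96  # 1.96 = 95% two-sided threshold

random.seed(1)
runs, visitors, rate = 1_000, 2_000, 0.05  # made-up traffic and conversion rate

false_positives = 0
for _ in range(runs):
    # A/A test: both sides are the exact same page with the exact same 5% rate.
    conv_a = sum(random.random() < rate for _ in range(visitors))
    conv_b = sum(random.random() < rate for _ in range(visitors))
    if significant_at_95(conv_a, visitors, conv_b, visitors):
        false_positives += 1

print(f"A 'winner' was declared in {false_positives / runs:.1%} of {runs} identical A/A tests")
# Expect something close to 5%: exactly the false positive risk you accept at 95% significance.
```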

Take the well-known example from Google: 41 shades of blue. (Nope, it has nothing to do with the books.)

Doug Bowman, Google’s lead designer at the time, actually left the company over this (though mostly for design reasons):

“Yes, it’s true that a team at Google couldn’t decide between two blues, so they’re testing 41 shades between each blue to see which one performs better. I had a recent debate over whether a border should be 3, 4, or 5 pixels wide, and was asked to prove my case. I can’t operate in an environment like that. I’ve grown tired of debating such minuscule design decisions…” (You can read the full article here).

Whether or not you agree with him from a design standpoint, this kind of test can also be mathematically wrong depending on how you run it.

You have two ways of approaching this:

  1. You do “cascade testing”, i.e. A vs B, then B vs C, then C vs D, … THIS IS BAD, DON’T DO IT. We’ll see why in a second.
  2. You do A/B/n testing, meaning you test all variations in parallel.

Cascade Testing

Imagine you want to test a different headline for a product page. You have your current one (A) against the new one (B). B wins, but your boss doesn’t like the wording and wants you to try a slightly different version. Then you feel like you could do better and change it again. And again.

You end up testing 10 different variations of this headline. Why is that a problem?

Let’s take a look: A vs B gave B as the winner with 95% statistical significance. As we saw in a previous article, it means that there is a 5% chance this result is a complete fluke or a “false positive”.

Then you tested a third headline, B vs C. C also won with 95% significance. The problem is that the false positive risk compounds across tests: after two tests in a row, the chance that at least one of your “winners” is a fluke is already close to 10% (1 − 0.95² ≈ 9.8%).

After 10 tests on your headline (C vs D, D vs E, …), even with 95% significance on your tenth test, you actually have a 40% chance of your winner being a false positive! (For 41 variations it becomes 88%!!!)

You’d be flipping a coin. Or deliberately shooting yourself in the foot, depending on how many times you repeat this.
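If you want the back-of-the-envelope math behind those numbers, here’s a quick sketch (it assumes each test in the cascade is run at 95% significance and is independent of the previous ones):

```python
# At 95% significance, each individual test carries a 5% false positive risk.
# Chain n tests and the chance that at least one "winner" was a fluke compounds.
alpha = 0.05  # false positive risk per test

for n_tests in (1, 2, 10, 41):
    risk = 1 - (1 - alpha) ** n_tests
    print(f"{n_tests:>2} chained tests -> {risk:.0%} chance at least one winner is a fluke")

# 1 -> 5%, 2 -> 10%, 10 -> 40%, 41 -> 88%
```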

Don’t do cascade testing. Just don’t. Okay? Kittens will die if you do. Look at him, we don’t want that, do we?

[Image: a cute kitten that will die if you do cascade testing]

A/B/n Testing

A/B/n testing is when you test n variations against your control (A) instead of just one (B). You run your control A against variations B, C, D, E, F, etc. at the same time, under the same conditions.

This is absolutely fine. BUT, as we saw in our article on when to stop your A/B tests, you need at least 300 conversions PER variation to call the test off.

In our Google example, you would need 41 × 300 = 12,300 conversions. That’s a lot.

If you have Google-like traffic, that’s okay. If you’re a mere mortal like the rest of us, though, it’s a big fat waste of time.
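How big a waste? Here’s a rough, hypothetical estimate reusing the 300-conversions-per-variation rule of thumb; the traffic and conversion rate below are made-up numbers, so plug in your own:

```python
def abn_test_duration_days(n_variations, daily_visitors, conversion_rate,
                           min_conversions_per_variation=300):
    """Rough number of days before every variation reaches the conversion threshold,
    assuming traffic is split evenly and conversion rates are roughly similar."""
    total_conversions_needed = n_variations * min_conversions_per_variation
    daily_conversions = daily_visitors * conversion_rate
    return total_conversions_needed / daily_conversions

# 41 shades of blue, 5,000 visitors a day, 2% conversion rate.
print(f"{abn_test_duration_days(41, 5_000, 0.02):.0f} days")  # ~123 days
# A more modest A/B/C test on the same traffic.
print(f"{abn_test_duration_days(3, 5_000, 0.02):.0f} days")   # ~9 days
```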

You could even end up testing for too long and get skewed results. This kind of test is rarely needed and can often be avoided entirely with a better hypothesis.

You’re not checking your segments

Don’t make Avinash Kaushik sad (one of the fathers of web analytics, if you’re wondering).

He has a rule: “Never report a metric without segmenting it to give deep insights into what that metric is really hiding behind it.”

Most data you get from your analytics tool is aggregated data. It takes all your traffic and mashes it into pretty but not very actionable charts.

Your website serves a number of functions, and your visitors come with different objectives in mind. Even when they come for the same reason, they probably don’t need the same content.

If you want an effective website, you can’t treat your traffic as a faceless blob; you need to segment.

[Image: a pile of mashed potato representing your data]

The same applies to your test results. If you don’t segment them, you could be wrongly dismissing tests.

An experiment could show your variation losing overall but winning on a particular segment.

Be sure to check your segments before closing the book on a test!

Important side note: when checking segments in an experiment’s results, remember that the same rules about statistical validity apply. Before declaring that your variation won on a particular segment, check that you have enough conversions and a large enough sample size within that segment.
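Here’s a minimal sketch of what that looks like in practice. All numbers are hypothetical, and the check uses the 300-conversions rule of thumb plus a simple two-proportion z-test at 95% significance, which is one common way to do it, not necessarily what your testing tool does:

```python
import math

def significant_at_95(conv_a, n_a, conv_b, n_b):
    """Simple two-proportion z-test at 95% (two-sided) significance."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return se > 0 and abs(p_a - p_b) / se > 1.96

# Hypothetical results: (conversions A, visitors A, conversions B, visitors B)
segments = {
    "overall":    (570, 10_000, 550, 10_000),
    "search":     (420,  8_000, 310,  8_000),
    "newsletter": (150,  2_000, 240,  2_000),
}

for name, (c_a, n_a, c_b, n_b) in segments.items():
    enough_conversions = min(c_a, c_b) >= 300  # the rule of thumb from our earlier article
    print(f"{name:>10}: A {c_a / n_a:.1%} vs B {c_b / n_b:.1%} "
          f"| enough conversions: {enough_conversions} "
          f"| significant: {significant_at_95(c_a, n_a, c_b, n_b)}")

# Overall the test looks inconclusive, B clearly loses on search traffic, and the
# very promising newsletter segment doesn't have enough conversions to trust yet.
```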

Here are three ways you can segment your data:

1. By source

Where do your visitors come from (Ads, Social Networks, Search Engines, Newsletter, …)?
Then you can look at things like: which pages they go to depending on where they come from, their bounce rate, differences in loyalty, whether they come back…

2. By behavior

What do they do on your website? People behave differently depending on their intent / needs.
You can ask: what content do people who visit your website 10+ times a month read, versus those who only come twice? What page did people who view 5+ pages per visit arrive on, versus people who only looked at one? Do they look at the same products or price range?

3. By outcome

Segment by the actions people took on your website: bought a product, subscribed to a newsletter, downloaded a premium resource, applied for a loyalty card, …

Make groups of visitors with similar outcomes and ask the same type of questions we asked above. You’ll see which campaigns worked, which products to kill, and so on.

By segmenting you get actionable data and accurate results. With actionable data and accurate results you can make informed decisions, and with informed decisions … $$$$!

[Image: Scrooge McDuck swimming in money because he made informed decisions thanks to A/B testing]

You’re testing too many variables at once

You got the message: you need to test high-impact changes.

So you change the CTA and the headline, add a video and a testimonial, and rewrite the text. Then you test it all against your current page. And it wins.

Good, right? Well… Not really.

How will you know which of your changes improved conversions on your page and which dragged them down?

This is where the question “How will I measure success?” takes on its full meaning. Testing is awesome, but if you can’t really measure what happened, what moved the needle, it’s not so useful.

What have you learned? That some combination of your changes improved conversions?

And what if one of those changes improved conversions while the others dragged them down enough for the variation to lose? You’d count the test as a failure when part of it actually worked.

Make sure to clearly specify what success looks like and that you’re set up to measure it.

If you can’t measure it, you won’t learn. If you don’t learn, you can’t reproduce or improve it.

Don’t test multiple variables at once.

Unless you know how to do multivariate testing; then it’s fine. But since it requires a gigantic amount of traffic, we rarely see it used.
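To give you a sense of why, here’s a back-of-the-envelope sketch, again reusing the 300-conversions-per-variation rule of thumb (the numbers of variants per element are made up):

```python
# Every element you vary multiplies the number of combinations you're really testing,
# and each combination still needs enough conversions on its own.
from math import prod

def multivariate_requirements(variants_per_element, min_conversions=300):
    """variants_per_element: e.g. [2, 3] means 2 CTAs x 3 headlines."""
    combinations = prod(variants_per_element)
    return combinations, combinations * min_conversions

print(multivariate_requirements([2, 2]))        # 2 CTAs x 2 headlines -> (4, 1200)
print(multivariate_requirements([2, 2, 3, 2]))  # add videos and testimonials -> (24, 7200)
```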

You give up on a test after it fails the first time

If you followed our guidelines on how to craft an informed hypothesis, each of your tests should be derived from one of the following (ideally a combination of several):

  • Web Analytics
  • Heatmaps
  • Usability tests
  • User interviews
  • Heuristic analysis

For example, you have analytics data showing people staying a while on your product page, then leaving.

[Image: a cute panda asking the visitors of his website not to leave]

You also have an on-page survey where visitors told you they weren’t quite convinced that your product answered their needs.

Your heuristic analysis showed you had clarity issues.

Click maps show people going through all your product pictures.

You then decide to test changing the copy and adding better pictures to make the page clearer.

Your test ends and … results are inconclusive, no increase in conversions.

What do you do now? Do you put a check mark in the “too bad” column, conclude that clarity wasn’t an issue after all, and move on to another test?

No, of course you don’t. A/B Testing is an iterative process. Take another look at your data, devise ways to improve your page.

  • You could add testimonials
  • You could remove information not relevant to the product
  • You could add a video

Since you now know not to do cascade testing (which is completely different from iterative testing, because you’re not pitting X versions of the same headline or picture against the winner of the previous test) and not to test everything at once, you can embrace iterative testing.

There isn’t just ONE solution to a given problem. There are an infinite number of them, and it could very well be a combination of several solutions.

[Image: Jeff Bezos quote about testing and being stubborn]

Let’s be a tad extreme to illustrate this.

When your internet cuts off, what do you do? If you’re connected through an Ethernet cable, maybe you try unplugging and re-plugging it.

If it doesn’t change anything, do you then conclude that your cable is dead, and go buy a new one?

Or rather, you try plugging it into another computer, check your router, restart your computer, check your drivers, …

Same thing with your A/B tests 😃

Don’t give up or jump to conclusions as soon as something doesn’t work. Look for other solutions and test again, and again.

Okay, you are now aware of several ways you could have been misinterpreting your A/B test results. We’re making progress!

Next time, we’ll take a look inside our brains, how they could be playing tricks on us and jeopardizing our A/B Tests *cue spooky music*.

OR, if you’re highly motivated, you can download our 10,000-word ebook on A/B Testing Mistakes and get ALL the content RIGHT NOW.

PS: Before you go, a couple of things I’d like you to do:

  1. If this article was helpful in any way, please let me know. Either leave a comment, or hit me up on Twitter @kameleoonrocks.
  2. If you’d like me to cover a topic in particular or have a question, same thing: reach out.

It’s extremely important for me to know that I write content both helpful and focused on what matters to you.

I know it sounds cheesy and fake, but I feel like writing purely promotional, “empty” stuff just for the exercise is a big waste of time for everyone: useless for you, and extremely unenjoyable for me to write.

So let’s work together!

PS2: If you missed the first 3 articles in our series, here they are:

Jean-Baptiste Alarcon

Jean-Baptiste is a Growth Marketer at Kameleoon. Aside from reading a lot and drinking coffee like his life depends on it, he leads Kameleoon's growth in English-speaking markets.