Are You Misinterpreting Your A/B Test Results?

Once you have a process, know how to formulate a hypothesis, set up your A/B tests and know when to press stop, you’re all good, right? You’re on top of your A/B Testing!

Nope.

Making sure you’re correctly interpreting your results is just as important. A/B testing is about learning and making informed decisions based on your test results.

So let’s make sure we don’t mess that up!

In this article, the fourth in our series on A/B Testing Mistakes, we’ll look into the ways you might be misinterpreting your test results.

If you want all 10,000 words at once NOW, you can download our ebook with all the articles.

You don’t know about false positives

Are you aware that there are actually 4 outcomes to an A/B Test?

What do you mean, it’s either a win or a loss, no?

Nope, it can be:

  • False positive (you detect a winner when there isn’t one)
  • False negative (you don’t detect a winner when there is one)
  • No difference between A & B (inconclusive)
  • Win (either A or B converts more)

(If you’re a bit hardcore and want to know more about this, check out hypothesis testing. It’s the actual mathematical method used for (frequentist) A/B Testing.)

Why should you care?

[Image: Willy Wonka sarcastically saying he wants to hear more about false positives]

Because you could have been interpreting false positives as genuine wins. And invested money in them.
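To see how easily that happens, here’s a minimal simulation in Python (a sketch, not what any testing tool actually runs): it uses a made-up 5% conversion rate and a simple two-proportion z-test at 95% significance, and runs A/A tests where both “variations” are identical. A winner still gets declared in roughly 5% of them.

```python
import math
import random

def significant_at_95(conv_a, n_a, conv_b, n_b):
    """Simple two-proportion z-test at 95% (two-sided) significance."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return se > 0 and abs(p_a - p_b) / se > 1.96  # 1.96 = 95% two-sided threshold

random.seed(1)
runs, visitors, rate = 1_000, 2_000, 0.05  # made-up traffic and conversion rate

false_positives = 0
for _ in range(runs):
    # A/A test: both sides are the exact same page with the exact same 5% rate.
    conv_a = sum(random.random() < rate for _ in range(visitors))
    conv_b = sum(random.random() < rate for _ in range(visitors))
    if significant_at_95(conv_a, visitors, conv_b, visitors):
        false_positives += 1

print(f"A 'winner' was declared in {false_positives / runs:.1%} of {runs} identical A/A tests")
# Expect something close to 5%: exactly the false positive risk you accept at 95% significance.
```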

Take the well-known example from Google: 41 shades of blue. (Nope, it has nothing to do with the books.)

Doug Bowman, Google’s lead designer at the time, actually left the company over this (though mostly for design reasons):

“Yes, it’s true that a team at Google couldn’t decide between two blues, so they’re testing 41 shades between each blue to see which one performs better. I had a recent debate over whether a border should be 3, 4, or 5 pixels wide, and was asked to prove my case. I can’t operate in an environment like that. I’ve grown tired of debating such minuscule design decisions…” (You can read the full article here).

Whether or not you agree with him from a design standpoint, this kind of test can also be mathematically wrong depending on how you run it.

You have two ways of approaching this:

  1. You do “cascade testing”, i.e. A vs B, then B vs C, then C vs D, … THIS IS BAD, DON’T DO IT. We’ll see why in a second.
  2. You do A/B/n testing, meaning you test all variations in parallel.

Cascade Testing

Imagine you want to test a different headline for a product page. You have your current one (A) against the new one (B). B wins, but your boss doesn’t like the wording and wants you to try a slightly different version. Then you feel like you could do better and change it again. And again.

You end up testing 10 different variations of this headline. Why is that a problem?

Let’s take a look: A vs B gave B as the winner with 95% statistical significance. As we saw in a previous article, it means that there is a 5% chance this result is a complete fluke or a “false positive”.

Then you tested a third headline, B vs C. C also won with 95% significance. The problem is that the false positive risk compounds across tests: after two tests in a row, the chance that at least one of your “winners” is a fluke is already close to 10% (1 − 0.95² ≈ 9.8%).

After 10 tests on your headline (C vs D, D vs E, …), even with 95% significance on your tenth test, you actually have a 40% chance of your winner being a false positive! (For 41 variations it becomes 88%!!!)

You’d be flipping a coin. Or deliberately shooting yourself in the foot, depending on how many times you repeat this.
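If you want the back-of-the-envelope math behind those numbers, here’s a quick sketch (it assumes each test in the cascade is run at 95% significance and is independent of the previous ones):

```python
# At 95% significance, each individual test carries a 5% false positive risk.
# Chain n tests and the chance that at least one "winner" was a fluke compounds.
alpha = 0.05  # false positive risk per test

for n_tests in (1, 2, 10, 41):
    risk = 1 - (1 - alpha) ** n_tests
    print(f"{n_tests:>2} chained tests -> {risk:.0%} chance at least one winner is a fluke")

# 1 -> 5%, 2 -> 10%, 10 -> 40%, 41 -> 88%
```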

Don’t do cascade testing. Just don’t. Okay? Kittens will die if you do. Look at him, we don’t want that, do we?

[Image: a cute kitten that will die if you do cascade testing]

A/B/n Testing

A/B/n testing is when you test n variations against your control (A) instead of just one (B). You run your control A against variations B, C, D, E, F, etc. at the same time, under the same conditions.

This is absolutely fine. BUT, as we saw in our article on when to stop your A/B tests, you need at least 300 conversions PER variation to call the test off.

In our Google example, you would need 41 × 300 = 12,300 conversions. That’s a lot.

If you have Google-like traffic, that’s okay. If you’re a mere mortal like the rest of us, though, it’s a big fat waste of time.
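How big a waste? Here’s a rough, hypothetical estimate reusing the 300-conversions-per-variation rule of thumb; the traffic and conversion rate below are made-up numbers, so plug in your own:

```python
def abn_test_duration_days(n_variations, daily_visitors, conversion_rate,
                           min_conversions_per_variation=300):
    """Rough number of days before every variation reaches the conversion threshold,
    assuming traffic is split evenly and conversion rates are roughly similar."""
    total_conversions_needed = n_variations * min_conversions_per_variation
    daily_conversions = daily_visitors * conversion_rate
    return total_conversions_needed / daily_conversions

# 41 shades of blue, 5,000 visitors a day, 2% conversion rate.
print(f"{abn_test_duration_days(41, 5_000, 0.02):.0f} days")  # ~123 days
# A more modest A/B/C test on the same traffic.
print(f"{abn_test_duration_days(3, 5_000, 0.02):.0f} days")   # ~9 days
```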

You could even end up testing for too long and get skewed results. This kind of test is rarely needed and can often be avoided entirely with a better hypothesis.

You’re not checking your segments

Don’t make Avinash Kaushik sad (one of the fathers of web analytics, if you’re wondering).

He has a rule: “Never report a metric without segmenting it to give deep insights into what that metric is really hiding behind it.”

Most data you get from your analytics tool is aggregated data. It takes all your traffic and mashes it into pretty but not very actionable charts.

Your website serves a number of functions, and your visitors come with different objectives in mind. Even when they come for the same reason, they probably don’t need the same content.

If you want an effective website, you can’t treat your traffic as a faceless blob; you need to segment.

[Image: a pile of mashed potato representing your data]

The same applies to your test results. If you don’t segment them, you could be wrongly dismissing tests.

An experiment could show your variation losing overall but winning on a particular segment.

Be sure to check your segments before closing the book on a test!

Important side note: when checking segments in an experiment’s results, remember that the same rules about statistical validity apply. Before declaring that your variation won on a particular segment, check that you have enough conversions and a large enough sample size within that segment.
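Here’s a minimal sketch of what that looks like in practice. All numbers are hypothetical, and the check uses the 300-conversions rule of thumb plus a simple two-proportion z-test at 95% significance, which is one common way to do it, not necessarily what your testing tool does:

```python
import math

def significant_at_95(conv_a, n_a, conv_b, n_b):
    """Simple two-proportion z-test at 95% (two-sided) significance."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return se > 0 and abs(p_a - p_b) / se > 1.96

# Hypothetical results: (conversions A, visitors A, conversions B, visitors B)
segments = {
    "overall":    (570, 10_000, 550, 10_000),
    "search":     (420,  8_000, 310,  8_000),
    "newsletter": (150,  2_000, 240,  2_000),
}

for name, (c_a, n_a, c_b, n_b) in segments.items():
    enough_conversions = min(c_a, c_b) >= 300  # the rule of thumb from our earlier article
    print(f"{name:>10}: A {c_a / n_a:.1%} vs B {c_b / n_b:.1%} "
          f"| enough conversions: {enough_conversions} "
          f"| significant: {significant_at_95(c_a, n_a, c_b, n_b)}")

# Overall the test looks inconclusive, B clearly loses on search traffic, and the
# very promising newsletter segment doesn't have enough conversions to trust yet.
```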

Here are three ways you can segment your data:

1. By source

Where do your visitors come from (Ads, Social Networks, Search Engines, Newsletter, …)?
Then you can look at things like: which pages they go to depending on where they come from, their bounce rate, differences in loyalty, whether they come back…

2. By behavior

What do they do on your website? People behave differently depending on their intent / needs.
You can ask: what content do people who visit your website 10+ times a month read, versus those who only come twice? What page did people who view 5+ pages per visit arrive on, versus people who only looked at one? Do they look at the same products or price range?

3. By outcome

Segment by the actions people took on your website: bought a product, subscribed to a newsletter, downloaded a premium resource, applied for a loyalty card, …

Make groups of visitors with similar outcomes and ask the same type of questions we asked above. You’ll see which campaigns worked, which products to kill, and so on.

By segmenting you get actionable data and accurate results. With actionable data and accurate results you can make informed decisions, and with informed decisions … $$$$!

[Image: Scrooge McDuck swimming in money because he made informed decisions thanks to A/B testing]

You’re testing too many variables at once

You got the message: you need to test high-impact changes.

So you change the CTA and the headline, add a video and a testimonial, and rewrite the text. Then you test it all against your current page. And it wins.

Good, right? Well… Not really.

How will you know which of your changes improved conversions on your page and which dragged them down?

This is where the question “How will I measure success?” takes on its full meaning. Testing is awesome, but if you can’t really measure what happened, what moved the needle, it’s not so useful.

What have you learned? That some combination of your changes improved conversions?

And what if one of those changes improved conversions while the others dragged them down enough for the variation to lose? You’d count the test as a failure when part of it actually worked.

Make sure to clearly specify what success looks like and that you’re set up to measure it.

If you can’t measure it, you won’t learn. If you don’t learn, you can’t reproduce or improve it.

Don’t test multiple variables at once.

Unless you know how to do multivariate testing; then it’s fine. But since it requires a gigantic amount of traffic, we rarely see it used.
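To give you a sense of why, here’s a back-of-the-envelope sketch, again reusing the 300-conversions-per-variation rule of thumb (the numbers of variants per element are made up):

```python
# Every element you vary multiplies the number of combinations you're really testing,
# and each combination still needs enough conversions on its own.
from math import prod

def multivariate_requirements(variants_per_element, min_conversions=300):
    """variants_per_element: e.g. [2, 3] means 2 CTAs x 3 headlines."""
    combinations = prod(variants_per_element)
    return combinations, combinations * min_conversions

print(multivariate_requirements([2, 2]))        # 2 CTAs x 2 headlines -> (4, 1200)
print(multivariate_requirements([2, 2, 3, 2]))  # add videos and testimonials -> (24, 7200)
```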

You give up on a test after it fails the first time

If you followed our guidelines on how to craft an informed hypothesis, each of your tests should be derived from one of the following (ideally a combination of several):

  • Web Analytics
  • Heatmaps
  • Usability tests
  • User interviews
  • Heuristic analysis

For example, you have analytics data showing people staying a while on your product page, then leaving.

[Image: a cute panda asking the visitors of his website not to leave]

You also have an on-page survey where visitors told you they weren’t quite convinced that your product answered their needs.

Your heuristic analysis showed you had clarity issues.

Click maps show people going through all your product pictures.

You then decide to test changing the copy and adding better pictures to make the page clearer.

Your test ends and … results are inconclusive, no increase in conversions.

What do you do now? Do you put a check mark in the “too bad” column, conclude that clarity wasn’t an issue after all, and move on to another test?

No, of course you don’t. A/B Testing is an iterative process. Take another look at your data, devise ways to improve your page.

  • You could add testimonials
  • You could remove information not relevant to the product
  • You could add a video

Since you now know not to do cascade testing (which is completely different from iterative testing, because you’re not pitting X versions of the same headline or picture against the winner of the previous test) and not to test everything at once, you can embrace iterative testing.

There isn’t just ONE solution to a given problem. There are an infinite number of them, and it could very well be a combination of several solutions.

[Image: Jeff Bezos quote about testing and being stubborn]

Let’s be a tad extreme to illustrate this.

When your internet cuts off, what do you do? If you’re connected through an Ethernet cable, maybe you try unplugging and re-plugging it.

If it doesn’t change anything, do you then conclude that your cable is dead, and go buy a new one?

Or rather, you try plugging it into another computer, check your router, restart your computer, check your drivers, …

Same thing with your A/B tests 😃

Don’t give up or jump to conclusions as soon as something doesn’t work. Look for other solutions and test again, and again.

Okay, you are now aware of several ways you could have been misinterpreting your A/B test results. We’re making progress!

Next time, we’ll take a look inside our brains, how they could be playing tricks on us and jeopardizing our A/B Tests *cue spooky music*.

OR, if you’re highly motivated, you can download our 10,000-word ebook on A/B Testing Mistakes and get ALL the content RIGHT NOW.

PS: Before you go, a couple of things I’d like you to do:

  1. If this article was helpful in any way, please let me know. Either leave a comment, or hit me up on Twitter @kameleoonrocks.
  2. If you’d like me to cover a topic in particular or have a question, same thing: reach out.

It’s extremely important for me to know that I write content both helpful and focused on what matters to you.

I know it sounds cheesy and fake, but I feel like writing purely promotional, “empty” stuff just for the exercise is a big waste of time for everyone: useless for you, and extremely unenjoyable for me to write.

So let’s work together!

PS2: If you missed the first 3 articles in our series, here they are:

Jean-Baptiste Alarcon

Jean-Baptiste is a Growth Marketer at Kameleoon. Aside from reading a lot and drinking coffee like his life depends on it, he leads Kameleoon's growth in English-speaking markets.