Are You Stopping Your A/B Tests Too Early?
Stopping A/B Tests too early is without a doubt the most common, and one of the most potent, A/B Testing mistakes.
But knowing when to stop is really not intuitive. We were going to write an article on when to stop your A/B Test for both main statistical methods of A/B Testing, frequentist and Bayesian. But we encountered two problems.
First, most people doing A/B Testing don’t know, and don’t care, whether their tool uses frequentist or Bayesian statistics.
Second, when you dig around a bit in the different A/B Testing solutions, you find that no two tools use exactly the same statistical method.
So, how could we write something helpful?
Here is what we came up with. We will do our best to answer the following question: What concepts do I need to understand not to stop my A/B Test too early?
Thus, we will cover today:
- Significance level
- Sample size
- Duration
- Variability of data
Note: None of these elements are stopping rules on their own. But having a better grasp of them will allow you to make better decisions.
Don’t trust test results below a 95% significance level. But don’t stop a test just because it reached this level.
When your A/B Testing tool tells you something along the lines of “your variation has an X% chance of beating the control”, it’s actually giving you the statistical significance level.
At 95%, another way to put it is: “there is a 5% (1 in 20) chance that the result you see is completely random”. Or: “there is a 5% chance that the difference in conversion measured between your control and variation is imaginary”.
You want at minimum 95%. Not less. Yes, an 80% chance does sound like a solid winner, but that’s not why you’re testing. You don’t want just a “winner”. You want a statistically valid result. Your time & money are at stake, so let’s not gamble!
From experience, it’s not uncommon for a test to show a clear winner at 80% significance that then actually loses when you let the test run properly.
Okay, so if my tool tells me that my variation has a 95% chance of beating the original, I’m golden, right? Well … no. Statistical significance doesn’t imply statistical validity and isn’t a stopping rule on its own.
If you ran a fake test with the same version of a page on both sides (an A/A test), you’d have more than a 70% chance of seeing it reach the 95% significance level at some point.
You got it: shoot for a 95%+ significance level, BUT don’t stop your test just because it reached it.
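To see why peeking is dangerous, here is a minimal simulation sketch of an A/A test. It assumes a plain two-proportion z-test, which may not be what your tool uses, and the function names, the 10% conversion rate and the traffic numbers are all hypothetical. Both arms share the same true rate, so any "significant" result is a false positive. The exact percentage you get depends on how often you peek and for how long, so treat the output as an illustration rather than a benchmark.

```python
import random
from statistics import NormalDist


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-proportion z-test with a pooled standard error (an assumed method)."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0  # no conversions yet (or all converted): nothing to compare
    z = (conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def aa_test_ever_significant(rate, visitors_per_arm, check_every, rng):
    """Run one A/A test and report whether any peek showed p < 0.05."""
    conv_a = conv_b = 0
    for n in range(1, visitors_per_arm + 1):
        conv_a += int(rng.random() < rate)
        conv_b += int(rng.random() < rate)
        if n % check_every == 0:
            if two_sided_p_value(conv_a, n, conv_b, n) < 0.05:
                return True
    return False


def false_positive_rate(runs, rate, visitors_per_arm, check_every, seed=0):
    """Share of simulated A/A tests that looked significant at least once."""
    rng = random.Random(seed)
    hits = sum(
        aa_test_ever_significant(rate, visitors_per_arm, check_every, rng)
        for _ in range(runs)
    )
    return hits / runs


# Peek every 50 visitors per arm versus looking only once at the end.
peeking = false_positive_rate(runs=300, rate=0.10, visitors_per_arm=2000, check_every=50)
single_look = false_positive_rate(runs=300, rate=0.10, visitors_per_arm=2000, check_every=2000)
print(f"Reached 95% at some point while peeking: {peeking:.0%}")
print(f"Reached 95% with a single final look:    {single_look:.0%}")
```

The more often you check, and the longer the test runs, the more chances random noise gets to cross the 95% line at least once.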
You need a sample representative of your overall audience (ignore that if you’re testing on a specific segment) and large enough not to be vulnerable to the data’s natural variability.
When you do A/B Testing you can’t measure your “true conversion rate” because it’s an ever-moving target.
You arbitrarily choose a portion of your audience with the following assumption: the selected visitors’ behavior will correlate with what would have happened with your entire audience.
Know your audience. Conduct a thorough analysis of your traffic before launching your A/B Tests.
Here are a couple of examples of things you need to know:
- How many of my visitors come from PPC, direct traffic, organic search, email, referral, …?
- What percentage of my visitors are returning versus new?
The problem is your traffic keeps evolving, so you won’t know everything with 100% accuracy.
So, ask this: Is my sample representative of my entire audience, in proportions and composition?
Another issue if your sample is too small is the impact your outliers will have on your experiment. The smaller your sample is, the higher the variations between measures will be.
What does that mean? Let’s try with a real-life example.
Here’s the data from tossing a coin 10 times. H (head), T (tail). We know the “true” probability of our coin is 50%. We repeat the series of 10 tosses 5 times and track the % of heads in each series.
The outcomes vary from 30% to 80%.
Same experiment, but we toss the coin 100 times instead of 10.
The outcomes vary from 47% to 54%.
The larger your sample size is, the closer your result gets to the “true” value.
It’s so much easier to grasp with an actual example 🙂
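If you want to reproduce the coin experiment yourself, here is a small sketch. The function name and the seed are arbitrary choices, and since the tosses are random, your ranges will not match the 30% to 80% and 47% to 54% figures above exactly.

```python
import random


def heads_percentages(tosses_per_series, series, rng):
    """Toss a fair coin `tosses_per_series` times, `series` times over,
    and return the percentage of heads observed in each series."""
    results = []
    for _ in range(series):
        heads = sum(rng.random() < 0.5 for _ in range(tosses_per_series))
        results.append(100 * heads / tosses_per_series)
    return results


rng = random.Random(42)  # fixed seed so that reruns give the same numbers
for tosses in (10, 100, 10_000):
    pcts = heads_percentages(tosses, series=5, rng=rng)
    print(f"{tosses:>6} tosses per series: min {min(pcts):.1f}%, max {max(pcts):.1f}%")
```

The gap between the minimum and the maximum should generally narrow as the number of tosses per series grows.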
With conversion rates, your variation could be winning by far on the first day simply because you had just sent out your newsletter and most of your traffic came from existing clients, for example.
They like you considerably more than normal visitors, so they reacted positively to your experiment.
Should you stop the test here, even with a significance level at 95%, you would have skewed results. The real result could be the exact opposite for all you know.
You would have made a business decision based on false data. Whoops …
How big should your sample be?
There is no magical number that will solve all your problems, sorry. It comes down to how much of an improvement you want to be able to detect. The bigger the lift you want to detect, the smaller the sample size you’ll need.
And even if you have Google-like traffic, it isn’t a stopping condition on its own. We’ll see that next.
One thing is true for all statistical methods though: the more data you collect, the more accurate or “trustworthy” your results will be.
But it varies depending on the method your tool uses.
I’ll give you here what we tell our clients, for our tool, that uses frequentist statistics.
Let me insist on the fact that those numbers might not be optimal for you if your tool doesn’t use frequentist statistics. That said, using them won’t harm the validity of your results.
To determine sample size, we advise our clients to use a calculator like this one (we have one in our solution, but this one is extremely good too).
It gives an easy-to-read number, without you having to worry about the math too much. And it prevents you from being tempted to stop your test prematurely, as you’ll know that till this sample size is reached, you shouldn’t even look at your data.
You’ll need to input the current conversion rate of your page and the minimum lift you want to track (i.e. the minimum improvement you’d be happy with).
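For the curious, here is a rough sketch of the kind of math such a calculator might do. It is not the formula of any specific tool. It assumes a two-sided two-proportion z-test, a 95% significance level, 80% statistical power (our assumption, as calculators differ on this), and a lift expressed relative to the current conversion rate. The function name is hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation to detect `relative_lift`
    over `baseline_rate` (classic two-proportion formula, two-sided test)."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95%
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Hypothetical page converting at 5%: compare a 10% and a 20% minimum lift.
for lift in (0.10, 0.20):
    n = sample_size_per_variation(baseline_rate=0.05, relative_lift=lift)
    print(f"Minimum lift of {lift:.0%}: about {n:,} visitors per variation")
```

Halving the lift you want to detect roughly quadruples the sample you need, which is why small improvements are so expensive to confirm.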
We then recommend at least 300 macro conversions (meaning your primary goal) per variation before even considering stopping the test.
I’ll say it again: it’s not a magic number.
We sometimes shoot for 1,000 conversions per variation if our client’s traffic allows us to. The larger the better, as we saw earlier. It could also be a bit less if there is a considerable difference between the conversion rates of your control and variation.
Okay, so if I have lots of traffic and a large enough sample size with 95% significance in 3 days, it’s great, right?
Welp, kudos on your traffic, but sorry no again …
You should run your tests for full weeks at a time, and we recommend you test for at least 2–3 weeks. If you can, make the test last for 1 (or 2) business cycle(s).
You already know that for emails and social media, there are optimal days (even hours) to post.
People behave differently on given days and are influenced by a number of external events. Well, same thing for your conversion rates. Don’t believe me? Try it. Run a conversions-by-day report for a week and you’ll see how much it can vary from one day to the next.
This means, if you started a test on a Thursday, end it on a Thursday. (We’re not saying you should test for just one week.) Test for at least 2-3 weeks. More would be better though.
1 to 2 business cycles would be great, as you’ll get both people who just heard of you and people who are close to buying, while accounting for most external factors (we’ll talk more about those in our article on external validity threats) and sources of traffic.
If you must extend the duration of your test, extend it by a full week.
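The "full weeks" rule is easy to turn into arithmetic. Here is a small sketch that rounds a test up to whole weeks, with a floor of 2 weeks as recommended above. The function name, the traffic figures and the start date are made up for illustration.

```python
from datetime import date, timedelta
from math import ceil


def planned_duration_days(sample_per_variation, variations, daily_visitors, min_weeks=2):
    """Days needed to reach the sample size, rounded up to full weeks."""
    total_needed = sample_per_variation * variations
    days_for_sample = ceil(total_needed / daily_visitors)
    weeks = max(min_weeks, ceil(days_for_sample / 7))
    return weeks * 7


# Hypothetical test: 8,000 visitors per variation, control + 1 variation.
start = date(2024, 1, 4)  # a Thursday
for daily_visitors in (3000, 500):
    days = planned_duration_days(8000, variations=2, daily_visitors=daily_visitors)
    end = start + timedelta(days=days)
    print(f"{daily_visitors} visitors/day: run {days} days, "
          f"from {start:%A %d %b} to {end:%A %d %b}")
```

Because the duration is always a multiple of 7, the test ends on the same weekday it started, and every day of the week is represented equally.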
Variability of data
If your significance level and/or the conversion rates of your variations are still fluctuating considerably, let your test keep running.
Two phenomena to consider here:
- The novelty effect: When people react to your change just because it’s new. It will fade with time.
- Regression to the mean: This is what we talked about earlier: the more data you record, the closer you get to the “true value”. This is why your tests fluctuate so much at first: you have few measures, so outliers have a considerable impact.
This is also why the significance level isn’t enough on its own. During a test, you’ll most likely reach 95% several times before you can actually stop your test.
Make sure your significance curve flattens out before calling it.
Same thing with the conversion rates of your variations, wait till the fluctuations are negligible considering the situation and your current rates.
Some tools give you a confidence interval, for example “Variation A has a conversion rate of 18.4% ± 1.2% and Variation B 14.7% ± 0.8%”.
This means the conversion rate of Variation A is between 17.2% (18.4 – 1.2) and 19.6% (18.4 + 1.2), and the conversion rate of Variation B is between 13.9% (14.7 – 0.8) and 15.5% (14.7 + 0.8).
If the 2 intervals overlap, keep testing. Your confidence intervals will get more precise as you gather more data.
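Here is a minimal sketch of that overlap check. It assumes a simple normal-approximation (Wald) interval at 95% confidence, which is only one of several ways a tool might compute its intervals. The function names and visitor counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def conversion_interval(conversions, visitors, confidence=0.95):
    """Normal-approximation confidence interval for a conversion rate."""
    rate = conversions / visitors
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(rate * (1 - rate) / visitors)
    return rate - margin, rate + margin


def intervals_overlap(a, b):
    """True if the two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Same 5% vs 6% conversion rates, measured on 1,000 then 10,000 visitors each.
for visitors in (1_000, 10_000):
    control = conversion_interval(int(0.05 * visitors), visitors)
    variation = conversion_interval(int(0.06 * visitors), visitors)
    verdict = "overlap, keep testing" if intervals_overlap(control, variation) else "no overlap"
    print(f"{visitors:>6} visitors: control {control[0]:.2%} to {control[1]:.2%}, "
          f"variation {variation[0]:.2%} to {variation[1]:.2%} -> {verdict}")
```

With the figures from the example above, (17.2%, 19.6%) and (13.9%, 15.5%) do not overlap. Keep in mind this is only one of the checks discussed in this article, not a stopping rule on its own.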
So, whatever you do, don’t report on a test before it’s actually over. To resist the temptation to stop a test, it’s often best not to peek at the results before the end. If you’re unsure, it’s better to let it run a bit longer.
To stop an A/B Test, consider the following:
- Is your significance level equal to or greater than 95%?
- Is your sample large enough and representative of your overall audience in composition and proportions?
- Have you run your test for the appropriate length of time?
- Have your significance level and conversion rate curves flattened out?
Only after taking all of those into account can you stop your test. Don’t skip them, it’d cost you money.
All right, that’s all for today. We hope this article helped you understand those concepts better and will reduce the risk of you calling a test too early.
We will definitely come back with additional content on this particular topic in the future.
Until next time …
PS: Before you go, a couple of things I’d like you to do:
- If this article was helpful in any way, please let me know. Either leave a comment, or hit me up on Twitter @kameleoonrocks.
- If you’d like me to cover a topic in particular or have a question, same thing: reach out.
It’s extremely important for me to know that the content I write is both helpful and focused on what matters to you.
I know it sounds cheesy and fake, but I feel like writing purely marketing, “empty” stuff, just for the exercise, is a big waste of time for everyone. Useless for you, and extremely unenjoyable for me to write.
So let’s work together!