27 May 2013

A/B Testing: Being pragmatic with statistical significance

One of the first things that we did before starting any of the A/B tests that I’ve previously written about was to work out how many users we needed to go through before we could be sure that the results we saw were statistically significant.

We used the prop.test function from R to do this and based on our traffic at the time worked out that we’d need to run a test for 6 weeks to achieve statistical significance.

That means that we could manage a maximum of 8 tests a year i.e. 8 changes a year which isn’t really feasible for a website which needs to evolve quickly to remain competitive.

We therefore had two ways that we could use A/B tests and make changes more frequently:

Don’t A/B test everything
Speed up the cycle by:
- iterating within a test i.e. making tweaks/fixing bugs
- ending the test early if control or experiment was consistently better.

Don’t A/B test everything

When we were looking through our product backlog it was interesting to see that there were some features that we agreed would be definite improvements while there were others that we weren’t really sure about.

As Nate Silver points out in 'The Signal and the Noise' whenever we rely on intuition we introduce bias but in this case the people making the call were experts of the domain so it seemed like a reasonable approach.

If we felt that a feature made the user experience more delightful or if it was an integral part of the way we were driving the product then we’d just make the change without running an A/B test.

We still had daily metrics which would highlight if things had been made significantly worse but we didn’t track the performance of those features with the same vigour as you would with an A/B test.

Speed up the cycle

Although it would take 6 weeks to be completely confident in an A/B test we were able to get feedback much more quickly as to which branch (master/experiment) was performing better.

If we noticed a drop off in conversion for a specific browser then we’d make the appropriate bug fixes while the test was still running.

In an ideal world I suppose we should have started the test from scratch but we didn’t do this unless there was something seriously wrong and we had to make major changes.

Instead we started tracking the daily conversion rate which allowed us to see whether our bug fixes had made an impact.

Tracking the daily conversion rate also allowed us to see whether one branch was consistently better or not - if we saw that one branch was consistently better over a week or two then we’d be confident that we should end the test.

On the other hand early on in the test cycle we’d see the control converting better on the first day, then the experiment converting better the next day before the control converted better for the next two.

When we observed that behaviour we assumed that random variation was at play and waited a few more days to see if it would settle down before we made a decision.

We were effectively trading off statistical significance against the number of tests that we could run which in this context was a reasonable trade off to make.

In summary

I’d not done A/B testing before I worked at uSwitch and I think the approach we took seems reasonable for most web based products.

We want to use data to help us make decisions but we need to leverage our intuition/domain expertise rather than relying completely on the data.

In other words we’re more data assisted rather than data driven.

About the author

I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.