Review: Online Experiments Done Right

Original Microsoft Experimentation Platform Logo
Image Credit: Microsoft

Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained

by Ron Kohavi, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker and Ya Xu.

There are a lot of people doing online A/B and multivariate testing these days, but few of them bring as much analytic rigor to the process as Ronny Kohavi and his colleagues. Ronny and his collaborators are back with a new paper that anyone who wants to get trustworthy results from online experimentation should read. Among the puzzling outcomes explained in this paper are:

  • Poor search results increased queries per user and revenue per user. Bing ran an A/B test in which a bug seriously degraded search result quality in the B version, yet the business’s key performance indicators actually improved. Explanation: Because the results were poor, B-version users retried their searches and clicked more on ads in an attempt to find the information they wanted. This increased searches per user and ad clicks per user in the short term, but in the long term, most of those users would have switched to Google!
  • A slower user experience resulted in more clicks. It has been known for years that website delays as short as 250 milliseconds can negatively impact revenue and user experience. Yet a treatment that introduced a delay, required in order to update a session cookie, resulted in more clicks being recorded. Explanation: An investigation showed that the apparent increase in clicks was actually a side effect of the way some browsers work. Chrome, Firefox and Safari often terminate in-progress HTTP requests, including those used for click tracking, when the user navigates away from the current site, so clicks are undercounted. When a delay was introduced, fewer click-tracking requests were terminated and more clicks were recorded, even though users actually clicked less.
  • A trend from the first few days of an experiment didn’t hold in the long term. The explanation here is a great one that should be understood and internalized by everyone whose job entails interpreting statistical results. Explanation: During the first few days of a test, when there are fewer observations, there is more variability in the results; in statistical terms, the confidence interval is wider. By random chance, the mean of the results during these early days will sometimes show a directional trend, but analysts need to remember that such trends are not statistically significant. If the trend is in the experimenter’s hypothesized direction, confirmation bias will tempt them to believe that the hypothesis is being supported. Over a longer period, however, regression to the mean will even out such spurious results.
  • For some metrics, running experiments for a longer period does not always result in greater statistical power. One of the explanations for this is that, for some metrics, such as sessions per user, the standard deviation increases faster than the mean. There’s a lot more to it, but the discussion is beyond the scope of this post. Readers looking for an in-depth explanation should refer to the paper.
  • Some experiments show surprising results, with metrics unrelated to the experimental change moving in unexpected directions. Explanation: This can occur in experimentation systems that use a “bucket system” to assign users to experiments. Such systems randomize users into buckets and then assign buckets to experiments; the paper notes that Google, Bing and Yahoo use such systems. (We also use a bucketing system at Expedia.) Users in a particular bucket may adapt their behavior to the treatment they receive, and the habits they develop persist even after the experiment ends. The behavior of the people in that bucket during the next experiment is then no longer representative of the population of all users. The paper explains mitigations for this carryover effect.
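The early-trend pitfall above is easy to see concretely. Here is a minimal sketch, assuming a metric with a known standard deviation of 1 (an illustrative simplification, not anything from the paper): the half-width of a 95% confidence interval for a mean shrinks as 1/√n, and an A/A simulation with no true difference between variants shows how large a purely spurious "lift" can look in the first few days.

```python
import math
import random
import statistics

# Half-width of a 95% confidence interval for a sample mean,
# assuming a known standard deviation (sigma = 1, illustrative).
sigma = 1.0
for n in (100, 1_000, 10_000, 100_000):
    half_width = 1.96 * sigma / math.sqrt(n)
    print(f"n={n:>7}: 95% CI is the mean \u00b1 {half_width:.4f}")

# A/A simulation: both "variants" draw from the same distribution,
# so any apparent lift between them is pure noise.
random.seed(7)

def apparent_lift(n: int) -> float:
    a = [random.gauss(0.0, sigma) for _ in range(n)]
    b = [random.gauss(0.0, sigma) for _ in range(n)]
    return statistics.mean(b) - statistics.mean(a)

early = apparent_lift(100)      # "first few days"
late = apparent_lift(100_000)   # after a longer run
```

At n = 100 the interval is roughly ±0.196 wide, while at n = 100,000 it is roughly ±0.006, so an early directional trend carries very little evidence on its own.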
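A bucket system like the one in the last bullet is typically built on deterministic hashing, so the same user always lands in the same bucket. The sketch below is illustrative only, not any particular company's implementation; the bucket count, the hash function, and the experiment-id salt are all assumptions on my part. Salting the hash with an experiment id is one common mitigation for the carryover effect: it re-randomizes bucket membership between experiments, so habits a bucket picked up under one treatment do not bias the next experiment's population.

```python
import hashlib

NUM_BUCKETS = 1000  # illustrative choice, not a real system's value

def bucket_for(user_id: str, salt: str = "") -> int:
    """Deterministically hash a user id (plus an optional salt) into a bucket."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def assign(user_id: str, experiment_id: str) -> str:
    # Salting with the experiment id re-randomizes bucket membership per
    # experiment, so carryover behavior from a fixed bucket does not
    # distort the next experiment's population.
    return "A" if bucket_for(user_id, salt=experiment_id) < NUM_BUCKETS // 2 else "B"

# Assignment is sticky within an experiment...
assert assign("user-42", "exp-1") == assign("user-42", "exp-1")

# ...but independent across experiments: roughly a quarter of all users
# land in A for one experiment and B for another.
users = [f"user-{i}" for i in range(10_000)]
crossover = sum(
    assign(u, "exp-1") == "A" and assign(u, "exp-2") == "B" for u in users
)
```

Without the salt, every experiment would see the same fixed buckets, which is exactly the situation that produces the surprising metric movements described above.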


This is another in the series of excellent online experimentation papers from Kohavi et al. at Microsoft. Ronny has archived all of the Microsoft Experimentation papers, and practitioners and academics who are serious about getting trustworthy results from online experiments should read them.

Full disclosure: I had the pleasure of working on Ronny’s Experimentation Platform team at Microsoft for four years and had the opportunity to co-author two well-received papers during that time: Seven Pitfalls to Avoid when Running Controlled Experiments on the Web and Online Experimentation at Microsoft.

Your comments are welcome. Off-topic comments will not be published. If you have a question unrelated to this post, click the "Contact" item in the navigation menu and submit it via email.
