Seven Pitfalls to Avoid when Running Controlled Experiments on the Web

Seven Pitfalls to Avoid when Running Controlled Experiments on the Web
Roger Longbotham, Ronny Kohavi, Thomas Crook, Brian Frasca
Microsoft Corporation, Experimentation Platform

Controlled Experiments in One Slide
- Concept is trivial:
  - Randomly split traffic between two (or more) versions: A/Control, B/Treatment
  - Collect metrics of interest
  - Analyze
- Best scientific way to prove causality, i.e., that the changes in metrics are caused by the changes introduced in the Treatment
- Must run statistical tests to confirm differences are not due to chance
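
The random split is typically implemented by hashing a user ID into buckets, so each user consistently sees the same variant. A minimal sketch in Python (the function name and bucket scheme are our own illustration, not from the talk):

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: float = 50.0) -> str:
    """Deterministically bucket a user by hashing (experiment, user_id).

    The same user always gets the same variant within an experiment,
    and different experiments get independent splits.
    """
    h = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(h, 16) % 10_000  # 0..9999
    return "treatment" if bucket < treatment_pct * 100 else "control"
```

Hashing on both the experiment name and the user ID keeps splits independent across concurrent experiments.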

Pitfall 1: Wrong Success Metric
- Office Online tested a new design for its homepage
- Objective: increase sales of Office products
- Overall Evaluation Criterion (OEC) was clicks on the Buy button
- Which one was better: Treatment or Control?

Pitfall 1: Wrong OEC
- Treatment had a 64% drop in the OEC!
- Were sales for the Treatment correspondingly lower as well?
- Our interpretation: not showing the price in the Control led more people to click just to determine the price
- It’s possible the Treatment group ended up with the same number of sales – that data was not available
- Lesson: measure what you really need to measure, even if it’s difficult!

Pitfall 2: Incorrect Interval Calculation
- Confidence intervals are a great way to summarize statistical results
- Example: the CI for a single mean, assuming a Normal distribution
- Some cases are not so straightforward, e.g. the CI for Percent Effect (use Fieller’s formula)
- Example: the difference in two means is 0.62, and the 95% CI for the difference is (0, 1.24)
- The percent effect is an increase of 62%, but the 95% CI for the percent effect is (0%, 201%) => not symmetric about 62%
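
Fieller's theorem gives the correct, asymmetric interval for a ratio of two means. Below is a sketch for independent Treatment/Control means (the function name and inputs are illustrative; `se_t` and `se_c` are the standard errors of the two means, and the slide does not show this exact parameterization):

```python
import math

def fieller_percent_effect_ci(mean_t, se_t, mean_c, se_c, z=1.96):
    """Approximate 95% CI for the percent effect 100*(mean_t/mean_c - 1),
    via Fieller's theorem with independent samples (zero covariance)."""
    a, b = mean_t, mean_c
    v11, v22 = se_t ** 2, se_c ** 2
    denom = b * b - z * z * v22
    if denom <= 0:
        raise ValueError("control mean is not significantly different from zero")
    # discriminant is positive whenever denom > 0
    half = z * math.sqrt(a * a * v22 + b * b * v11 - z * z * v11 * v22)
    lo = (a * b - half) / denom
    hi = (a * b + half) / denom
    return 100 * (lo - 1), 100 * (hi - 1)
```

Note the interval it returns is wider above the point estimate than below it, unlike a naive symmetric interval around the percent effect.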

Pitfall 3: Using Standard Formulas for Standard Deviation
- Most metrics for online experiments cannot use the standard statistical formulas
- Example: click-through rate (CTR)
- The standard statistical approach would assume clicks are approximately Bernoulli
- However, the true standard deviation can be much larger than that, depending on the site: randomization is by user, but CTR aggregates over page views, and page views from the same user are correlated
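
To see the understatement concretely, one can compare the naive Bernoulli standard error against a user-level (delta-method) standard error on simulated data where each user has their own click propensity. A sketch; the simulation parameters are invented for illustration:

```python
import random

random.seed(0)

# Simulate users: each user has their own CTR and a variable number of views,
# so page views within a user are correlated (violating the Bernoulli model).
users = []
for _ in range(5_000):
    views = random.randint(1, 50)
    p_user = random.uniform(0.01, 0.30)
    clicks = sum(random.random() < p_user for _ in range(views))
    users.append((clicks, views))

n = len(users)
C = sum(c for c, v in users)
V = sum(v for c, v in users)
ctr = C / V

# Naive SE: treats every page view as an independent Bernoulli trial
se_naive = (ctr * (1 - ctr) / V) ** 0.5

# Delta-method SE for the ratio sum(clicks)/sum(views),
# treating the *user* as the unit of analysis
cbar, vbar = C / n, V / n
var_c = sum((c - cbar) ** 2 for c, v in users) / (n - 1)
var_v = sum((v - vbar) ** 2 for c, v in users) / (n - 1)
cov_cv = sum((c - cbar) * (v - vbar) for c, v in users) / (n - 1)
se_user = ((var_c - 2 * ctr * cov_cv + ctr ** 2 * var_v) / (n * vbar ** 2)) ** 0.5
```

With heterogeneous users, `se_user` comes out noticeably larger than `se_naive`; using the naive formula would make differences look significant when they are not.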

Pitfall 4: Combining Data when Percent to Treatment Varies – Simpson’s Paradox
- Simplified example: 1,000,000 users per day
- For each individual day the Treatment is much better
- However, the cumulative result for the Treatment is worse

Conversion Rate for two days:

             Friday (C/T split: 99/1)   Saturday (C/T split: 50/50)   Total
Control      20,000/990,000 = 2.02%     5,000/500,000 = 1.00%         25,000/1,490,000 = 1.68%
Treatment    230/10,000 = 2.30%         6,000/500,000 = 1.20%         6,230/510,000 = 1.22%

One situation where you would face this issue is when ramping up the percent of users who are in the Treatment: as you gain confidence that the Treatment does not have bugs or other problems, you increase the percent in the Treatment at the beginning of the experiment.
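
The paradox can be reproduced in a few lines using exactly the numbers from the table:

```python
# (conversions, users) per day and variant, from the table above
data = {
    "Friday":   {"control": (20_000, 990_000), "treatment": (230, 10_000)},
    "Saturday": {"control": (5_000, 500_000),  "treatment": (6_000, 500_000)},
}

def rate(conversions, users):
    return conversions / users

# Treatment wins on each individual day...
daily_wins = all(rate(*d["treatment"]) > rate(*d["control"]) for d in data.values())

# ...but loses when the two days are pooled, because almost all of the
# Treatment's traffic came on Saturday, the lower-converting day.
def pooled(variant):
    conv = sum(d[variant][0] for d in data.values())
    users = sum(d[variant][1] for d in data.values())
    return conv / users

pooled_control = pooled("control")
pooled_treatment = pooled("treatment")
```

The fix is to compare only periods with the same Control/Treatment split, or to weight days appropriately, rather than pooling raw counts.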

Pitfall 5: Not Filtering out Robots
- Internet sites can get a significant amount of robot traffic (search engine crawlers, email harvesters, botnets, etc.)
- Robots can cause misleading results
- Most concerning are robots with high traffic (e.g. clicks or page views) that stay in Treatment or Control (we’ve seen one robot with > 600,000 clicks in a month on one page)
- Identifying robots can be difficult: some robots identify themselves, but many look like human users and even execute JavaScript
- Use heuristics to identify and remove robots from the analysis (e.g. more than 100 clicks in an hour)
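
The more-than-100-clicks-per-hour heuristic can be sketched as a simple filter (the event format and function names are our own, not from the talk):

```python
from collections import defaultdict

CLICK_THRESHOLD = 100  # heuristic from the talk: > 100 clicks in an hour

def flag_robots(events):
    """events: iterable of (user_id, hour, clicks) tuples.

    Returns the set of user_ids whose total clicks in any single
    hour exceed the threshold."""
    per_hour = defaultdict(int)
    for user, hour, clicks in events:
        per_hour[(user, hour)] += clicks
    return {user for (user, hour), total in per_hour.items()
            if total > CLICK_THRESHOLD}

def filter_robots(events):
    """Drop every event from any flagged user before analysis."""
    robots = flag_robots(events)
    return [e for e in events if e[0] not in robots]
```

Note that all traffic from a flagged user is removed, not just the hours that exceeded the threshold, since a robot's "normal-looking" hours are still robot traffic.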

Effect of Robots on an A/A Experiment
- Each hour represents clicks from thousands of users
- The “spikes” can be traced to single “users” (robots)

Pitfall 6: Invalid Instrumentation
- Validate the initial instrumentation:
  - Logging audit – compare experimentation observations with the recording system of record
  - A/A experiment – run a “mock” experiment where users are randomly assigned to two groups but both groups get the Control
- In an A/A experiment, expect about 5% of metrics to be statistically significant
- P-values should be uniformly distributed on the interval (0,1), and no p-values should be very close to zero (e.g. < 0.001)
- A surprising number of partners initially fail either the logging audit or the A/A experiment
- It’s easy to get a number from an experiment; it’s much harder to get a correct number. Building trust in the result through audits is critical.
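
The expected behavior of an A/A experiment (roughly 5% of tests significant at p < 0.05, p-values roughly uniform) can be checked in simulation. A sketch using a two-proportion z-test; the sample sizes and conversion rate are invented:

```python
import math
import random

random.seed(1)

def two_prop_pvalue(x1, n1, x2, n2):
    """Two-sided p-value for a pooled two-proportion z-test."""
    p = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))  # P(|Z| > z) for standard normal

# Simulate 1,000 A/A "experiments": both groups draw from the SAME
# conversion rate, so any significant difference is a false positive.
N, P = 2_000, 0.05
pvals = []
for _ in range(1_000):
    x1 = sum(random.random() < P for _ in range(N))
    x2 = sum(random.random() < P for _ in range(N))
    pvals.append(two_prop_pvalue(x1, N, x2, N))

false_positive_rate = sum(p < 0.05 for p in pvals) / len(pvals)
```

If the false positive rate is far from 5%, or the p-value distribution is visibly non-uniform, something is wrong with the instrumentation or the analysis before any real Treatment is tested.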

Pitfall 7: Insufficient Experimental Control
- Must make sure the only difference between Treatment and Control is the change being tested
- Hourly click-through rate was plotted for T and C for a recent experiment
- Headlines were supposed to be the same in both
- One headline was different for one 7-hour period, changing the result of the experiment

Experimentation is Easy!
- But it requires vigilance and attention to detail
- “Good judgment comes from experience, and a lot of that comes from bad judgment.” -- Will Rogers