Ronny Kohavi, Distinguished Engineer, General Manager, Analysis and Experimentation, Microsoft. Joint work with Thomas Crook, Brian Frasca, and Roger Longbotham, A&E Team.

Slides:




A/B Testing Pitfalls
Ronny Kohavi, Distinguished Engineer, General Manager, Analysis and Experimentation, Microsoft
Joint work with Thomas Crook, Brian Frasca, and Roger Longbotham, A&E Team
Slides at

A/B Tests in One Slide
- Concept is trivial
  - Randomly split traffic between two (or more) versions: A (Control) and B (Treatment)
  - Collect metrics of interest
  - Analyze
- An A/B test is the simplest controlled experiment
- A/B/n refers to multiple treatments (often used and encouraged: try control + two or three treatments)
- MVT refers to multivariable designs (rarely used by our teams)
- Must run statistical tests to confirm differences are not due to chance (a minimal sketch of such a test follows this slide)
- Best scientific way to prove causality, i.e., that the changes in metrics are caused by the changes introduced in the treatment(s)
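A minimal sketch (not from the original slides) of the kind of statistical test mentioned above, comparing a conversion-rate metric between Control and Treatment with a two-sample z-test on proportions. The counts are made up for illustration; real platforms typically use more careful per-randomization-unit variance estimates.

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference of two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)             # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))
    return p_b - p_a, z, p_value

# Hypothetical counts: 2.00% vs 2.08% conversion on 500K users per variant.
delta, z, p = two_proportion_ztest(10_000, 500_000, 10_400, 500_000)
print(f"delta = {delta:+.3%}, z = {z:.2f}, p = {p:.4f}")
```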

ConversionXL Audience Statistics
- 83% of attendees ran fewer than 30 experiments last year.
- Experimenters at Microsoft use our ExP platform to start ~30 experiments per day.

Experimentation at Scale
- I've been fortunate to work at an organization that values being data-driven (video)
- We finish ~300 experiment treatments per week, mostly on Bing and MSN, but also on Office, OneNote, Xbox, Cortana, Skype, Exchange, and OneDrive. (These are "real," useful treatments, not a 3x10x10 MVT counted as 300.)
- Each variant is exposed to between 100K and millions of users, sometimes tens of millions
- At Bing, 90% of eligible users are in experiments (10% are a global holdout, changed once a year)
- There is no single Bing. Since a user is exposed to over 15 concurrent experiments, they get one of 5^15 ≈ 30 billion variants (debugging takes on a new meaning)
- Until 2014, the system limited usage as it scaled. Now the limits come from engineers' ability to code new ideas

Two Valuable Real Experiments
- What is a valuable experiment?
  - The absolute value of the delta between expected outcome and actual outcome is large
  - If you thought something was going to win and it wins, you have not learned much
  - If you thought it was going to win and it loses, it's valuable (learning)
  - If you thought it was "meh" and it was a breakthrough, it's HIGHLY valuable. See http://bit.ly/expRulesOfThumb for some examples of breakthroughs
- Both experiments ran at Microsoft's Bing with millions of users in each
- For each experiment, we provide the OEC, the Overall Evaluation Criterion
- Can you guess the winner correctly? Three choices:
  - A wins (the difference is statistically significant)
  - Flat: A and B are approximately the same (no stat-sig difference)
  - B wins

Example: Bing Ads with Site Links
- Should Bing add "site links" to ads, which allow advertisers to offer several destinations on their ads?
- OEC: Revenue, with ads constrained to the same vertical pixels on average
- Pro adding: richer ads, users better informed about where they will land
- Cons:
  - The constraint means on average 4 "A" ads vs. 3 "B" ads
  - Variant B is 5 msec slower (compute + higher page weight)
- Raise your left hand if you think A wins (left)
- Raise your right hand if you think B wins (right)
- Don't raise your hand if you think they are about the same

Bing Ads with Site Links
- If you raised your left hand, you were wrong
- If you did not raise a hand, you were wrong
- Site links generate incremental revenue on the order of tens of millions of dollars annually for Bing
- The above change was costly to implement. We made two small changes to Bing, each of which took days to develop and increased annual revenue by over $100 million

Example: Underlining Links
- Does underlining increase or decrease clickthrough rate?

Example: Underlining Links
- Does underlining increase or decrease clickthrough rate?
- OEC: Clickthrough rate on the search engine result page (SERP) for a query
- A (with underlines) vs. B (no underlines)
- Raise your left hand if you think A wins (left, with underlines)
- Raise your right hand if you think B wins (right, without underlines)
- Don't raise your hand if you think they are about the same

Underlines
- If you raised your right hand, you were wrong
- If you did not raise a hand, you were wrong
- Underlines improve clickthrough rate for both algorithmic results and ads (so more revenue) and improve time to successful click
- Modern web designs do away with underlines, and most sites have adopted this design, despite data showing that users click less and take more time to click
- For search engines (Google, Bing, Yahoo), this is a very questionable industry direction

Pitfall 1: Misinterpreting P-values
- NHST = Null Hypothesis Statistical Testing, the "standard" model commonly used
- P-value <= 0.05 is the "standard" threshold for rejecting the null hypothesis
- The p-value is often misinterpreted. Here are some incorrect statements from Steve Goodman's "A Dirty Dozen":
  1. If P = .05, the null hypothesis has only a 5% chance of being true
  2. A non-significant difference (e.g., P > .05) means there is no difference between groups
  3. P = .05 means that we have observed data that would occur only 5% of the time under the null hypothesis
  4. P = .05 means that if you reject the null hypothesis, the probability of a type I error (false positive) is only 5%
- The problem is that the p-value gives us Prob(X >= x | H_0), whereas what we want is Prob(H_0 | X = x)
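The gap between those two probabilities is just Bayes' rule. The expansion below is standard probability (not from the slides); it shows that the posterior probability of the null also depends on the prior P(H_0) and on the power under the alternative, neither of which the p-value contains:

$$
P(H_0 \mid X \ge x) = \frac{P(X \ge x \mid H_0)\,P(H_0)}{P(X \ge x \mid H_0)\,P(H_0) + P(X \ge x \mid H_1)\,P(H_1)}
$$

A p-value of 0.05 fixes only the first factor in the numerator; the posterior can be far from 5%, as the next pitfall shows with real base rates.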

Pitfall 2: Expecting Breakthroughs
- Breakthroughs are rare after initial optimizations
  - At Bing (well optimized), 80% of ideas fail to show value
  - At other products across Microsoft, about 2/3 of ideas fail
- Take Sessions/User, a key metric at Bing. Historically, it improves 0.02% of the time: that's one in 5,000 treatments we try!
- Most of the time, we invoke Twyman's law: "Any figure that looks interesting or different is usually wrong"
- Note the relationship to the prior pitfall
  - With standard p-value computations, 5% of experiments will show a stat-sig movement in Sessions/User when there is no real movement (i.e., when the null hypothesis is true), half of those positive
  - Roughly 99.6% of the time, a stat-sig movement with p-value = 0.05 will be a false positive (see the calculation after this slide)
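A back-of-the-envelope version of that 99.6% figure (my reconstruction, not from the slides). The base rate of 1 in 5,000 is from the slide; the assumed statistical power of 50% is an illustrative guess:

```python
base_rate = 1 / 5_000      # P(a treatment truly improves Sessions/User), from the slide
alpha_pos = 0.05 / 2       # chance a null experiment shows a stat-sig *positive* move
power = 0.50               # assumed chance of detecting a real improvement (illustrative)

# P(stat-sig positive) mixes false alarms on null treatments with detections of real wins.
p_sig_pos = alpha_pos * (1 - base_rate) + power * base_rate
p_false_given_sig = alpha_pos * (1 - base_rate) / p_sig_pos
print(f"P(no real movement | stat-sig positive) = {p_false_given_sig:.1%}")  # ~99.6%
```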

Pitfall 3: Not Checking for SRM
- SRM = Sample Ratio Mismatch
- If you run an experiment with equal percentages assigned to Control/Treatment (A/B), you should have approximately the same number of users in each
- Real example from an experiment alert I received this week:
  - Control: 821,588 users; Treatment: 815,482 users
  - Ratio: 50.2% (should have been 50%)
- Should I be worried? Absolutely
  - The p-value is 1.8e-6, so the probability of a split this extreme (or more) happening by chance is less than 1 in 500,000 (see the check after this slide)
- Note that the above statement is not a violation of pitfall #1: by the experiment design there should be an equal number of users in control and treatment, so we want the conditional probability P(actual split = 50.2% or more extreme | designed split = 50%)
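A sketch of how such an SRM alert can be computed, using the counts from the slide and a chi-squared goodness-of-fit test against the designed 50/50 split (the alert threshold below is illustrative):

```python
from scipy.stats import chisquare

control, treatment = 821_588, 815_482
total = control + treatment

# Compare observed counts to the expected 50/50 allocation.
stat, p_value = chisquare([control, treatment], f_exp=[total / 2, total / 2])
print(f"SRM p-value = {p_value:.2g}")        # ~1.8e-6 for these counts

if p_value < 1e-3:
    print("Sample Ratio Mismatch: investigate before trusting any metric from this experiment")
```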

Pitfall 4: Wrong Success Metric (OEC)
- Office Online tested a new design for its homepage
- Objective: increase sales of Office products
- The Overall Evaluation Criterion (OEC) was clicks on the Buy button [shown in red boxes in the screenshots]
- Which one was better, Control or Treatment?

Pitfall: Wrong OEC
- Treatment had a drop in the OEC (clicks on Buy) of 64%!
- Not having the price shown in the Control led more people to click to determine the price
- Lesson: measure what you really need to measure: actual sales (even if it is more difficult at times)
- Lesson 2: focus on long-term customer lifetime value
- Peep, in his keynote here, said (he was OK with me mentioning this):
  - What's the goal? More money right now
  - Common pitfall: you want to optimize long-term money, NOT money right now. Raising prices gets you short-term money, but long-term abandonment
- Coming up with a good OEC using short-term metrics is REALLY hard

Example: OEC for Search
- KDD 2012 paper: Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained
- Search engines (Bing, Google) are evaluated on query share (distinct queries) and revenue as long-term goals
- Puzzle:
  - A ranking bug in an experiment resulted in very poor search results
  - Degraded (algorithmic) search results cause users to search more to complete their task, and ads appear more relevant
  - Distinct queries went up over 10%, and revenue went up over 30%
- This problem is now in the book Data Science Interviews Exposed
- What metrics should be in the OEC for a search engine?

Puzzle Explained

Bad OEC Example
- Your data scientist makes an observation: 2% of queries end up with "No results"
- Manager: must reduce. Assigns a team to minimize the "no results" metric
- The metric improves, but the results for the query "brochure paper" are crap (or in this case, paper to clean crap)
- Sometimes it *is* better to show "No results." Real example from my Amazon Prime Now search, 3/26

Pitfall 5: Combining Data when Treatment Percent Varies with Time
- Simplified example: 1,000,000 users per day
- For each individual day, the Treatment is much better
- However, the cumulative result for the Treatment is worse (Simpson's paradox); see the worked numbers after this slide

Conversion rate for two days:

|           | Friday (C/T split: 99/1)   | Saturday (C/T split: 50/50) | Total                        |
|-----------|----------------------------|-----------------------------|------------------------------|
| Control   | 20,000 / 990,000 = 2.02%   | 5,000 / 500,000 = 1.00%     | 25,000 / 1,490,000 = 1.68%   |
| Treatment | 230 / 10,000 = 2.30%       | 6,000 / 500,000 = 1.20%     | 6,230 / 510,000 = 1.22%      |
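A small script (not from the slides) that reproduces the table and shows the paradox: Treatment wins on each day but loses when the days are pooled, because the Control/Treatment split changed from 99/1 to 50/50:

```python
# (conversions, users) per variant per day, taken from the table above.
days = {
    "Friday":   {"Control": (20_000, 990_000), "Treatment": (230, 10_000)},
    "Saturday": {"Control": (5_000, 500_000),  "Treatment": (6_000, 500_000)},
}

totals = {"Control": [0, 0], "Treatment": [0, 0]}
for day, variants in days.items():
    for variant, (conv, users) in variants.items():
        totals[variant][0] += conv
        totals[variant][1] += users
        print(f"{day:9s} {variant:9s} {conv / users:.2%}")

for variant, (conv, users) in totals.items():
    print(f"{'Total':9s} {variant:9s} {conv / users:.2%}")
# Per day, Treatment beats Control (2.30% vs 2.02%, 1.20% vs 1.00%),
# yet pooled it loses (1.22% vs 1.68%): don't naively combine ramp-up days.
```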

Pitfall 6: Get the Stats Right
- Two very good books on A/B testing (A/B Testing by Optimizely founders Dan Siroker and Peter Koomen, and You Should Test That by WiderFunnel's CEO Chris Goward) get the stats wrong (see the Amazon reviews)
- Optimizely recently updated the stats in their product to correct for this
- Best technique to find issues: run A/A tests (a simulated example follows this slide)
  - Like an A/B test, but both variants are exactly the same
  - Are users split according to the planned percentages?
  - Does the data collected match the system of record?
  - Are the results non-significant about 95% of the time?
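A sketch of the last check: simulate many A/A experiments (both variants share the same true conversion rate) and verify that only about 5% come out "significant" at alpha = 0.05. All parameters below are illustrative:

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)
n_experiments, users, true_rate, alpha = 1_000, 100_000, 0.02, 0.05

false_positives = 0
for _ in range(n_experiments):
    conv_a = rng.binomial(users, true_rate)   # both variants draw from the same rate
    conv_b = rng.binomial(users, true_rate)
    table = [[conv_a, users - conv_a],
             [conv_b, users - conv_b]]
    _, p, _, _ = chi2_contingency(table, correction=False)
    false_positives += p < alpha

print(f"Stat-sig A/A results: {false_positives / n_experiments:.1%} (expect ~{alpha:.0%})")
```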

More Pitfalls
- See the KDD paper: Seven Pitfalls to Avoid when Running Controlled Experiments on the Web
- Incorrectly computing confidence intervals for percent change (a delta-method sketch follows this slide)
- Using standard statistical formulas for computations of variance and power
- Neglecting to filter robots/bots (a lucrative business, as shown in a photo I took)
- Instrumentation issues
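One way to avoid the first pitfall, sketched under my own assumptions (independent samples, large n): the variance of the relative difference mean(T)/mean(C) - 1 comes from the delta method, not from simply rescaling the absolute-difference interval:

```python
import numpy as np

def percent_change_ci(treatment, control, z=1.96):
    """Approximate 95% CI for mean(treatment)/mean(control) - 1 via the delta method."""
    t = np.asarray(treatment, dtype=float)
    c = np.asarray(control, dtype=float)
    mt, mc = t.mean(), c.mean()
    var_mt = t.var(ddof=1) / len(t)            # variance of the treatment mean
    var_mc = c.var(ddof=1) / len(c)            # variance of the control mean
    rel = mt / mc - 1.0
    # Delta-method variance of the ratio mt/mc (independent samples assumed).
    var_rel = var_mt / mc**2 + (mt**2) * var_mc / mc**4
    half_width = z * np.sqrt(var_rel)
    return rel, rel - half_width, rel + half_width
```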

The HiPPO
- HiPPO = Highest Paid Person's Opinion
- We made thousands of toy HiPPOs and handed them out at Microsoft to help change the culture
- Grab one here at ConversionXL
- Change the culture at your company
- Fact: hippos kill more humans than any other (non-human) mammal
- Listen to the customers and don't let the HiPPO kill good ideas

Remember this: getting numbers is easy; getting numbers you can trust is hard
- Slides at
- See http://exp-platform.com for papers
- Plane-reading booklets with selected papers are available outside the room