
1 A/B Testing Pitfalls
Ronny Kohavi, Distinguished Engineer, General Manager, Analysis and Experimentation, Microsoft
Joint work with Thomas Crook, Brian Frasca, and Roger Longbotham, A&E Team
Slides at http://bit.ly/ABPitfalls

2 A/B Tests in One Slide
- Concept is trivial
- Randomly split traffic between two (or more) versions
  o A (Control)
  o B (Treatment)
- Collect metrics of interest
- Analyze
- The A/B test is the simplest controlled experiment
- A/B/n refers to multiple treatments (often used and encouraged: try control plus two or three treatments)
- MVT refers to multivariable designs (rarely used by our teams)
- Must run statistical tests to confirm differences are not due to chance (see the sketch below)
- Best scientific way to prove causality, i.e., that the changes in metrics are caused by the changes introduced in the treatment(s)
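
The statistical-testing bullet above can be made concrete with a minimal sketch; this is an illustration only (the counts are made-up values, and this is not the ExP platform's code): a two-proportion z-test comparing Control and Treatment conversion rates.

```python
# Minimal sketch of a two-proportion z-test for an A/B test.
# The user and conversion counts below are made-up illustration values.
from math import sqrt, erfc

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (delta, two-sided p-value) for H0: equal conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)             # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    return p_b - p_a, erfc(abs(z) / sqrt(2))             # erfc(|z|/sqrt(2)) = 2*(1 - Phi(|z|))

delta, p = two_proportion_z_test(conv_a=10_000, n_a=500_000,
                                 conv_b=10_450, n_b=500_000)
print(f"delta = {delta:+.5f}, p-value = {p:.4f}")        # stat-sig if p <= 0.05
```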

3 ConversionXL Audience Statistics
- 83% of attendees ran fewer than 30 experiments last year.
- Experimenters at Microsoft use our ExP platform to start ~30 experiments per day.

4 Experimentation at Scale
- I've been fortunate to work at an organization that values being data-driven (video)
- We finish about 300 experiment treatments per week, mostly on Bing and MSN, but also on Office, OneNote, Xbox, Cortana, Skype, Exchange, and OneDrive. (These are "real," useful treatments, not a 3x10x10 MVT = 300.)
- Each variant is exposed to between 100K and millions of users, sometimes tens of millions
- At Bing, 90% of eligible users are in experiments (10% are a global holdout changed once a year)
- There is no single Bing. Since a user is exposed to over 15 concurrent experiments, they get one of 5^15 ≈ 30 billion variants (debugging takes on a new meaning).
- Until 2014, the system limited usage as it scaled. Now the limits come from engineers' ability to code new ideas

5 Two Valuable Real Experiments
- What is a valuable experiment? One where the absolute value of the delta between the expected outcome and the actual outcome is large
  o If you thought something was going to win and it wins, you have not learned much
  o If you thought it was going to win and it loses, it's valuable (learning)
  o If you thought it was "meh" and it was a breakthrough, it's HIGHLY valuable. See http://bit.ly/expRulesOfThumb for some examples of breakthroughs
- These experiments ran at Microsoft's Bing with millions of users in each
- For each experiment, we provide the OEC, the Overall Evaluation Criterion
- Can you guess the winner correctly? The three choices are:
  o A wins (the difference is statistically significant)
  o Flat: A and B are approximately the same (no stat-sig difference)
  o B wins

6 Example: Bing Ads with Site Links
- Should Bing add "site links" to ads, which allow advertisers to offer several destinations on an ad?
- OEC: Revenue, with ads constrained to the same vertical pixels on average
- Pros of adding: richer ads; users are better informed about where they will land
- Cons: the constraint means on average 4 "A" ads vs. 3 "B" ads; Variant B is 5 msec slower (compute + higher page weight)
[Screenshots of variants A and B]
Raise your left hand if you think A wins (left)
Raise your right hand if you think B wins (right)
Don't raise your hand if they are about the same

7 Bing Ads with Site Links
- If you raised your left hand, you were wrong
- If you did not raise a hand, you were wrong
- Site links generate incremental revenue on the order of tens of millions of dollars annually for Bing
- The above change was costly to implement. By contrast, we made two small changes to Bing, each taking only days to develop, that each increased annual revenue by over $100 million

8 Example: Underlining Links
- Does underlining increase or decrease clickthrough rate?

9 Example: Underlining Links
- Does underlining increase or decrease clickthrough rate?
- OEC: Clickthrough rate on the search engine result page (SERP) for a query
[Screenshots: A (with underlines) and B (no underlines)]
Raise your left hand if you think A wins (left, with underlines)
Raise your right hand if you think B wins (right, without underlines)
Don't raise your hand if they are about the same

10 Underlines
- If you raised your right hand, you were wrong
- If you did not raise a hand, you were wrong
- Underlines improve clickthrough rate for both algorithmic results and ads (so more revenue) and improve time to successful click
- Modern web designs do away with underlines, and most sites have adopted this design, despite data showing that users click less and take more time to click
- For search engines (Google, Bing, Yahoo), this is a very questionable industry direction

11 Pitfall 1: Misinterpreting P-values
- NHST = Null Hypothesis Statistical Testing, the "standard" model commonly used
- P-value <= 0.05 is the "standard" threshold for rejecting the null hypothesis
- The p-value is often misinterpreted. Here are some incorrect statements from Steve Goodman's A Dirty Dozen:
  1. If P = .05, the null hypothesis has only a 5% chance of being true
  2. A non-significant difference (e.g., P > .05) means there is no difference between groups
  3. P = .05 means that we have observed data that would occur only 5% of the time under the null hypothesis
  4. P = .05 means that if you reject the null hypothesis, the probability of a type I error (false positive) is only 5%
- The problem is that the p-value gives us Prob(X >= x | H_0), whereas what we want is Prob(H_0 | X = x)

12 Pitfall 2: Expecting Breakthroughs
- Breakthroughs are rare after initial optimizations
  o At Bing (well optimized), 80% of ideas fail to show value
  o At other products across Microsoft, about 2/3 of ideas fail
- Take Sessions/User, a key metric at Bing. Historically, it improves 0.02% of the time: that's one in 5,000 treatments we try!
- Most of the time, we invoke Twyman's law (http://bit.ly/twymanLaw): any figure that looks interesting or different is usually wrong
- Note the relationship to the prior pitfall
  o With standard p-value computations, 5% of experiments will show a stat-sig movement in Sessions/User when there is no real movement (i.e., when the null hypothesis is true), half of those positive
  o About 99.6% of the time, a stat-sig movement in Sessions/User with p-value = 0.05 will be a false positive (a back-of-the-envelope derivation follows below)
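
A hedged back-of-the-envelope reconstruction of the 99.6% figure via Bayes' rule, using the prior from this slide (a real Sessions/User improvement in roughly 1 of 5,000 treatments) and assuming, for simplicity, alpha = 0.05 and near-perfect power; the exact assumptions behind the slide's number may differ slightly.

```python
# Sketch: P(false positive | stat-sig Sessions/User movement) via Bayes' rule.
# Assumptions for illustration: prior = 0.0002 (1 in 5,000 treatments truly
# improves the metric), alpha = 0.05, power ~= 1.0 (every real move is detected).
prior_true = 0.0002
alpha = 0.05
power = 1.0

p_stat_sig = alpha * (1 - prior_true) + power * prior_true
p_false_given_sig = alpha * (1 - prior_true) / p_stat_sig
print(f"P(false positive | stat-sig) = {p_false_given_sig:.1%}")   # ~99.6%
```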

13 Pitfall 3: Not Checking for SRM
- SRM = Sample Ratio Mismatch
- If you run an experiment with equal percentages assigned to Control/Treatment (A/B), you should have approximately the same number of users in each
- Real example from an experiment alert I received this week:
  o Control: 821,588 users; Treatment: 815,482 users
  o Ratio: 50.2% (should have been 50%)
- Should I be worried? Absolutely
  o The p-value is 1.8e-6, so the probability of this split (or a more extreme one) happening by chance is less than 1 in 500,000 (a sketch of the computation follows below)
- Note that the above statement does not fall into pitfall #1: by the experiment design, there should be an equal number of users in control and treatment, so the conditional probability we want is P(actual split = 50.2% | designed split = 50%)
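
A minimal sketch of the SRM check behind that p-value; this uses a normal approximation to the binomial (not necessarily the exact test the alerting system uses), and it reproduces the 1.8e-6 figure from the counts on this slide.

```python
# Sketch of a Sample Ratio Mismatch (SRM) check: two-sided p-value for the
# observed Control/Treatment split under a designed 50/50 assignment,
# using a normal approximation to the binomial.
from math import sqrt, erfc

def srm_p_value(n_control, n_treatment, expected_ratio=0.5):
    n = n_control + n_treatment
    expected = n * expected_ratio
    se = sqrt(n * expected_ratio * (1 - expected_ratio))
    z = (n_control - expected) / se
    return erfc(abs(z) / sqrt(2))

p = srm_p_value(821_588, 815_482)   # counts from the alert on this slide
print(f"SRM p-value = {p:.1e}")     # ~1.8e-06: investigate before trusting results
```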

14 Pitfall 4: Wrong Success Metric (OEC)
- Office Online tested a new design for its homepage
- Objective: increase sales of Office products
- The Overall Evaluation Criterion (OEC) was clicks on the Buy button [shown in red boxes on the slide]
Which one was better? [Control and Treatment screenshots]

15 Pitfall: Wrong OEC
- The Treatment had a 64% drop in the OEC (clicks on Buy)!
- Not having the price shown in the Control led more people to click just to determine the price
- Lesson: measure what you really need to measure: actual sales (it is more difficult at times)
- Lesson 2: focus on long-term customer lifetime value
- Peep, in his keynote here, said (he was OK with me mentioning this): What's the goal? More money right now
- Common pitfall: you want to optimize long-term money, NOT money right now. Raising prices gets you short-term money, but long-term abandonment
- Coming up with a good OEC using short-term metrics is REALLY hard

16 Example: OEC for Search
- KDD 2012 paper: Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained
- Search engines (Bing, Google) are evaluated on query share (distinct queries) and revenue as long-term goals
- Puzzle:
  o A ranking bug in an experiment resulted in very poor search results
  o Degraded (algorithmic) search results cause users to search more to complete their task, and ads appear more relevant
  o Distinct queries went up over 10%, and revenue went up over 30%
- This problem is now in the book Data Science Interviews Exposed
- What metrics should be in the OEC for a search engine?

17 Puzzle Explained

18 Bad OEC Example
- Your data scientist makes an observation: 2% of queries end up with "No results."
- Manager: must reduce. Assigns a team to minimize the "no results" metric
- The metric improves, but the results for the query "brochure paper" are crap (or in this case, paper to clean crap)
- Sometimes it *is* better to show "No Results." Real example from my Amazon Prime Now search on 3/26/2016: https://twitter.com/ronnyk/status/713949552823263234

19 Pitfall 5: Combining Data when Treatment Percent Varies with Time
- Simplified example: 1,000,000 users per day
- For each individual day, the Treatment is much better
- However, the cumulative result for the Treatment is worse (Simpson's paradox); see the table and the sketch below

Conversion rate for two days:

|           | Friday (C/T split: 99/1)   | Saturday (C/T split: 50/50) | Total                        |
| Control   | 20,000 / 990,000 = 2.02%   | 5,000 / 500,000 = 1.00%     | 25,000 / 1,490,000 = 1.68%   |
| Treatment | 230 / 10,000 = 2.30%       | 6,000 / 500,000 = 1.20%     | 6,230 / 510,000 = 1.22%      |
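
A small sketch that reproduces the Simpson's paradox in the table above: the Treatment wins on each day, yet loses on the pooled totals, because its traffic percentage changed between the days.

```python
# Sketch reproducing the slide's Simpson's-paradox numbers.
# (conversions, users) per day, taken from the table above.
days = {
    "Friday":   {"Control": (20_000, 990_000), "Treatment": (230, 10_000)},
    "Saturday": {"Control": (5_000, 500_000),  "Treatment": (6_000, 500_000)},
}

totals = {"Control": [0, 0], "Treatment": [0, 0]}
for day, arms in days.items():
    for arm, (conversions, users) in arms.items():
        totals[arm][0] += conversions
        totals[arm][1] += users
        print(f"{day:9s} {arm:9s}: {conversions / users:.2%}")

for arm, (conversions, users) in totals.items():
    print(f"Total     {arm:9s}: {conversions / users:.2%}")  # Control 1.68% > Treatment 1.22%
```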

20 Pitfall 6: Get the Stats Right
- Two very good books on A/B testing (A/B Testing by Optimizely founders Dan Siroker and Peter Koomen, and You Should Test That by WiderFunnel's CEO Chris Goward) get the stats wrong (see the Amazon reviews)
- Optimizely recently updated the stats in their product to correct for this
- Best technique to find issues: run A/A tests
  o Like an A/B test, but both variants are exactly the same
  o Are users split according to the planned percentages?
  o Does the data collected match the system of record?
  o Are the results non-significant 95% of the time? (a simulation sketch follows below)
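
A simulation sketch of that last A/A check (assuming numpy is available; the conversion rate and traffic numbers are made up): both variants draw from the same true rate, so roughly 5% of runs should come out stat-sig at alpha = 0.05. A platform that reports stat-sig A/A differences much more often than that has a stats problem.

```python
# Sketch of an A/A sanity check: simulate many A/A "experiments" where both
# variants share the same true conversion rate, and count stat-sig results.
import numpy as np
from math import sqrt, erfc

rng = np.random.default_rng(42)
n_runs, n_users, rate, alpha = 2_000, 100_000, 0.02, 0.05

conv_a = rng.binomial(n_users, rate, n_runs)
conv_b = rng.binomial(n_users, rate, n_runs)
p_pool = (conv_a + conv_b) / (2 * n_users)
se = np.sqrt(p_pool * (1 - p_pool) * 2 / n_users)
z = (conv_b - conv_a) / (n_users * se)
p_values = np.array([erfc(abs(v) / sqrt(2)) for v in z])

print(f"stat-sig A/A rate: {(p_values < alpha).mean():.1%}  (expect about 5%)")
```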

21 More Pitfalls
- See the KDD paper: Seven Pitfalls to Avoid when Running Controlled Experiments on the Web (http://bit.ly/expPitfalls)
- Incorrectly computing confidence intervals for percent change (a hedged sketch of one correct approach follows below)
- Using standard statistical formulas for computations of variance and power
- Neglecting to filter robots/bots (a lucrative business, as shown in the photo on the slide)
- Instrumentation issues
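
For the first bullet, here is a hedged sketch of one correct way to compute a confidence interval for percent change, using a delta-method approximation for the ratio of two independent means; this is not necessarily the paper's exact formulation, and the numbers below are made up for illustration.

```python
# Sketch: delta-method confidence interval for percent change mean_B / mean_A - 1.
# var_mean_* are variances of the sample *means* (sample variance / n).
# Illustration values are made up.
from math import sqrt

def pct_change_ci(mean_a, var_mean_a, mean_b, var_mean_b, z=1.96):
    ratio = mean_b / mean_a
    # Delta-method variance of the ratio of two independent means.
    var_ratio = ratio**2 * (var_mean_b / mean_b**2 + var_mean_a / mean_a**2)
    half_width = z * sqrt(var_ratio)
    return ratio - 1 - half_width, ratio - 1 + half_width

low, high = pct_change_ci(mean_a=2.10, var_mean_a=0.0004,
                          mean_b=2.17, var_mean_b=0.0004)
print(f"percent change: {(2.17 / 2.10 - 1):.2%}, 95% CI: [{low:.2%}, {high:.2%}]")
```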

22 The HiPPO
- HiPPO = Highest Paid Person's Opinion
- We made thousands of toy HiPPOs and handed them out at Microsoft to help change the culture
- Grab one here at ConversionXL
- Change the culture at your company
- Fact: hippos kill more humans than any other (non-human) mammal
- Listen to the customers and don't let the HiPPO kill good ideas

23 Remember this: Getting numbers is easy; getting numbers you can trust is hard
- Slides at http://bit.ly/ABPitfalls
- See http://exp-platform.com for papers. Plane-reading booklets with selected papers are available outside the room

