Top 7 Testing Pitfalls
© 2009, Microsoft Corporation
Presented live November 18, 2009. Featuring guest star Ronny Kohavi, GM, Microsoft Experimentation Platform.
Admin note: attendees will also get a copy of these slides plus an on-demand MP3 of this session via email on Thursday afternoon, November 19th.
First: Why Bother Testing?
-> "Best practices", standard Web design templates, and marketers' "gut" often FAIL tests.
-> For previously untested sites, testing gives an average ~40% conversion lift.
-> Tests can help you generate better-quality leads or sales, not just more conversions.
WhichTestWon.com
Agenda
– Intro & controlled experiments in one slide
– Examples: you're the decision maker
– Seven pitfalls
– Q&A
Pitfalls based on the KDD 2009 paper by Thomas Crook, Brian Frasca, Ronny Kohavi, and Roger Longbotham: http://exp-platform.com/ExPpitfalls.aspx
Our Experience at Microsoft
– The Experimentation Platform started at Microsoft in 2006.
– Experiments ran on 20 Microsoft properties, including MSN home pages in several countries, MSN Money, MSN Real Estate, www.microsoft.com, store.microsoft.com, support.microsoft.com, Office Online, www.xbox.com, several marketing sites, and Windows Genuine Advantage.
– Large experiments run with tens of millions of users.
– Multiple experiments have projected annual improvements of over $1M each.
Controlled Experiments in One Slide
The concept is trivial:
– Randomly split traffic between two (or more) versions: A (Control) and B (Treatment)
– Collect metrics of interest
– Analyze
This is the best scientific way to prove causality, i.e., that the changes in the metrics are caused by the changes introduced in the treatment(s).
Must run statistical tests to confirm the differences are not due to chance.
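To make the "analyze" step concrete, here is a minimal sketch in Python of a significance test for a conversion-rate metric. The two-proportion z-test and the counts below are illustrative assumptions, not the platform's actual analysis pipeline.

```python
# Minimal sketch of the "analyze" step for a conversion-rate metric.
# Counts are hypothetical; the z-test is one common choice of test.
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return the z statistic and two-sided p-value for B vs. A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)              # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * norm.sf(abs(z))
    return z, p_value

# Hypothetical example: 50/50 split, 100,000 users per variant
z, p = two_proportion_z_test(conv_a=3000, n_a=100_000, conv_b=3150, n_b=100_000)
print(f"z = {z:.2f}, p = {p:.4f}")   # call a difference only if p is small
```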
Examples
Three experiments that ran at Microsoft. All had enough users for statistical validity.
OEC: the Overall Evaluation Criterion.
See how many you get right. The three choices are:
– A wins (the difference is statistically significant)
– A and B are approximately the same (no statistically significant difference)
– B wins
Office Online
Test a new design for the Office Online homepage: A (Control) vs. B (Treatment). (Screenshots of both designs not shown.)
OEC: clicks on revenue-generating links (highlighted in red on the original slide).
Is A better, B better, or are they about the same?
Office Online
B was 64% worse.
The Office Online team wrote: "A/B testing is a fundamental and critical Web services… consistent use of A/B testing could save the company millions of dollars."
MSN UK Hotmail Experiment
The Hotmail module on the MSN UK home page.
MSN UK Hotmail Experiment
– A: when the user clicks on email, Hotmail opens in the same window.
– B: Hotmail opens in a separate window.
– Trigger: only users who click in the module are in the experiment (there is no difference otherwise).
– OEC: clicks on the home page (after the trigger).
Is A better, B better, or are they about the same?
UK Hotmail
For those in the experiment, clicks on the MSN home page increased by 8.9%.
Fewer than 0.001% of users in B wrote negative feedback about the new window.
Data Trumps Intuition
We distribute experiment reports widely at Microsoft. Someone who saw the report wrote:
"This report came along at a really good time and was VERY useful. I argued this point to my team (open Live services in new window from HP) just some days ago. They all turned me down. Funny, now they have all changed their minds."
MSN Home Page Search Box
OEC: clickthrough rate for the search box and popular searches.
Differences between the two designs (screenshots not shown): A has a taller search box (overall size is the same), a magnifying-glass icon, and "popular searches"; B has a big search button.
Is A better, B better, or are they about the same?
Search Box
No statistically significant difference.
Insight: stop debating, it's easier to get the data.
Hard to Assess the Value of Ideas: Data Trumps Intuition
– At Amazon, half of the experiments failed to show improvement.
– QualPro tested 150,000 ideas over 22 years: 75 percent of important business decisions and business-improvement ideas either have no impact on performance or actually hurt performance.
– Based on experiments with ExP at Microsoft:
  – 1/3 of ideas were positive and statistically significant
  – 1/3 of ideas were flat: no statistically significant difference
  – 1/3 of ideas were negative and statistically significant
Our intuition is poor: 2/3 of ideas do not improve the metric(s) they were designed to improve. Humbling!
The HiPPO
– Our opinions are often wrong; get the data.
– HiPPO stands for the Highest Paid Person's Opinion.
– Hippos kill more humans than any other (non-human) mammal (really).
– Don't let HiPPOs in your org kill innovative ideas. ExPeriment!
– We give out these toy HiPPOs at Microsoft.
The less data, the stronger the opinions.
Is Software Just Hard?
No! Doctors have been taking the HiPPocratic Oath and promising "no harm," yet many beliefs were wrong for hundreds of years.
– For centuries, an illness was thought to be a toxin. Opening a vein and letting the sickness run out was considered the best solution: bloodletting.
– One British medical text recommended bloodletting for acne, asthma, cancer, cholera, coma, convulsions, diabetes, epilepsy, gangrene, gout, herpes, indigestion, insanity, jaundice, leprosy, ophthalmia, plague, pneumonia, scurvy, smallpox, stroke, tetanus, tuberculosis, and some one hundred other diseases.
– Physicians often reported the simultaneous use of fifty or more leeches on a given patient. Through the 1830s the French imported about forty million leeches a year for medical purposes.
Bloodletting (2 of 2)
– President George Washington had a sore throat, and doctors extracted 82 ounces of blood over 10 hours (35% of his total blood), causing anemia and hypotension. He died that night.
– Pierre Louis ran an experiment in 1836 that is now recognized as one of the first clinical trials, or randomized controlled experiments. He treated people with pneumonia with either early, aggressive bloodletting or less aggressive measures.
– At the end of the experiment, Dr. Louis counted the bodies. They were stacked higher over by the bloodletting sink.
[Image: a lancet]
Agenda
– Intro & controlled experiments in one slide
– Examples: you're the decision maker
– Seven pitfalls
– Q&A
Pitfall 1: Wrong Success Metric
Remember this example? (The Office Online designs A and B; screenshots not shown.)
OEC: clicks on revenue-generating links (highlighted in red on the original slide).
Pitfall 1: Wrong OEC
– B had a 64% drop in the OEC. Were sales correspondingly lower as well? No.
– The experiment is valid only if the conversion rate from a click to a purchase is similar in both variants.
– The price was shown only in B, sending more qualified purchasers into the pipeline.
Lesson: measure what you really need to measure, even if it's difficult!
Pitfall 2: Incorrect Interval Calculation
– Confidence intervals (CIs) are a great way to summarize results that have variability.
– Example: the 95% CI for the conversion rate might be 2.8%-3.2% (mean of 3.0% +/- 0.2%), improved from 1.8%-2.2%.
– Business users prefer the percent effect: going from 2% to 3% is a 50% improvement in conversion!
– How can we provide a confidence interval on that 50%?
Pitfall 2: Incorrect Interval Calculation (cont.)
– You can't simply convert the confidence interval to a percent effect, because the denominator is a random variable (we have a ratio of means).
– Use Fieller's formula for an exact percent-effect interval.
– It is a more complex formula, but that's why we have computers (and statisticians, who figured this out in 1954).
– Note: the confidence interval is not always symmetric around the mean in this case.
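A sketch of what that computation can look like in Python. The function and the example numbers are hypothetical; it implements the Fieller interval for the ratio of two independent means and reports the result as a percent effect.

```python
# Sketch of Fieller's interval for the percent effect 100 * (mean_t / mean_c - 1),
# assuming the treatment and control estimates are independent. Numbers are made up.
from math import sqrt
from scipy.stats import norm

def fieller_percent_effect(mean_t, se_t, mean_c, se_c, alpha=0.05):
    """Confidence interval for the percent effect of treatment over control."""
    z = norm.ppf(1 - alpha / 2)
    denom = mean_c**2 - (z * se_c) ** 2
    if denom <= 0:
        # Control mean not significantly different from zero: no finite interval.
        raise ValueError("control mean too noisy for a finite Fieller interval")
    half = z * sqrt(se_t**2 * mean_c**2 + se_c**2 * mean_t**2 - (z * se_t * se_c) ** 2)
    lo = (mean_t * mean_c - half) / denom - 1
    hi = (mean_t * mean_c + half) / denom - 1
    return 100 * lo, 100 * hi

# Hypothetical example: conversion 3.0% (SE 0.1%) vs. 2.0% (SE 0.1%)
print(fieller_percent_effect(0.030, 0.001, 0.020, 0.001))
# ~ (33.6, 69.3): note the interval is not symmetric around the 50% point estimate.
```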
Pitfall 3: Using Standard Formulas for the Standard Deviation
– Many metrics for online experiments cannot use the standard statistical formulas.
– Example: click-through rate = clicks / page views.
– The standard statistical approach would treat this as approximately Bernoulli.
– However, the true standard deviation is commonly larger than the Bernoulli formula suggests, because the independence assumption is violated (clicks and page views from the same user are correlated).
– Solution: the bootstrap or the delta method.
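A sketch of the delta method for such a ratio metric, with the user as the randomization unit. The data below is synthetic and the function name is illustrative; it is not the platform's actual implementation.

```python
# Delta-method standard error for a ratio metric such as click-through rate
# (total clicks / total page views), aggregated per user.
import numpy as np

def ratio_metric_se(clicks_per_user, views_per_user):
    """Return (ratio, standard error) for sum(clicks) / sum(views)."""
    x = np.asarray(clicks_per_user, dtype=float)
    y = np.asarray(views_per_user, dtype=float)
    n = len(x)
    mx, my = x.mean(), y.mean()
    var_x, var_y = x.var(ddof=1) / n, y.var(ddof=1) / n
    cov_xy = np.cov(x, y, ddof=1)[0, 1] / n          # user-level correlation matters here
    ratio = mx / my
    var_ratio = ratio**2 * (var_x / mx**2 + var_y / my**2 - 2 * cov_xy / (mx * my))
    return ratio, np.sqrt(var_ratio)

# Synthetic per-user data, just to make the sketch runnable
rng = np.random.default_rng(0)
views = rng.poisson(5, size=10_000) + 1
clicks = rng.binomial(views, 0.1)
print(ratio_metric_se(clicks, views))
```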
Best Practice: Ramp-up
– Start an experiment at 0.1% of traffic.
– Do simple analyses to make sure no egregious problems can be detected.
– Ramp up to a larger percentage, and repeat until you reach the desired percentage (e.g., 50%).
Big differences are easy to detect because the minimum sample size is quadratic in the effect we want to detect:
– Detecting a 10% difference requires a small sample, so serious problems can be detected during ramp-up.
– Detecting a 0.1% difference requires a population 100^2 = 10,000 times bigger.
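As a rough illustration of that quadratic relationship, here is a back-of-the-envelope sample-size calculation using the common rule of thumb n ≈ 16·σ²/Δ² (about 80% power at α = 0.05). The baseline conversion rate is made up.

```python
# Rough users-per-variant estimate; shows the 100^2 = 10,000x blow-up when the
# detectable relative effect shrinks from 10% to 0.1%.
def users_per_variant(baseline_rate, relative_effect):
    sigma_sq = baseline_rate * (1 - baseline_rate)   # Bernoulli variance
    delta = baseline_rate * relative_effect          # absolute effect to detect
    return 16 * sigma_sq / delta**2

for effect in (0.10, 0.01, 0.001):
    print(f"{effect:>6.1%} lift: {users_per_variant(0.05, effect):,.0f} users per variant")
```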
Pitfall 4: Combining Data when the Percent to Treatment Varies
Simplified example, with 1,000,000 users per day:
– For each individual day the Treatment is much better.
– However, the cumulative result for the Treatment is worse.
– This is called Simpson's Paradox.
Conversion rate for two days:
             Friday (C/T split 99/1)     Saturday (C/T split 50/50)   Total
Control      20,000/990,000 = 2.02%      5,000/500,000 = 1.00%        25,000/1,490,000 = 1.68%
Treatment    230/10,000 = 2.30%          6,000/500,000 = 1.20%        6,230/510,000 = 1.22%
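A tiny Python check of the numbers in the table, showing how pooling the raw counts flips the conclusion when the traffic split changes between days.

```python
# Simpson's paradox: treatment wins on each day, but the naive pooled rate flips
# because the control/treatment split changed between days. Counts are from the table.
days = {
    "Friday":   {"control": (20_000, 990_000), "treatment": (230, 10_000)},
    "Saturday": {"control": (5_000, 500_000),  "treatment": (6_000, 500_000)},
}

def rate(conversions, users):
    return conversions / users

for day, arms in days.items():
    c, t = rate(*arms["control"]), rate(*arms["treatment"])
    print(f"{day}: control {c:.2%}, treatment {t:.2%}")   # treatment higher on both days

pooled = {arm: rate(sum(d[arm][0] for d in days.values()),
                    sum(d[arm][1] for d in days.values()))
          for arm in ("control", "treatment")}
print(f"Pooled: control {pooled['control']:.2%}, treatment {pooled['treatment']:.2%}")
# Pooled result flips: control 1.68% vs. treatment 1.22%
```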
Pitfall 5: Not Filtering out Robots
– Internet sites can get a significant amount of robot traffic (search engine crawlers, email harvesters, botnets, etc.).
– Robots can cause misleading results.
– Most concerning are robots with high traffic (e.g., clicks or page views) that stay in Treatment or Control.
– We've seen one robot with more than 600,000 clicks in a month on one page (and it was executing JavaScript).
Pitfall 5: Not Filtering out Robots (cont.)
Identifying robots can be difficult:
– Some robots identify themselves through the UserAgent.
– Many look like human users and execute JavaScript.
– Use heuristics to identify and remove robots from the analysis (e.g., more than 100 clicks in an hour); a sketch of such a filter follows.
– This is ongoing research; there is no silver bullet.
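A sketch of the kind of click-volume heuristic mentioned above. The threshold and the event format are illustrative assumptions, not the platform's actual robot rules.

```python
# Heuristic robot filter: drop "users" whose clicks in any single hour exceed a limit.
from collections import Counter

CLICKS_PER_HOUR_LIMIT = 100   # illustrative threshold

def suspected_robots(click_events):
    """click_events: iterable of (user_id, hour_bucket) tuples, one per click."""
    per_user_hour = Counter(click_events)
    return {user for (user, hour), clicks in per_user_hour.items()
            if clicks > CLICKS_PER_HOUR_LIMIT}

def filter_robots(click_events):
    """Return (events with suspected robots removed, set of suspected robot ids)."""
    events = list(click_events)
    robots = suspected_robots(events)
    kept = [(user, hour) for user, hour in events if user not in robots]
    return kept, robots
```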
Effect of Robots on an A/A Experiment
(Chart not shown.) Each hour represents clicks from thousands of users; the "spikes" can be traced to single "users" (robots).
Pitfall 6: Invalid or Inadequate Instrumentation
Validating the initial instrumentation:
– Logging audit: compare experimentation observations with the recording system of record.
– A/A experiment: run a "mock" experiment where users are randomly assigned to two groups but get the Control in both.
  – Expect about 5% of metrics to be statistically significant.
  – P-values should be uniformly distributed on the interval (0, 1), and no p-values should be very close to zero (e.g., < 0.001).
– Many of our "customers" initially fail one of these tests.
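A sketch of those two A/A sanity checks on simulated data. In practice the p-values would come from the metrics of a real A/A experiment rather than the synthetic draws used here.

```python
# A/A sanity checks: roughly 5% of tests significant at 0.05, p-values roughly
# uniform on (0, 1), and nothing suspiciously close to zero.
import numpy as np
from scipy.stats import ttest_ind, kstest

rng = np.random.default_rng(1)
p_values = []
for _ in range(1000):                        # stand-in for many metrics / A/A splits
    a = rng.normal(0.0, 1.0, size=5_000)     # both groups receive the Control experience
    b = rng.normal(0.0, 1.0, size=5_000)
    p_values.append(ttest_ind(a, b).pvalue)

p_values = np.array(p_values)
print("fraction significant at 0.05:", (p_values < 0.05).mean())    # expect ~0.05
print("uniformity (KS test) p-value:", kstest(p_values, "uniform").pvalue)
print("p-values below 0.001:", (p_values < 0.001).sum())
```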
Pitfall 7: Insufficient Experimental Control
– You must make sure the only difference between Treatment and Control is the change being tested.
– (Plot not shown.) The plot shows the hourly click-through rate for Control and Treatment on the MSN home page.
– The headlines were supposed to be the same in both.
– One headline was different for one 7-hour period, significantly changing the result.
Summary
1. It is hard to assess the value of ideas.
– Get the data by experimenting, because data trumps intuition.
– The examples are humbling.
– Avinash Kaushik wrote: "…the power of: Controlled Experiments. I am convinced this is God's gift to online humanity."
2. Replace the HiPPO with an OEC.
– Make sure the org agrees on what you are optimizing (long-term lifetime value).
– Experts are often wrong. Doctors did bloodletting for centuries (and they swear by the HiPPOcratic oath).
3. Watch out for the pitfalls.
Resources for a Deeper Dive
– "Controlled Experiments on the Web: Survey and Practical Guide," Data Mining and Knowledge Discovery journal, 2009: http://exp-platform.com/hippo_long.aspx
– KDD 2009 tutorial: http://exp-platform.com/tutorial.aspx
– Contact: ronnyk@ microsoft dot you know what
Live Q&A with Anne, Ronny, and Roger
WhichTestWon.com
Thanks, plus 2 free offers:
– Online Testing Awards: free entries, everyone eligible, deadline this Friday! http://whichtestwon.com/awards
– Free Landing Page Evaluation Offer: click to schedule at http://whichtestwon.com/widerfunnel/lp.html