
1 Seven Pitfalls to Avoid when Running Controlled Experiments on the Web
Roger Longbotham, Ronny Kohavi, Thomas Crook, Brian Frasca
Microsoft Corporation, Experimentation Platform

2 Controlled Experiments in One Slide
The concept is trivial:
- Randomly split traffic between two (or more) versions: A/Control and B/Treatment
- Collect metrics of interest
- Analyze
This is the best scientific way to prove causality, i.e., that the changes in metrics are caused by the changes introduced in the treatment. You must run statistical tests to confirm the differences are not due to chance, as in the sketch below.
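As a minimal sketch of that analysis step (simulated data; the metric, sample sizes, and numbers are illustrative, not from the slides):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical per-user metric (e.g., clicks per user) for each variant.
control = rng.normal(loc=10.0, scale=3.0, size=5000)    # A/Control
treatment = rng.normal(loc=10.2, scale=3.0, size=5000)  # B/Treatment

# Welch's t-test: is the observed difference larger than chance variation?
result = stats.ttest_ind(treatment, control, equal_var=False)
print(f"effect = {treatment.mean() - control.mean():+.3f}, p = {result.pvalue:.4f}")
```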

3 Pitfall 1: Wrong Success Metric
Office Online tested a new design for its homepage. Objective: increase sales of Office products. The Overall Evaluation Criterion (OEC) was clicks on the Buy button. Which one was better? [Screenshots of the Treatment and Control homepages]

4 Pitfall 1: Wrong OEC
The Treatment had a 64% drop in the OEC! Were sales for the Treatment correspondingly lower as well? Our interpretation is that not having the price shown in the Control led more people to click to determine the price. It's possible the Treatment group ended up with the same number of sales; that data was not available. Lesson: measure what you really need to measure, even if it's difficult!

5 Pitfall 2: Incorrect Interval Calculation
Confidence intervals are a great way to summarize statistical results. Example: the CI for a single mean, assuming a Normal distribution, is X̄ ± t · s/√n. Some cases are not so straightforward, e.g. the CI for percent effect (use Fieller's formula). Example for the difference in two means: Effect = X̄_T − X̄_C = 0.62, and the 95% CI for the difference is (0, 1.24). The percent effect is an increase of 62%, and the 95% CI for the percent effect is (0%, 201%) => not symmetric about 62%.
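To see how Fieller's formula produces such an asymmetric interval, here is a minimal Python sketch; the function name and the independence/normality assumptions are mine, not from the slides:

```python
import math

def fieller_ratio_ci(mean_t, se_t, mean_c, se_c, z=1.96):
    """Approximate 95% CI for the ratio mean_t / mean_c via Fieller's theorem.

    Assumes the two means are independent and approximately normal.
    """
    # Solve (mean_t - rho*mean_c)^2 <= z^2 * (se_t^2 + rho^2 * se_c^2) for rho.
    a = mean_c**2 - (z * se_c)**2
    b = -2.0 * mean_t * mean_c
    c = mean_t**2 - (z * se_t)**2
    disc = b * b - 4.0 * a * c
    if a <= 0 or disc < 0:
        # Control mean not distinguishable from zero: the interval is unbounded.
        raise ValueError("Fieller interval is unbounded for these inputs")
    lo = (-b - math.sqrt(disc)) / (2.0 * a)
    hi = (-b + math.sqrt(disc)) / (2.0 * a)
    return lo, hi
```

Subtracting 1 from the returned bounds gives the percent-effect interval, which is generally not symmetric about the point estimate.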

6 Pitfall 3: Using Standard Formulas for Standard Deviation
Most metrics for online experiments cannot use the standard statistical formulas. Example: click-through rate (CTR). The standard statistical approach would assume clicks are approximately Bernoulli; however, because repeated page views from the same user are correlated, the true standard deviation can be much larger than that, depending on the site.
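One standard remedy (a sketch under my own assumptions; the slides do not spell out the formula) is to compute the variance at the level of the randomization unit, the user, using the delta method for the ratio of per-user means:

```python
import math
import numpy as np

def ctr_se_naive(clicks, pageviews):
    """Bernoulli SE, treating every page view as an independent trial (the pitfall)."""
    p = clicks.sum() / pageviews.sum()
    return math.sqrt(p * (1.0 - p) / pageviews.sum())

def ctr_se_delta(clicks, pageviews):
    """Delta-method SE for CTR = sum(clicks)/sum(pageviews), per-user arrays."""
    n = len(clicks)
    x, y = clicks.mean(), pageviews.mean()  # per-user means
    var_x, var_y = clicks.var(ddof=1), pageviews.var(ddof=1)
    cov_xy = np.cov(clicks, pageviews)[0, 1]
    # Var(Xbar/Ybar) ~= (var_x/y^2 - 2*x*cov/y^3 + x^2*var_y/y^4) / n
    return math.sqrt((var_x / y**2 - 2*x*cov_xy / y**3 + x**2 * var_y / y**4) / n)
```

With heavy-clicking users in the data, the delta-method SE comes out noticeably larger than the naive Bernoulli SE, which is the point of the pitfall.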

7 Pitfall 4: Combining Data when Percent to Treatment Varies (Simpson's Paradox)
Simplified example: 1,000,000 users per day. For each individual day the Treatment is much better; however, the cumulative result for the Treatment is worse.

Conversion rate for two days:

             Friday (C/T split 99/1)    Saturday (C/T split 50/50)    Total
Control      20,000/990,000 = 2.02%     5,000/500,000 = 1.00%         25,000/1,490,000 = 1.68%
Treatment    230/10,000 = 2.30%         6,000/510,000 = 1.18%         6,230/520,000 = 1.20%

One situation where you would face this issue is when ramping up the percent of users in the Treatment: as you gain confidence that the Treatment does not have bugs or other problems, you increase the percent in the Treatment at the beginning of the experiment.
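A quick script confirming the paradox from the table above (the numbers are taken directly from the example):

```python
# Per-day (conversions, users) for Control and Treatment.
control = {"Fri": (20_000, 990_000), "Sat": (5_000, 500_000)}
treatment = {"Fri": (230, 10_000), "Sat": (6_000, 510_000)}

def rate(conversions, users):
    return conversions / users

for day in ("Fri", "Sat"):
    print(day, f"C={rate(*control[day]):.2%}", f"T={rate(*treatment[day]):.2%}")

# Pooling reverses the comparison because the 99/1-split day dominates Control.
def pooled(data):
    return (sum(c for c, _ in data.values()), sum(u for _, u in data.values()))

print("Total", f"C={rate(*pooled(control)):.2%}", f"T={rate(*pooled(treatment)):.2%}")
```

The Treatment wins on each day (2.30% vs 2.02%, 1.18% vs 1.00%) yet loses in the pooled total (1.20% vs 1.68%).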

8 Pitfall 5: Not Filtering out Robots
Internet sites can get a significant amount of robot traffic (search engine crawlers, harvesters, botnets, etc.), and robots can cause misleading results. We are most concerned about high-traffic robots (e.g. many clicks or page views) that stay in the Treatment or Control; we've seen one robot with more than 600,000 clicks in a month on one page. Identifying robots can be difficult: some robots identify themselves, but many look like human users and even execute JavaScript. Use heuristics to identify robots and remove them from the analysis (e.g. more than 100 clicks in an hour), as in the sketch below.
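A minimal sketch of that clicks-per-hour heuristic (the data layout, names, and threshold are assumptions for illustration):

```python
from collections import Counter

MAX_CLICKS_PER_HOUR = 100  # heuristic threshold from the slide

def find_robot_users(click_events):
    """click_events: iterable of (user_id, hour_bucket) tuples, one per click.

    Flags any user who exceeds the per-hour click threshold in any hour.
    """
    clicks_per_user_hour = Counter(click_events)
    return {user for (user, hour), n in clicks_per_user_hour.items()
            if n > MAX_CLICKS_PER_HOUR}

def filter_robots(click_events):
    """Drop all clicks from flagged users before computing experiment metrics."""
    robots = find_robot_users(click_events)
    return [(u, h) for (u, h) in click_events if u not in robots]
```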

9 Effect of Robots on A/A Experiment
[Chart: hourly click counts for an A/A experiment] Each hour represents clicks from thousands of users. The "spikes" can be traced to single "users" (robots).

10 Pitfall 6: Invalid Instrumentation
Validating initial instrumentation:
- Logging audit: compare experimentation observations with the recording system of record.
- A/A experiment: run a "mock" experiment where users are randomly assigned to two groups but get the Control in both. Expect about 5% of metrics to be statistically significant; p-values should be uniformly distributed on the interval (0,1), and no p-values should be very close to zero (e.g. < 0.001). See the simulation sketch after this list.
A surprising number of partners initially fail either the logging audit or the A/A experiment. It's easy to get a number from an experiment; it's much harder to get a correct number. Building trust in the result through audits is critical.
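An illustrative simulation of the A/A check (synthetic data, not the platform's actual pipeline): with the Control served to both groups, roughly 5% of metrics should be significant at the 0.05 level and the p-values should look uniform:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_metrics, n_users = 100, 2000

p_values = []
for _ in range(n_metrics):
    # Both groups receive the Control experience, so any difference is noise.
    group_a = rng.normal(size=n_users)
    group_b = rng.normal(size=n_users)
    p_values.append(stats.ttest_ind(group_a, group_b).pvalue)

p_values = np.array(p_values)
print(f"significant at 0.05: {(p_values < 0.05).mean():.1%}")  # expect ~5%
print(f"p-values < 0.001:    {(p_values < 0.001).sum()}")      # expect ~0
```

A markedly higher false-positive rate, or a cluster of near-zero p-values, points to an instrumentation or randomization bug rather than a real effect.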

11 Pitfall 7: Insufficient Experimental Control
You must make sure the only difference between the Treatment and the Control is the change being tested. In a recent experiment, the hourly click-through rate was plotted for T and C. The headlines were supposed to be the same in both, but one headline was different for one 7-hour period, changing the result of the experiment.

12 Experimentation is Easy!
But it requires vigilance and attention to detail. "Good judgment comes from experience, and a lot of that comes from bad judgment." (Will Rogers)

