Experimentation Challenges
Ensuring Trustworthiness
Accelerating innovation through trustworthy experimentation
Building a Duplo rocket is easy…
Internal validity & external validity
Challenge 1: Trust your data. Build a reliable data pipeline

Garbage in, garbage out: you must have complete trust in your telemetry and your data pipeline.
Solutions (a minimal completeness check is sketched below):
- Solid engineering: latency, completeness, reliability, robustness
- Real-time analytics, alerts, outlier detection
- Good metrics for data validity and upstream behavior
Examples:
- The Speller experiment detected a telemetry problem thanks to a rich set of data-quality metrics.
- Real-time counters of flight assignment.
- Missing-data problem ("no data left behind"), e.g. completeness monitoring and a double pipeline for fast and standard modes; there is a latency/completeness tradeoff.
- Stress-test: outage when TA exceeded capacity.
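As an illustration of the completeness checks above, here is a minimal sketch that compares event counts between a hypothetical fast (real-time) pipeline and the standard (complete) pipeline and raises an alert when too much data is missing. The pipeline names, counts, and 2% threshold are assumptions for illustration, not the actual ExP implementation.

```python
# Sketch: data-completeness check between a fast and a standard pipeline.
# Pipeline names, counts, and the 2% loss threshold are illustrative assumptions.

def completeness_alert(fast_count: int, standard_count: int,
                       max_loss_rate: float = 0.02) -> bool:
    """Return True if the fast pipeline is missing too many events
    relative to the standard (complete) pipeline."""
    if standard_count == 0:
        return True  # no data at all is itself an alert condition
    loss_rate = 1.0 - fast_count / standard_count
    return loss_rate > max_loss_rate

# Example: 9.6M events in the fast pipeline vs 10M in the standard one (4% loss).
print(completeness_alert(fast_count=9_600_000, standard_count=10_000_000))  # True
```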
Challenge 2: Where's my flock? Account for all users

It's easy to lose data on users. If you configure an n%/n% split (e.g. 50%/50%), your scorecard should show roughly equal numbers of users in Treatment and Control.
Examples: the "move the search box to the center" experiment; the EXO mailbox-id experiment (Control: 12 slides, Treatment: 16 slides).
Solution (a chi-squared SRM check is sketched below):
- Sample Ratio Mismatch (SRM) detection
- Validate your treatment assignment early
cf. A. Fabijan, J. Gupchup, S. Gupta, J. Omhover, W. Qin, L. Vermeer, P. Dmitriev, "Diagnosing Sample Ratio Mismatch in Online Controlled Experiments," in review at International Data Science Conference 2019.
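A minimal sketch of Sample Ratio Mismatch detection: a chi-squared goodness-of-fit test of the observed Treatment/Control user counts against the configured split. The counts and the 0.001 alert threshold are illustrative assumptions.

```python
# Sketch: Sample Ratio Mismatch (SRM) detection via a chi-squared test.
# The user counts and the p-value alert threshold are illustrative assumptions.
from scipy import stats

def srm_pvalue(control_users: int, treatment_users: int,
               expected_ratio: float = 0.5) -> float:
    """Chi-squared goodness-of-fit test of observed counts vs. the configured split."""
    total = control_users + treatment_users
    expected = [total * (1 - expected_ratio), total * expected_ratio]
    chi2, p = stats.chisquare([control_users, treatment_users], f_exp=expected)
    return p

# Example: a 50/50 experiment that collected 49,400 control vs 50,600 treatment users.
p = srm_pvalue(49_400, 50_600)
print(f"SRM p-value: {p:.2e}, alert: {p < 0.001}")  # flags an SRM
```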
Challenge 3: How random is random? Root out potential biases

Causality can only be established if the split is truly random; any hidden bias can affect your results (e.g. a carry-over effect from a previous experiment on the same users).
Solutions (a seedfinder-style sketch follows below):
- A/A testing
- Seedfinder
Figures: observed carry-over effect after a Bing experiment; observed differences between two A/A groups across 1 million randomizations.
cf. R. Kohavi, A. Deng, B. Frasca, R. Longbotham, T. Walker, Y. Xu, "Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained," in KDD 2012.
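A minimal sketch of hash-based bucketing plus a seedfinder-style search: hash user ids with each candidate seed and keep the seed whose A/A split shows the least pre-experiment imbalance. The hashing scheme, the pre-experiment metric, and the number of candidate seeds are illustrative assumptions, not the production implementation.

```python
# Sketch: hash-based assignment and a seedfinder-style search for a balanced split.
# Hash scheme, pre-experiment metric, and candidate-seed count are assumptions.
import hashlib
import random
from scipy import stats

def assign(user_id: str, seed: int) -> int:
    """Deterministically bucket a user into 0 (control) or 1 (treatment)."""
    digest = hashlib.sha256(f"{seed}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 2

# Simulated users with a pre-experiment metric (e.g. sessions in the prior week).
random.seed(0)
users = {f"user{i}": random.lognormvariate(0, 1) for i in range(5_000)}

def aa_pvalue(seed: int) -> float:
    """A/A t-test on the pre-experiment metric; a high p-value means good balance."""
    control = [m for uid, m in users.items() if assign(uid, seed) == 0]
    treatment = [m for uid, m in users.items() if assign(uid, seed) == 1]
    return stats.ttest_ind(control, treatment).pvalue

# Try candidate seeds and keep the one with the least pre-experiment imbalance
# (the slide's figure comes from 1 million randomizations; 100 keeps this fast).
best_seed = max(range(100), key=aa_pvalue)
print("chosen seed:", best_seed, "A/A p-value:", round(aa_pvalue(best_seed), 3))
```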
Challenge 4: The whole is greater than the sum of its parts. Isolate treatments and detect interactions

Experiment 1 changes the black font to blue; Experiment 2 changes the background around ads from white to blue. A user exposed to both gets blue text on a blue background, so the two experiments interact.
Solutions (an interaction-detection sketch follows below):
- Isolated numberlines: a user (uid1) is assigned independently on E1, E2, E3
- Interaction detection: check whether (M | E1, T2) = (M | E1, C2), i.e. whether metric M in E1 differs depending on the user's E2 assignment
About a dozen interactions per year are detected in Bing, with over 10,000 experiments run.
Complexity is O(#metrics * #experiments^2), so type I errors must be controlled (e.g. Bonferroni correction).
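A minimal sketch of the pairwise interaction check named above: within E1's treatment, compare a metric between users who are in E2's treatment and users who are in E2's control, and apply a Bonferroni correction over all metric/experiment-pair tests. The simulated data, effect size, and number of tests are illustrative assumptions.

```python
# Sketch: detect an interaction between two overlapping experiments E1 and E2 by
# comparing (M | E1 treatment, E2 treatment) with (M | E1 treatment, E2 control),
# i.e. the (M | E1, T2) vs (M | E1, C2) check from the slide.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 10_000
e1 = rng.integers(0, 2, n)                       # E1 assignment: 0 = control, 1 = treatment
e2 = rng.integers(0, 2, n)                       # E2 assignment on an independent numberline
metric = rng.normal(10, 2, n) - 0.5 * (e1 & e2)  # artificial interaction: drop only under both treatments

in_t1 = e1 == 1
m_given_t1_t2 = metric[in_t1 & (e2 == 1)]        # (M | E1 treatment, E2 treatment)
m_given_t1_c2 = metric[in_t1 & (e2 == 0)]        # (M | E1 treatment, E2 control)

p = stats.ttest_ind(m_given_t1_t2, m_given_t1_c2).pvalue
n_tests = 1_000                                  # assumed #metrics * #experiment-pairs tested
alpha = 0.05 / n_tests                           # Bonferroni-corrected threshold
print(f"p = {p:.2e}, interaction flagged: {p < alpha}")
```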
Challenge 5: Power-up! Increase the statistical power of your experiment

If I ship no feature change (an A/A), will I see any stat-sig movement in the scorecard? Type I (false positive) and type II (false negative) errors: how can we avoid them?
Solution: increase power (a CUPED sketch follows below)
- Triggering and counterfactual logging: the Speller movement was only detected on the triggered scorecard
- Get more users: experimental design
- Reduce variance: CUPED / variance-reduced (VR) metrics (figure: p-values from the standard and the variance-reduced metric)
Also see: R. Kohavi, "Triggering."
cf. A. Deng, Y. Xu, R. Kohavi, T. Walker, "Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data," in WSDM 2013.
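A minimal sketch of CUPED-style variance reduction in the spirit of the Deng et al. WSDM 2013 paper: adjust each user's metric by its pre-experiment value and run the t-test on the adjusted metric. The simulated data and the 0.05 true effect are illustrative assumptions.

```python
# Sketch: CUPED variance reduction using the pre-experiment value of the metric.
# Y_cuped = Y - theta * (X - mean(X)), with theta = cov(X, Y) / var(X).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 20_000
pre = rng.normal(10, 3, n)                        # X: the metric in the pre-period
treat = rng.integers(0, 2, n)                     # 0 = control, 1 = treatment
post = pre + rng.normal(0, 1, n) + 0.05 * treat   # Y: in-experiment metric, small true effect

theta = np.cov(pre, post)[0, 1] / np.var(pre, ddof=1)
cuped = post - theta * (pre - pre.mean())

p_standard = stats.ttest_ind(post[treat == 1], post[treat == 0]).pvalue
p_cuped = stats.ttest_ind(cuped[treat == 1], cuped[treat == 0]).pvalue
print(f"standard p = {p_standard:.3f}, variance-reduced p = {p_cuped:.2e}")
```

Because the pre-period covariate explains most of the variance in this toy data, the same small lift that is invisible in the standard metric becomes stat-sig in the variance-reduced one.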
Challenge 6: Keep metrics healthy. Implement and maintain correct metrics

Bad metrics either tell you nothing or lead you in the wrong direction.
Solutions (a p-value uniformity check is sketched below):
- Overall evaluation criterion (OEC): easy to move for the wrong reasons, hard to move for the right reasons
- Health checks, e.g. p-value uniformity under A/A tests
- Awareness of special cases: levels of aggregation, variance calculation (standard vs. Delta method), percentile metrics
Also see: R. Kohavi, "The Overall Evaluation Criterion (OEC)."
Figure: 20 metrics in 120 A/A experiments combined.
cf. A. Deng, J. Lu, J. Litz, "Trustworthy Analysis of Online A/B Tests: Pitfalls, challenges and solutions," in WSDM 2017.
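A minimal sketch of the p-value uniformity health check mentioned above: under repeated A/A tests, a healthy metric's p-values should be uniform on [0, 1], which a Kolmogorov-Smirnov test can verify. The single simulated metric, the 120 A/A runs, and the sample sizes are illustrative assumptions (the slide's figure combines 20 metrics over 120 A/A experiments).

```python
# Sketch: p-value uniformity health check for one metric across many A/A tests.
# Under A/A, a healthy metric/analysis yields p-values uniform on [0, 1].
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def one_aa_pvalue(n_users: int = 2_000) -> float:
    """One simulated A/A experiment: both groups draw from the same distribution."""
    control = rng.exponential(scale=2.0, size=n_users)
    treatment = rng.exponential(scale=2.0, size=n_users)
    return stats.ttest_ind(control, treatment).pvalue

pvalues = [one_aa_pvalue() for _ in range(120)]      # e.g. 120 A/A experiments
ks_stat, ks_p = stats.kstest(pvalues, "uniform")     # compare against Uniform(0, 1)
print(f"KS p-value: {ks_p:.3f}")  # a tiny value suggests the metric or its test is unhealthy
```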
Challenge 7: Dig deep. Provide insights, be meticulous, investigate the unexpected

It's easy to misuse the power of a well-built experimentation platform and draw incorrect inferences, e.g. "green jelly beans linked to acne" (p-hacking).
"Any figure that looks interesting or different is usually wrong." --Twyman's law
Solutions (a small p-hacking simulation follows below):
- Solid statistical understanding and an ongoing education effort (p-values)
- Segments provide automated insights (e.g. segments of interest, movements at different aggregation levels, A/A tagging, low sample size)
- Validate the primary effects first, then analyze the secondary ones
Examples: drill-down metrics provide insights; segmentation by date may trigger an automated segment-of-interest notification. If you run an experiment that shows around 10% more ads on the page, you may be tempted to look straight at the revenue numbers.
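A small simulation of the "green jelly beans" trap: testing many independent segments of an A/A experiment at alpha = 0.05 makes at least one spurious stat-sig result very likely. The 20 segments, sample sizes, and number of repeated runs are illustrative assumptions.

```python
# Sketch: why uncontrolled segment drill-downs produce spurious "wins" (p-hacking).
# With 20 independent segments and no real effect, P(at least one p < 0.05) is about 64%.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n_segments, n_users, alpha = 20, 1_000, 0.05

false_positive_runs = 0
for _ in range(500):                                 # repeat the A/A analysis 500 times
    p_per_segment = [
        stats.ttest_ind(rng.normal(size=n_users), rng.normal(size=n_users)).pvalue
        for _ in range(n_segments)
    ]
    false_positive_runs += min(p_per_segment) < alpha

print(f"fraction of A/A runs with a 'significant' segment: {false_positive_runs / 500:.2f}")
# Expected ~1 - 0.95**20 = 0.64, which is why primary effects are validated first.
```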
Challenge 8: The benevolent dictator. Protect your users

As the scale of the product and of the experimentation service grows, the potential for harm can be significant!
"If you have to kiss a lot of frogs to find a prince, find more frogs and kiss them faster and faster." --Mike Moran, Do It Wrong Quickly
Solutions (a guardrail-alert sketch follows below):
- Fast auto-detection, alerting, and shutdown. The challenge is to be fast, which requires good engineering and near-real-time analytics; beware that data can be noisy at first.
- Start small: "insider" rings, staged rollouts. But beware of Simpson's paradox.
Figure: alert for a high number of crashes in Office.
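A minimal sketch of the kind of fast guardrail check described above: a two-proportion z-test on crash rate between treatment and control that raises an alert (and suggests shutdown) when the treatment crash rate is significantly higher. Counts and the threshold are illustrative assumptions, not the actual Office alerting pipeline.

```python
# Sketch: near-real-time guardrail alert on crash rate (two-proportion z-test).
# Counts and the z-score alert threshold are illustrative assumptions.
import math

def crash_alert(crashes_t: int, users_t: int, crashes_c: int, users_c: int,
                z_threshold: float = 3.0) -> bool:
    """Alert if the treatment crash rate exceeds control by more than z_threshold sigmas."""
    p_t, p_c = crashes_t / users_t, crashes_c / users_c
    p_pool = (crashes_t + crashes_c) / (users_t + users_c)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / users_t + 1 / users_c))
    z = (p_t - p_c) / se
    return z > z_threshold

# Example: early, noisy data -- 80 crashes in 50k treatment users vs 35 in 50k control users.
print(crash_alert(80, 50_000, 35, 50_000))  # True: shut down and investigate
```

A high z-threshold is one way to tolerate the early-data noise the slide warns about while still reacting quickly to real regressions.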
Challenge 9: Eat exotic foods. Hard but pretty cool scenarios

Network effects (e.g. @mentions in MS Word comments): how can you maintain a clean control group without creating user dissatisfaction (DSAT)?
- New approaches to randomization (a tenant-level sketch follows below)
- Novel analyses to eliminate the cross-treatment effect
- Meta-level experiments
- Sometimes you'll need to accept defeat
Enterprise-level experimentation:
- Metric design challenges (power, variance)
- Experiment design challenges (randomization, biases)
Figure: distribution of tenant size.
cf. S. Liu, A. Fabijan, M. Furchtgott, S. Gupta, P. Janowski, W. Qin, P. Dmitriev, "Enterprise Level Controlled Experiments at Scale," in review at SEAA 2019.
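A minimal sketch of one possible "new approach to randomization" for network effects: randomize at the tenant (cluster) level rather than the user level, so that users who collaborate (e.g. @mention each other in Word) always land in the same variant and cross-treatment contamination is avoided. The hashing scheme and identifiers are illustrative assumptions.

```python
# Sketch: tenant-level (cluster) randomization to contain network effects.
# All users of a tenant get the same variant, so collaborators never see a mix
# of treatment and control. The hashing scheme here is an illustrative assumption.
import hashlib

def tenant_variant(tenant_id: str, experiment_seed: str) -> str:
    """Deterministically assign an entire tenant to one variant."""
    digest = hashlib.sha256(f"{experiment_seed}:{tenant_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 else "control"

def user_variant(user_id: str, tenant_id: str, experiment_seed: str) -> str:
    """Users simply inherit their tenant's assignment."""
    return tenant_variant(tenant_id, experiment_seed)

# Example: two colleagues in the same tenant always share a variant.
print(user_variant("alice", "contoso", "atmention-exp-01"),
      user_variant("bob", "contoso", "atmention-exp-01"))
```

Randomizing by tenant reduces contamination, but it also shrinks the effective sample size and inflates variance (tenant sizes are highly skewed), which is exactly the metric- and experiment-design challenge the slide calls out.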
Challenge 10: The Cultural Challenge
Why do people and orgs avoid controlled experiments?
- Some believe it threatens their job as decision makers.
- It may seem like poor performance if an idea you had is proven wrong.
- It's easier to declare success when the feature launches: "We know what we're doing."
"It is difficult to get a man to understand something when his salary depends upon his not understanding it." --Upton Sinclair
HiPPO = Highest Paid Person's Opinion. Hippos kill more humans than any other (non-human) land mammal. Listen to the customers and don't let the HiPPO kill good ideas.
EXP website: