Experimentation Challenges


Experimentation Challenges Ensuring Trustworthiness

Accelerating innovation through trustworthy experimentation
Building a Duplo rocket is easy…
Internal Validity & External Validity

Challenge 1: Trust your data. Build a reliable data pipeline.
Garbage in, garbage out: you must have complete trust in your telemetry and pipeline.
Solutions:
- Solid engineering (latency, completeness, reliability, robustness)
- Real-time analytics, alerts, outlier detection
- Good metrics for data validity and upstream behavior
Examples: the Speller experiment detected a telemetry problem thanks to a rich set of data-quality metrics; real-time counters of flight assignment; a missing-data problem ("no data left behind"); a dual pipeline for fast and standard modes as an example of the latency/completeness trade-off; stress-testing after an outage when TA exceeded capacity.
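The slide lists outlier detection as one data-validity tool. Below is a minimal, illustrative sketch (not the platform's actual check) that flags anomalies in a real-time counter using a MAD-based robust z-score; the counter values and threshold are made up.

```python
import numpy as np

def robust_outliers(counts, z_threshold=5.0):
    """Return indices of points whose MAD-based z-score exceeds the threshold."""
    counts = np.asarray(counts, dtype=float)
    median = np.median(counts)
    mad = np.median(np.abs(counts - median)) or 1.0   # guard against zero MAD
    z = 0.6745 * (counts - median) / mad              # robust z-score
    return np.flatnonzero(np.abs(z) > z_threshold)

# Hypothetical events-per-minute stream with one dropped-telemetry minute.
events_per_minute = [980, 1012, 995, 1003, 0, 997, 1021]
print(robust_outliers(events_per_minute))  # -> [4]
```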

Challenge 2: Where's my flock? Account for all users.
It's easy to lose data on users. If you run a 50%/50% experiment, your scorecard should have equal numbers of users in Treatment and Control.
Example: the search-box-to-center experiment (EXO mailbox-id randomization). Control: 12 slides, Treatment: 16 slides.

Challenge 2: Where's my flock? Account for all users.
It's easy to lose data on users. If you split users n%/n%, your scorecard should have equal numbers of users in Treatment and Control.
Solution: Sample Ratio Mismatch (SRM) detection; validate your treatment assignment early.
cf. A. Fabijan, J. Gupchup, S. Gupta, J. Omhover, W. Qin, L. Vermeer, P. Dmitriev, "Diagnosing Sample Ratio Mismatch in Online Controlled Experiments," in review at International Data Science Conference 2019.
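A minimal sketch of SRM detection as the slide describes it: a chi-squared test of observed user counts against the configured split. The user counts and the 0.001 threshold below are illustrative assumptions, not numbers from the talk.

```python
from scipy import stats

def srm_check(control_users, treatment_users, expected_ratio=0.5, alpha=0.001):
    """Chi-squared test for Sample Ratio Mismatch between two variants."""
    total = control_users + treatment_users
    expected = [total * (1 - expected_ratio), total * expected_ratio]
    _, p_value = stats.chisquare([control_users, treatment_users], f_exp=expected)
    return p_value, p_value < alpha  # True -> likely SRM; do not trust the scorecard

p, srm = srm_check(821_588, 815_482)  # hypothetical scorecard counts
print(f"p = {p:.3g}, SRM detected: {srm}")
```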

Challenge 3: How random is random? Root out potential biases.
Causality can only be established if the split is random; any hidden bias can affect your results (e.g., carry-over effects).
Solutions: A/A testing; Seedfinder.
Examples: a carry-over effect observed after a Bing experiment; observed differences between two A/A groups across 1 million randomizations.
cf. R. Kohavi, A. Deng, B. Frasca, R. Longbotham, T. Walker, Y. Xu, "Trustworthy Online Controlled Experiments: Five Puzzling Outcomes Explained," in KDD 2012.
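A rough sketch of the A/A and Seedfinder ideas: bucket users deterministically with a salted hash, then check that a pre-experiment metric is balanced for each candidate seed. All data and names here are synthetic; a fuller version would evaluate many seeds against many metrics.

```python
import hashlib
import numpy as np
from scipy import stats

def assign(user_id: str, seed: str) -> int:
    """Deterministic 50/50 bucket from a salted hash of the user id."""
    h = hashlib.md5(f"{seed}:{user_id}".encode()).hexdigest()
    return int(h, 16) % 2  # 0 = control, 1 = treatment

rng = np.random.default_rng(42)
users = [f"u{i}" for i in range(20_000)]
pre_metric = rng.lognormal(mean=0.0, sigma=1.0, size=len(users))  # synthetic pre-period metric

# Seedfinder-style check: prefer a seed with no pre-experiment imbalance.
for seed in ["s1", "s2", "s3"]:
    groups = np.array([assign(u, seed) for u in users])
    _, p = stats.ttest_ind(pre_metric[groups == 0], pre_metric[groups == 1])
    print(f"seed={seed}  A/A p-value={p:.3f}")  # a very small p suggests a biased split
```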

Challenge 4: The whole is greater than the sum of its parts. Isolate treatments and detect interactions.
Experiment 1 changes the black font to blue; Experiment 2 changes the background around ads from white to blue.
Solutions: isolated numberlines; interaction detection.
A single user (uid1) may be in E1, E2, and E3 at the same time. Bing sees about a dozen interactions per year, with over 10,000 experiments run.

Challenge 4: The whole is greater than the sum of its parts. Isolate treatments and detect interactions.
Experiment 1 changes the black font to blue; Experiment 2 changes the background around ads from white to blue.
Solutions: isolated numberlines; interaction detection, e.g. comparing (M | E1, T2) to (M | E1, C2) across the variant pairs T1, T2, C1, C2.
Bing sees about a dozen interactions per year, with over 10,000 experiments run.
Complexity is O(#metrics × #experiments²), so Type I errors must be controlled (e.g., Bonferroni correction).
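A sketch of pairwise interaction detection following the slide's (M | E1, T2) vs. (M | E1, C2) comparison. The column names, synthetic data, and Bonferroni budget are illustrative assumptions, not the platform's actual implementation.

```python
import numpy as np
import pandas as pd
from scipy import stats

def interaction_pvalue(df: pd.DataFrame, metric: str) -> float:
    """Restrict to users exposed to E1's treatment and test whether the metric
    differs by E2 assignment; a significant difference flags a potential interaction."""
    exposed = df[df["e1_variant"] == "T"]
    t2 = exposed.loc[exposed["e2_variant"] == "T", metric]
    c2 = exposed.loc[exposed["e2_variant"] == "C", metric]
    _, p = stats.ttest_ind(t2, c2, equal_var=False)
    return p

# With M metrics and N experiments there are O(M * N^2) tests, so correct for
# multiple comparisons, e.g. Bonferroni over all metric/experiment-pair tests.
alpha, num_metrics, num_pairs = 0.05, 20, 100
threshold = alpha / (num_metrics * num_pairs)

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "e1_variant": rng.choice(["C", "T"], size=10_000),
    "e2_variant": rng.choice(["C", "T"], size=10_000),
    "clicks": rng.poisson(2.0, size=10_000),
})
print(interaction_pvalue(df, "clicks"), "flag if below", threshold)
```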

Challenge 5: Power-up! Increase the statistical power of your experiment.
If I have no feature (an A/A test), will I see any stat-sig movement in the scorecard? Type I (false positive) and Type II (false negative) errors: how can we avoid them?
Solution: increase power.
- Triggering and counterfactual logging
- Get more users: experimental design
- Reduce variance: CUPED / variance-reduced metrics
Also see: R. Kohavi, "Triggering."
The Speller movement was only detected on the triggered scorecard.

Challenge 5: Power-up! Increase the statistical power of your experiment.
If I have no feature (an A/A test), will I see any stat-sig movement in the scorecard? Type I (false positive) and Type II (false negative) errors: how can we avoid them?
Solution: increase power.
- Triggering and counterfactual logging
- Get more users: experimental design
- Reduce variance: CUPED / variance-reduced metrics
p-values from the standard and the variance-reduced metric.
cf. A. Deng, Y. Xu, R. Kohavi, T. Walker, "Improving the Sensitivity of Online Controlled Experiments by Utilizing Pre-Experiment Data," in WSDM 2013.
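A minimal CUPED sketch in the spirit of the cited WSDM 2013 paper: regress the experiment metric on a pre-experiment covariate and analyze the adjusted metric, which has a smaller standard error. All data below is synthetic.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000
pre = rng.normal(10, 3, size=n)                        # pre-experiment metric (covariate)
treat = rng.integers(0, 2, size=n)                     # 0 = control, 1 = treatment
post = pre + rng.normal(0, 2, size=n) + 0.05 * treat   # experiment-period metric

theta = np.cov(post, pre, ddof=1)[0, 1] / np.var(pre, ddof=1)  # regression coefficient
adjusted = post - theta * (pre - pre.mean())                   # CUPED-adjusted metric

for name, y in [("standard", post), ("CUPED", adjusted)]:
    delta = y[treat == 1].mean() - y[treat == 0].mean()
    se = np.sqrt(y[treat == 1].var(ddof=1) / (treat == 1).sum()
                 + y[treat == 0].var(ddof=1) / (treat == 0).sum())
    print(f"{name:8s} delta={delta:.4f}  se={se:.4f}")  # CUPED yields a smaller SE
```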

Challenge 6: Keep metrics healthy. Implement and maintain correct metrics.
Bad metrics either tell you nothing or lead you in the wrong direction.
Solutions:
- Overall evaluation criterion (OEC): easy to move for the wrong reasons, hard to move for the right reasons
- Health checks, e.g. p-value uniformity
- Awareness of special cases: levels of aggregation, variance calculation (standard vs. delta method), percentile metrics
Also see: R. Kohavi, "The Overall Evaluation Criterion (OEC)."
20 metrics in 120 A/A experiments combined.
cf. A. Deng, J. Lu, J. Litz, "Trustworthy Analysis of Online A/B Tests: Pitfalls, Challenges and Solutions," in WSDM 2017.
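One concrete health check mentioned above is p-value uniformity: under A/A experiments, p-values for a healthy metric should be uniform on [0, 1], and a Kolmogorov-Smirnov test can flag deviations. The p-values below are synthetic stand-ins, not the slide's 20-metric/120-run data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
aa_pvalues = rng.uniform(0, 1, size=120)   # replace with observed A/A p-values for one metric
ks_stat, ks_p = stats.kstest(aa_pvalues, "uniform")
print(f"KS p-value = {ks_p:.3f}")          # a small value suggests the metric or its variance estimate is unhealthy
```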

Challenge 7: Dig deep. Provide insights, be meticulous, investigate the unexpected.
It's easy to misuse the power of a well-built experimentation platform to draw incorrect inferences (e.g., green jelly beans linked to acne: p-hacking).
"Any figure that looks interesting or different is usually wrong." -- Twyman's law
Solutions:
- Solid statistical understanding and an education effort (p-values)
- Segments provide automated insights (e.g., segments of interest, movements at different aggregation levels, A/A tagging, low sample size)
- Validate primary effects, then analyze secondary ones
Drill-down metrics provide insights: if you run an experiment that shows around 10% more ads on the page, you may be tempted to look straight at the revenue numbers.

Challenge 7: Dig deep. Provide insights, be meticulous, investigate the unexpected.
It's easy to misuse the power of a well-built experimentation platform to draw incorrect inferences (e.g., green jelly beans linked to acne: p-hacking).
"Any figure that looks interesting or different is usually wrong." -- Twyman's law
Solutions:
- Solid statistical understanding and an education effort (p-values)
- Segments provide automated insights (e.g., segments of interest, movements at different aggregation levels, A/A tagging, low sample size)
- Validate primary effects, then analyze secondary ones
Segmentation by date may trigger an automated segment-of-interest notification. If you run an experiment that shows around 10% more ads on the page, you may be tempted to look straight at the revenue numbers.

Challenge 8: The benevolent dictator. Protect your users.
As the scale of the product and of the experimentation service grows, the potential for harm can be significant.
"If you have to kiss a lot of frogs to find a prince, find more frogs and kiss them faster and faster." -- Mike Moran, Do It Wrong Quickly
Solutions:
- Fast auto-detection, alerting, and shutdown. The challenge is to be fast, which requires good engineering and near-real-time analytics. But beware: data can be noisy at first.
- Start small: "insider" rings, staged rollouts. But beware of Simpson's paradox.
Example: an alert for a high number of crashes in Office.
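The Simpson's paradox risk in staged rollouts is easy to reproduce with made-up numbers: the treatment can win on every day of the ramp yet look worse when the days are pooled, because most treatment traffic arrives during a low-conversion period. The figures below are purely illustrative.

```python
# (control_users, control_conv_rate, treatment_users, treatment_conv_rate) per ramp stage
ramp = [
    (990_000, 0.020, 10_000, 0.022),    # day 1: 1% ramp, treatment better
    (500_000, 0.010, 500_000, 0.012),   # day 2: 50% ramp, treatment better
]
c_users = sum(cu for cu, _, _, _ in ramp)
t_users = sum(tu for _, _, tu, _ in ramp)
c_conv = sum(cu * cr for cu, cr, _, _ in ramp)
t_conv = sum(tu * tr for _, _, tu, tr in ramp)
print(f"pooled control   rate = {c_conv / c_users:.4f}")   # 0.0166
print(f"pooled treatment rate = {t_conv / t_users:.4f}")   # 0.0122 -- looks worse despite winning each day
```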

Challenge 9: Eat exotic foods. Hard but pretty cool scenarios.
Network effects (e.g., @mentions in MS Word comments): how can you maintain a clean control group without creating user dissatisfaction (DSAT)? New approaches to randomization; novel analysis to eliminate the cross-treatment effect; meta-level experiments. Sometimes you'll need to accept defeat.
Enterprise-level experimentation: metric design challenges (power, variance) and experiment design challenges (randomization, biases).
The @mention feature in MS Word deals with network effects.

Challenge 9: Eat exotic foods. Hard but pretty cool scenarios.
Network effects (e.g., @mentions in MS Word comments): how can you maintain a clean control group without creating user dissatisfaction (DSAT)? New approaches to randomization; novel analysis to eliminate the cross-treatment effect; meta-level experiments. Sometimes you'll need to accept defeat.
Enterprise-level experimentation: metric design challenges (power, variance) and experiment design challenges (randomization, biases).
Distribution of tenant size.
cf. S. Liu, A. Fabijan, M. Furchtgott, S. Gupta, P. Janowski, W. Qin, P. Dmitriev, "Enterprise Level Controlled Experiments at Scale," in review at SEAA 2019.
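One way to picture the "new approaches to randomization" idea for enterprise and network-effect scenarios is tenant-level (cluster) assignment, so interactions such as @mentions stay within one variant. This is an illustrative sketch, not the team's actual design; all names are hypothetical. The trade-off is fewer, heterogeneous randomization units, which is exactly the power/variance challenge the slide mentions.

```python
import hashlib

def tenant_variant(tenant_id: str, experiment_salt: str) -> str:
    """Assign a whole tenant to a variant via a salted hash."""
    h = int(hashlib.sha256(f"{experiment_salt}:{tenant_id}".encode()).hexdigest(), 16)
    return "treatment" if h % 2 else "control"

def user_variant(user_id: str, tenant_id: str, experiment_salt: str) -> str:
    """Every user in a tenant inherits the tenant's variant."""
    return tenant_variant(tenant_id, experiment_salt)

print(user_variant("alice", "contoso", "word-mentions-2019"))
```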

Challenge 10: The cultural challenge. Why people and organizations avoid controlled experiments.
- Some believe it threatens their job as decision makers.
- It may seem like poor performance if an idea you had is proven wrong.
- It's easier to declare success when the feature launches.
- "We know what we're doing."
"It is difficult to get a man to understand something when his salary depends upon his not understanding it." -- Upton Sinclair
HiPPO = Highest Paid Person's Opinion. Hippos kill more humans than any other (non-human) land mammal. Listen to the customers and don't let the HiPPO kill good ideas.

EXP website: https://exp-platform.com/