1 The Quest for the Optimal Experiment RecSys 10-06-14.

Presentation transcript:

The Quest for the Optimal Experiment RecSys

‘Science & Algorithms’ at Netflix
[Diagram labels: Causation, Correlation]
- Experimentation: science, methodology, and statistical analysis of experiments
- Algorithm R&D: mathematical algorithms that get embedded into automated processes, such as our recommendation system
- Predictive models: standalone mathematical models to support decision making (e.g. title demand prediction)

Numbers shown in this presentation are not representative of Netflix’s overall metric values

Netflix Experimentation: Common
- “Product” is a set of controlled, randomized experiments, many running at once
- Experiment in all areas
- Plenty of rigor and attention around statistics, metrics, analysis

Netflix Experimentation: Distinctive
- Core to culture (not just process)
- Curated approach
- Decisions not automated
- Scrutiny of each test (and by many people)
- Paying customers who are always logged in
- Monthly subscription
- Tests last several months
- Sampling (test allocation) of new members can take weeks or even months
- Many devices

Retention is our core metric (OEC)
- Continually improve member enjoyment

Streaming Hours is our main engagement metric

Streaming measurement: Streaming score
Probability of retaining at each future billing cycle, based on streaming S hours at N days of tenure
[Chart axis labels: Total hours consumed during N days of membership; Retention]
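One way to read the streaming score is as an empirical lookup: group past members by hours streamed in their first N days, and take the observed retention rate in each group. The sketch below assumes that reading. The function names, bucket edges, and data are hypothetical and invented for illustration; the slides do not say how the score is actually computed.

```python
from bisect import bisect_right
from collections import defaultdict


def fit_retention_by_hours(hours, retained, bucket_edges):
    """Empirical retention rate per hours bucket.

    hours: total hours streamed in the first N days, one value per member
    retained: 1 if the member retained at the billing cycle of interest, else 0
    bucket_edges: ascending cut points in hours (hypothetical values)
    """
    counts = defaultdict(int)
    kept = defaultdict(int)
    for h, r in zip(hours, retained):
        bucket = bisect_right(bucket_edges, h)  # index of the bucket containing h
        counts[bucket] += 1
        kept[bucket] += r
    return {bucket: kept[bucket] / counts[bucket] for bucket in counts}


def streaming_score(member_hours, bucket_edges, rates):
    """Look up the retention rate of the member's bucket (None if bucket unseen)."""
    return rates.get(bisect_right(bucket_edges, member_hours))


# Invented toy data, not Netflix numbers.
hours = [0, 1, 3, 6, 12, 20, 25, 40]
retained = [0, 0, 1, 0, 1, 1, 1, 1]
edges = [2, 10]  # buckets: under 2 hours, 2 to under 10 hours, 10 hours and up

rates = fit_retention_by_hours(hours, retained, edges)
print(streaming_score(5, edges, rates))  # 0.5
```

A smoothed model (for example a logistic fit) would avoid hard bucket boundaries; the bucketed version is used here only because it is easy to inspect.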

Streaming measurement: KS visual & Mann-Whitney U test statistic
[Chart annotation: KS test statistic]
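For reference, both statistics named on this slide can be computed directly from two samples of streaming hours. The following is a minimal pure-Python sketch, not the presenters' implementation. The cell names and data are invented, and the normal approximation for U ignores the tie correction.

```python
from math import sqrt


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def mann_whitney_u(a, b):
    """U for sample a: number of pairs with a-value > b-value, ties counted as 0.5.

    Quadratic in the sample sizes; fine for illustration, too slow for large cells.
    """
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


def mann_whitney_z(u, n_a, n_b):
    """Normal approximation for U (no tie correction)."""
    mean = n_a * n_b / 2
    sd = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    return (u - mean) / sd


# Invented streaming hours for two test cells.
control = [1.0, 2.0, 3.0, 4.0]
treatment = [3.5, 4.5, 5.5, 6.5]

d = ks_statistic(control, treatment)
u = mann_whitney_u(treatment, control)
z = mann_whitney_z(u, len(treatment), len(control))
print(d, u, z)
```

The KS statistic is sensitive to any difference in distribution shape, while U responds mainly to one cell tending to stream more than the other, which is one reason to look at both.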

Streaming measurement: Thresholds with z-tests for proportions
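A threshold comparison of this kind can be sketched as: count the members in each cell who streamed more than some number of hours, then compare the two shares with a pooled two-proportion z-test. The threshold, counts, and function names below are assumptions made for illustration.

```python
from math import erfc, sqrt


def share_above(hours, threshold):
    """Return (members above threshold, total members) for one test cell."""
    return sum(1 for h in hours if h > threshold), len(hours)


def two_proportion_z(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


# Invented counts: members streaming more than a hypothetical hours threshold.
z, p = two_proportion_z(400, 1000, 450, 1000)
print(round(z, 3), round(p, 4))
```

Repeating the test at several thresholds shows where in the distribution a treatment moves members, at the cost of multiple comparisons that need to be accounted for.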

Much experimentation on the recommender system
- Row selection
- Video ranking
- Video-video similarity
- User-user similarity
- Search recommendations
- Popularity vs personalization
- Diversity
- Novelty/Freshness
- Evidence

Sample and Subject Purity

Same test, different populations

Who should Netflix sample?
Geography
- Global
- US
- International
- Region-specific
Tenure
- 1 month (free trial)
- 2-6 months
- 7+ months
Classes of experience with Netflix
- Signups who are not rejoining members
- Rejoining members
- Existing members (any tenure)
- Existing members who are beyond their free trial
- Newly activating a device

Two considerations
1. For whom/what do you want to optimize?
2. Who will experience the winning test experience that gets launched?

“New members” by country region
[Chart; axis label: Time]

Membership by tenure
[Chart; series: Longer tenure, Medium tenure, Free trial; axis label: Time]

Hard to impact long-tenured members
[Chart; Cancel Rate for Long tenure, Medium tenure, Free trial]

Current favored samples in algorithm testing
- Global signups who are not rejoining within a year
- Secondarily:
  - US existing members who are beyond their free trial
  - International (non-US) existing members who are beyond their free trial

Addressing Sampling Bias
- Stratified sampling on attributes that are:
  - Correlated with core metric
  - Independent of the test treatment
- Regression tests for any systematic randomization process
- Bias monitoring for each test’s sample
- Large sample sizes
- Re-testing
- Good judgment to recognize that the “story” makes sense
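The stratified sampling and bias monitoring items above can be illustrated with a small allocation routine: shuffle members within each stratum, deal them across test cells in turn, then count cell membership per stratum as a simple balance check. This is a sketch under assumed names and data, not a description of Netflix's allocation system.

```python
import random
from collections import Counter, defaultdict


def stratified_allocate(members, stratum_of, cells, seed=0):
    """Assign each member to a cell so every stratum is spread evenly across cells."""
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for m in members:
        by_stratum[stratum_of(m)].append(m)

    allocation = {}
    for stratum in sorted(by_stratum):
        group = by_stratum[stratum]
        rng.shuffle(group)                    # random order within the stratum
        offset = rng.randrange(len(cells))    # random starting cell, so no cell is favored
        for k, m in enumerate(group):
            allocation[m] = cells[(k + offset) % len(cells)]
    return allocation


def counts_by_stratum(allocation, stratum_of):
    """Balance check: number of members per (stratum, cell) pair."""
    return Counter((stratum_of(m), cell) for m, cell in allocation.items())


# Invented example: 12 members, stratified by a hypothetical region attribute.
region = {f"m{i}": ("US" if i < 6 else "INTL") for i in range(12)}
allocation = stratified_allocate(list(region), region.get, ["control", "treatment"])
balance = counts_by_stratum(allocation, region.get)
print(balance)
```

The stratum attribute should be fixed before allocation (region, signup channel, and so on), so that it cannot be affected by the treatment itself.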

In the words of Nate Silver
On predicting the 2008 recession in a world of noisy data and dependent variables:
“Not only was Hatzius’s forecast correct, but it was also right for the right reasons, explaining the causes of the collapse and anticipating the effects. Hatzius refers to this chain of cause and effect as a “story”… In contrast, if you just look at the economy as a series of variables and equations without any underlying structure, you are almost certain to mistake noise for a signal…”
The Signal and the Noise: Why So Many Predictions Fail – but Some Don’t, by Nate Silver

Short- versus long-term engagement metrics

Short-term metrics we consider
- Daily cancel requests
- Daily streaming hours
- Daily visits
- Session length
- Failed sessions (no play)
- “Take rates” (CTR where the click is to play)
  - Page-level
  - Row-level
  - Title-level

Statistically significant differences in churn rarely stabilize until after Day [...]
[Chart; axis label: Test Duration]

Short-term metrics we consider
- Daily cancel requests
- Daily streaming hours
- Daily visits
- Session length
- Failed sessions (no play)
- “Take rates” (CTR where the click is to play)
  - Page-level
  - Row-level
  - Title-level

How well do your short-term metrics correlate with your OEC, and how much improvement do you see in that correlation if you increase the time interval?
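One simple way to put numbers on this question is to compute the correlation between a short-term metric and the OEC separately for each measurement window. The sketch below uses Pearson correlation on invented per-member values. The window names, the data, and the choice of Pearson correlation are all assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_by_window(metric_by_window, oec):
    """Correlation of the metric with the OEC for each measurement window."""
    return {window: pearson(values, oec) for window, values in metric_by_window.items()}


# Invented toy data: six members, retention flag (the OEC) and streaming hours
# measured over two hypothetical windows.
retained = [0, 0, 1, 1, 1, 0]
hours_by_window = {
    "7d": [1, 4, 2, 5, 3, 3],
    "30d": [2, 5, 10, 14, 12, 4],
}

r = correlation_by_window(hours_by_window, retained)
print({window: round(value, 3) for window, value in r.items()})
```

With real data the comparison is more useful at the test-cell level (does the metric's treatment effect predict the OEC's treatment effect?), which needs many past tests rather than many members.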

Streaming signal that appears over time
[Chart panels: 1 Week, 1 Month, 2 Months]

Or disappears over time
[Chart panels: 1 Week, 1 Month, 2 Months]

Ability to predict 4-month retention using streaming hours improves with longer-term data
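The slide does not say how predictive ability was measured. One common choice, assumed here, is the area under the ROC curve (AUC): the probability that a randomly chosen retained member streamed more than a randomly chosen member who cancelled. The data and window names in this sketch are invented.

```python
def auc(scores, labels):
    """AUC by pairwise comparison: share of (retained, cancelled) pairs ranked correctly.

    Ties count as half. Quadratic in sample size; intended only as an illustration.
    """
    positives = [s for s, label in zip(scores, labels) if label == 1]
    negatives = [s for s, label in zip(scores, labels) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Invented toy data: 4-month retention flag and streaming hours over two windows.
retained_4m = [0, 0, 1, 1, 1, 0]
hours_1_week = [1, 4, 2, 5, 3, 3]
hours_1_month = [2, 5, 10, 14, 12, 4]

auc_week = auc(hours_1_week, retained_4m)
auc_month = auc(hours_1_month, retained_4m)
print(round(auc_week, 3), round(auc_month, 3))
```

An AUC of 0.5 means streaming hours carry no ranking information about retention; values closer to 1 mean the window is long enough for the metric to separate retained from cancelled members.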

Key Takeaways
- Exercise rigor in selecting the population to sample; it should be representative of:
  - The population you want to optimize for
  - The population that will receive the experience if launched
- Remain open-minded about changing the target population as business shifts occur
- Address bias on an ongoing basis
- Know and apply the time duration necessary for your OEC to stabilize
- Additional short-term metrics need to have sufficient duration to correlate well with your OEC