1 The Quest for the Optimal Experiment RecSys 10-06-14

2 ‘Science & Algorithms’ at Netflix
Causation / Correlation
- Experimentation: science, methodology, and statistical analysis of experiments
- Algorithm R&D: mathematical algorithms that get embedded into automated processes, such as our recommendation system
- Predictive models: standalone mathematical models to support decision making (e.g. title demand prediction)

3 Numbers shown in this presentation are not representative of Netflix’s overall metric values

4 Netflix Experimentation: Common
- “Product” is a set of controlled, randomized experiments, many running at once
- Experiment in all areas
- Plenty of rigor and attention around statistics, metrics, analysis

5 Netflix Experimentation: Distinctive
- Core to culture (not just process)
- Curated approach
  - Decisions not automated
  - Scrutiny of each test (and by many people)
- Paying customers who are always logged in
  - Monthly subscription
- Tests last several months
  - Sampling (test allocation) of new members can take weeks or even months
- Many devices

6 Retention is our core metric (OEC)
- Continually improve member enjoyment

7 Streaming Hours is our main engagement metric

8 Streaming measurement: Streaming score
Probability of retaining at each future billing cycle based on streaming S hours at N days of tenure
[Chart: retention vs. total hours consumed during N days of membership]
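
For illustration only, this is not the slide's actual implementation: one simple way to turn "hours streamed by day N of tenure" into a retention-probability score is to bucket hours and take the historical retention rate per bucket. The column names (hours_by_day_n, retained) are hypothetical.

import pandas as pd

def build_streaming_score_table(history: pd.DataFrame) -> pd.Series:
    """history: one row per past member, with
       hours_by_day_n - total hours streamed in the first N days of membership
       retained       - 1 if the member paid the next billing cycle, else 0"""
    buckets = pd.cut(history["hours_by_day_n"],
                     bins=[0, 1, 5, 15, 40, 100, float("inf")],
                     include_lowest=True)
    # Empirical probability of retaining, per streaming-hours bucket.
    return history.groupby(buckets)["retained"].mean()

def streaming_score(table: pd.Series, hours: float) -> float:
    """Score a current member by looking up their bucket's historical retention rate."""
    for interval, prob in table.items():
        if hours in interval:
            return prob
    return float("nan")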

9 Streaming measurement: KS visual & Mann-Whitney U test statistic
[Chart: KS test statistic comparing streaming distributions]
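
A minimal sketch of the two tests named on this slide, comparing the streaming-hours distributions of a test cell against control; the data here is simulated, not Netflix data.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control_hours = rng.gamma(shape=2.0, scale=10.0, size=5000)   # hypothetical control cell
test_hours = rng.gamma(shape=2.0, scale=10.5, size=5000)      # hypothetical test cell

# Two-sample Kolmogorov-Smirnov statistic (max distance between empirical CDFs).
ks_stat, ks_p = stats.ks_2samp(test_hours, control_hours)
# Mann-Whitney U test on the same two samples.
u_stat, u_p = stats.mannwhitneyu(test_hours, control_hours, alternative="two-sided")

print(f"KS statistic={ks_stat:.4f} (p={ks_p:.3g})")
print(f"Mann-Whitney U={u_stat:.0f} (p={u_p:.3g})")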

10 Streaming measurement: Thresholds with z-tests for proportions
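
For illustration, assuming the thresholded metric is "share of members streaming at least some number of hours": a two-proportion z-test comparing a test cell with control. The counts below are hypothetical.

import numpy as np
from scipy.stats import norm

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z-test; success_* = members above the threshold."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * norm.sf(abs(z))   # two-sided
    return z, p_value

z, p = two_proportion_z_test(success_a=4210, n_a=50000, success_b=4025, n_b=50000)
print(f"z={z:.2f}, p={p:.4f}")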

11 Much experimentation on the recommender system
- Row selection
- Video ranking
- Video-video similarity
- User-user similarity
- Search recommendations
- Popularity vs. personalization
- Diversity
- Novelty/Freshness
- Evidence

12 Sample and Subject Purity

13 Same test, different populations

14 Who should Netflix sample?
Geography
- Global
- US
- International
- Region-specific
Tenure
- 1 month (free trial)
- 2-6 months
- 7+ months
Classes of experience with Netflix
- Signups who are not rejoining members
- Rejoining members
- Existing members (any tenure)
- Existing members who are beyond their free trial
- Newly activating a device

15 Two considerations
1. For whom/what do you want to optimize?
2. Who will experience the winning test experience that gets launched?

16 “New members” by country region
[Chart: new members over time, by country region]

17 Membership by tenure
[Chart: membership over time, split into free trial, medium tenure, and longer tenure]

18 Hard to impact long-tenured members
[Chart: cancel rate for free-trial, medium-tenure, and long-tenure members]

19 Current favored samples in algorithm testing
- Global signups who are not rejoining within a year
- Secondarily:
  - US existing members who are beyond their free trial
  - International (non-US) existing members who are beyond their free trial

20 Addressing Sampling Bias
- Stratified sampling on attributes that are (see the sketch after this list):
  - Correlated with the core metric
  - Independent of the test treatment
- Regression tests for any systematic randomization process
- Bias monitoring for each test’s sample
- Large sample sizes
- Re-testing
- Good judgment to recognize that the “story” makes sense
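
A minimal sketch of stratified allocation, not the slide's actual procedure: within each stratum (attributes such as country and signup device, assumed here purely for illustration), members are shuffled and dealt round-robin into the test cells so every cell gets the same mix of strata.

import numpy as np
import pandas as pd

def stratified_allocation(members: pd.DataFrame, strata_cols, cells, seed=42) -> pd.Series:
    """Within each stratum, shuffle members and deal them round-robin into the
    test cells, so each cell receives a near-identical share of every stratum."""
    rng = np.random.default_rng(seed)
    assignment = pd.Series(index=members.index, dtype=object)
    for _, idx in members.groupby(strata_cols).groups.items():
        order = rng.permutation(len(idx))          # random position within the stratum
        for position, row_label in zip(order, idx):
            assignment.loc[row_label] = cells[position % len(cells)]
    return assignment

# Hypothetical usage:
# members["cell"] = stratified_allocation(members, ["country", "signup_device"],
#                                         ["control", "cell_1", "cell_2"])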

21 In the words of Nate Silver
On predicting the 2008 recession in a world of noisy data and dependent variables:
Not only was Hatzius’s forecast correct, but it was also right for the right reasons, explaining the causes of the collapse and anticipating the effects. Hatzius refers to this chain of cause and effect as a “story”… In contrast, if you just look at the economy as a series of variables and equations without any underlying structure, you are almost certain to mistake noise for a signal…
The Signal and the Noise: Why So Many Predictions Fail – but Some Don’t, by Nate Silver

22 Short- versus long-term engagement metrics

23 Short-term metrics we consider
- Daily cancel requests
- Daily streaming hours
- Daily visits
- Session length
- Failed sessions (no play)
- “Take rates” (CTR where the click is to play)
  - Page-level
  - Row-level
  - Title-level

24 Statistically significant differences in churn rarely stabilize until after Day 45
[Chart: churn difference significance vs. test duration]
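
For illustration only: one way to visualize when a churn difference stabilizes is to recompute the test-vs-control comparison at each day of test duration and find the first day from which it stays significant. The inputs (cumulative daily counts) are hypothetical, and repeatedly peeking like this inflates false positives, so this is a diagnostic view rather than a sequential testing procedure.

import numpy as np
from scipy.stats import norm

def first_stable_day(cancels_t, n_t, cancels_c, n_c, alpha=0.05):
    """All inputs are arrays of cumulative counts indexed by test day.
    Returns the first day (1-based) from which the churn difference stays
    significant through the end of the series, or None if it never does."""
    cancels_t, n_t = np.asarray(cancels_t, float), np.asarray(n_t, float)
    cancels_c, n_c = np.asarray(cancels_c, float), np.asarray(n_c, float)
    p_pool = (cancels_t + cancels_c) / (n_t + n_c)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_t + 1 / n_c))
    z = (cancels_t / n_t - cancels_c / n_c) / se
    significant = 2 * norm.sf(np.abs(z)) < alpha
    for day in range(len(significant)):
        if significant[day:].all():
            return day + 1
    return None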

25 Short-term metrics we consider
- Daily cancel requests
- Daily streaming hours
- Daily visits
- Session length
- Failed sessions (no play)
- “Take rates” (CTR where the click is to play)
  - Page-level
  - Row-level
  - Title-level

26 How well do your short-term metrics correlate with your OEC, and how much improvement do you see in that correlation if you increase the time interval?
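
A minimal sketch of this question, assuming hypothetical columns hours_7d, hours_30d, hours_60d (cumulative streaming hours over widening windows) and retained_4mo (0/1): compute the correlation of each window with the OEC and check whether it rises as the window lengthens.

import pandas as pd

def metric_oec_correlation(members: pd.DataFrame) -> pd.Series:
    windows = ["hours_7d", "hours_30d", "hours_60d"]
    # Pearson (point-biserial) correlation between each window's hours and 4-month retention.
    return pd.Series({w: members[w].corr(members["retained_4mo"]) for w in windows})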

27 Streaming signal that appears over time
[Charts at 1 week, 1 month, and 2 months]

28 Or disappears over time
[Charts at 1 week, 1 month, and 2 months]

29 Ability to predict 4-month retention using streaming hours improves with longer-term data
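
For illustration, not the slide's actual analysis: fit a logistic regression of 4-month retention on streaming hours observed over windows of increasing length and compare out-of-sample AUC; the slide's claim is that AUC should rise with longer windows. Column names are assumed.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def auc_by_window(members, windows=("hours_7d", "hours_30d", "hours_60d"),
                  target="retained_4mo"):
    results = {}
    for col in windows:
        X = members[[col]].to_numpy()
        y = members[target].to_numpy()
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=0)
        model = LogisticRegression().fit(X_train, y_train)
        # Held-out AUC for retention predicted from this observation window alone.
        results[col] = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return results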

30 Key Takeaways
- Exercise rigor in selecting the population to sample; make it representative of:
  - The population you want to optimize for
  - The population that will receive the experience if launched
- Remain open-minded about changing the target population as business shifts occur
- Address bias on an ongoing basis
- Know and apply the time duration necessary for your OEC to stabilize
- Additional short-term metrics need sufficient duration to correlate well with your OEC

