The Quest for the Optimal Experiment
RecSys, 10-06-14
‘Science & Algorithms’ at Netflix (causation vs. correlation)
- Experimentation: science, methodology, and statistical analysis of experiments
- Algorithm R&D: mathematical algorithms that get embedded into automated processes, such as our recommendation system
- Predictive models: standalone mathematical models to support decision making (e.g., title demand prediction)
Numbers shown in this presentation are not representative of Netflix’s overall metric values.
Netflix Experimentation: Common
- The “product” is a set of controlled, randomized experiments, many running at once
- Experiment in all areas
- Plenty of rigor and attention around statistics, metrics, and analysis
Netflix Experimentation: Distinctive
- Core to culture (not just process)
- Curated approach
- Decisions not automated
- Scrutiny of each test (and by many people)
- Paying customers who are always logged in
- Monthly subscription
- Tests last several months
- Sampling (test allocation) of new members can take weeks or even months
- Many devices
Retention is our core metric (OEC: Overall Evaluation Criterion)
Goal: continually improve member enjoyment
Streaming hours is our main engagement metric.
Streaming measurement: streaming score
- Retention: probability of retaining at each future billing cycle, given S hours streamed at N days of tenure
- Streaming: total hours consumed during N days of membership
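As an illustration only (not Netflix's actual scoring), the minimal sketch below buckets members of a historical cohort by hours streamed in their first N days and reports the empirical retention rate at the next billing cycle per bucket. The input arrays and the synthetic example data are hypothetical.

```python
import numpy as np

# Minimal sketch: estimate the probability of retaining at the next billing
# cycle as a function of total hours streamed in the first N days of membership.
# `hours_first_n_days` and `retained_next_cycle` are hypothetical arrays,
# one entry per member in a historical cohort.

def retention_by_streaming(hours_first_n_days, retained_next_cycle, n_buckets=10):
    """Bucket members by streamed hours; return bucket edges and retention rate per bucket."""
    hours = np.asarray(hours_first_n_days, dtype=float)
    retained = np.asarray(retained_next_cycle, dtype=float)

    # Quantile-based edges so each bucket holds a similar number of members.
    edges = np.quantile(hours, np.linspace(0.0, 1.0, n_buckets + 1))
    edges = np.unique(edges)  # collapse duplicate edges (e.g., many zero-hour members)
    bucket = np.clip(np.searchsorted(edges, hours, side="right") - 1, 0, len(edges) - 2)

    rates = np.array([retained[bucket == b].mean() for b in range(len(edges) - 1)])
    return edges, rates

# Example with synthetic data: more streaming loosely implies higher retention.
rng = np.random.default_rng(0)
hours = rng.gamma(shape=2.0, scale=5.0, size=10_000)
retained = rng.random(10_000) < (0.5 + 0.4 * (1 - np.exp(-hours / 10)))
edges, rates = retention_by_streaming(hours, retained)
print(rates)  # retention probability per streaming-hours bucket
```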
Streaming measurement: KS visual and Mann-Whitney U test statistic
[Figure: distributions of streaming hours for test vs. control, with the KS test statistic annotated]
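A minimal sketch of the two tests named on the slide, using SciPy's `ks_2samp` and `mannwhitneyu` on per-member streaming hours for a control and a test cell; the data below are synthetic stand-ins, not Netflix values.

```python
import numpy as np
from scipy import stats

# Synthetic per-member streaming hours for a control and a test cell.
rng = np.random.default_rng(1)
control_hours = rng.gamma(shape=2.0, scale=5.0, size=5_000)
test_hours = rng.gamma(shape=2.0, scale=5.5, size=5_000)  # slightly heavier streaming

# Kolmogorov-Smirnov: largest gap between the two empirical CDFs
# (the quantity usually highlighted in a "KS visual").
ks_stat, ks_p = stats.ks_2samp(control_hours, test_hours)

# Mann-Whitney U: rank-based test that one distribution is stochastically larger.
u_stat, u_p = stats.mannwhitneyu(control_hours, test_hours, alternative="two-sided")

print(f"KS statistic = {ks_stat:.3f} (p = {ks_p:.2g})")
print(f"Mann-Whitney U = {u_stat:.0f} (p = {u_p:.2g})")
```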
Streaming measurement: thresholds with z-tests for proportions
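A minimal sketch of a two-sample z-test for proportions. It assumes, as one plausible reading of “thresholds,” that the proportion being compared is the fraction of members who stream more than some threshold of hours; the counts and threshold are made up.

```python
import numpy as np
from scipy.stats import norm

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Pooled two-sided z-test for H0: p_a == p_b; returns (z, p_value)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return z, 2 * norm.sf(abs(z))

# Hypothetical example: members streaming more than, say, 5 hours in their first month.
z, p = two_proportion_ztest(success_a=2_150, n_a=5_000,   # control cell
                            success_b=2_310, n_b=5_000)   # test cell
print(f"z = {z:.2f}, p = {p:.3g}")
```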
Much experimentation on the recommender system:
- Row selection
- Video ranking
- Video-video similarity
- User-user similarity
- Search recommendations
- Popularity vs. personalization
- Diversity
- Novelty/freshness
- Evidence
Sample and Subject Purity
Same test, different populations
Who should Netflix sample?
- Geography
  - Global
  - US
  - International
  - Region-specific
- Tenure
  - 1 month (free trial)
  - 2-6 months
  - 7+ months
- Classes of experience with Netflix
  - Signups who are not rejoining members
  - Rejoining members
  - Existing members (any tenure)
  - Existing members who are beyond their free trial
  - Members newly activating a device
Two considerations:
1. For whom/what do you want to optimize?
2. Who will receive the winning experience once it is launched?
[Figure: “New members” by country region over time]
[Figure: membership by tenure over time, split into free trial, medium tenure, and longer tenure]
Hard to impact long-tenured members
[Figure: cancel rate by tenure group (free trial, medium tenure, long tenure)]
Current favored samples in algorithm testing:
- Global signups who are not rejoining within a year
- Secondarily:
  - US existing members who are beyond their free trial
  - International (non-US) existing members who are beyond their free trial
Addressing Sampling Bias
- Stratified sampling on attributes that are correlated with the core metric and independent of the test treatment (see the sketch after this list)
- Regression tests for any systematic randomization process
- Bias monitoring for each test’s sample
- Large sample sizes
- Re-testing
- Good judgment to recognize that the “story” makes sense
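For illustration only, here is a minimal sketch of proportionate stratified allocation into test cells, plus a simple bias check. The `members` table and the `country` stratum are hypothetical; this is not Netflix's allocation system, just one way to realize the idea above.

```python
import numpy as np
import pandas as pd

# Minimal sketch of proportionate stratified allocation into test cells.
# In practice the strata would be attributes correlated with the core metric
# (retention) and independent of the treatment.

def stratified_allocate(members: pd.DataFrame, stratum: str, n_cells: int,
                        seed: int = 0) -> pd.Series:
    """Assign each member to a cell so every stratum is split evenly across cells."""
    rng = np.random.default_rng(seed)
    cell = np.empty(len(members), dtype=int)
    for positions in members.groupby(stratum).indices.values():
        positions = rng.permutation(positions)                  # shuffle within the stratum
        cell[positions] = np.arange(len(positions)) % n_cells   # round-robin over cells
    return pd.Series(cell, index=members.index, name="cell")

members = pd.DataFrame({
    "member_id": range(12),
    "country": ["US", "US", "BR", "BR", "GB", "GB", "US", "BR", "GB", "US", "BR", "GB"],
})
members["cell"] = stratified_allocate(members, stratum="country", n_cells=2)

# Bias monitoring: the stratum mix should look roughly the same in every cell.
print(members.groupby(["cell", "country"]).size().unstack(fill_value=0))
```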
In the words of Nate Silver
On predicting the 2008 recession in a world of noisy data and dependent variables:
“Not only was Hatzius’s forecast correct, but it was also right for the right reasons, explaining the causes of the collapse and anticipating the effects. Hatzius refers to this chain of cause and effect as a ‘story’… In contrast, if you just look at the economy as a series of variables and equations without any underlying structure, you are almost certain to mistake noise for a signal…”
From The Signal and the Noise: Why So Many Predictions Fail – but Some Don’t, by Nate Silver
Short- versus long-term engagement metrics
Short-term metrics we consider:
- Daily cancel requests
- Daily streaming hours
- Daily visits
- Session length
- Failed sessions (no play)
- “Take rates” (CTR where the click is a play), at the page, row, and title level
Statistically significant differences in churn rarely stabilize until after day 45 of a test.
[Figure: churn significance versus test duration]
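To make the stabilization idea concrete, here is a hedged sketch that applies a cumulative two-proportion z-test to synthetic cancel data day by day and reports when the p-value first stays below 0.05. The data, horizon, and 0.05 cutoff are illustrative assumptions, not the deck's actual analysis.

```python
import numpy as np
from scipy.stats import norm

# `cancel_day_control` / `cancel_day_test` are hypothetical arrays giving each
# member's cancel day, or np.inf if the member never cancelled during the test.

def daily_churn_pvalues(cancel_day_control, cancel_day_test, horizon_days=90):
    """Two-sided p-value of the cumulative churn difference for each test day."""
    n_c, n_t = len(cancel_day_control), len(cancel_day_test)
    pvalues = []
    for day in range(1, horizon_days + 1):
        x_c = np.sum(cancel_day_control <= day)   # cumulative cancels so far
        x_t = np.sum(cancel_day_test <= day)
        p_pool = (x_c + x_t) / (n_c + n_t)
        se = np.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
        z = (x_t / n_t - x_c / n_c) / se if se > 0 else 0.0
        pvalues.append(2 * norm.sf(abs(z)))
    return np.array(pvalues)

# Synthetic example: the test cell cancels slightly less often than control.
rng = np.random.default_rng(2)
control = np.where(rng.random(20_000) < 0.10, rng.integers(1, 91, 20_000), np.inf)
test = np.where(rng.random(20_000) < 0.09, rng.integers(1, 91, 20_000), np.inf)
p = daily_churn_pvalues(control, test)
first_stable = next((d + 1 for d in range(len(p)) if (p[d:] < 0.05).all()), None)
print("p-value at day 30:", round(p[29], 3), "| first day p stays < 0.05:", first_stable)
```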
How well do your short-term metrics correlate with your OEC, and how much does that correlation improve as you lengthen the measurement window?
A streaming signal that appears over time
[Figure: test vs. control streaming comparison at 1 week, 1 month, and 2 months]
…or disappears over time
[Figure: test vs. control streaming comparison at 1 week, 1 month, and 2 months]
The ability to predict 4-month retention from streaming hours improves with longer-term data.
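A minimal sketch of this idea on synthetic data: correlate cumulative streaming hours over windows of increasing length (1 week, 1 month, 2 months) with a 4-month retention label. The data-generating process and the use of a simple Pearson correlation are assumptions for illustration.

```python
import numpy as np

# Hypothetical inputs: `daily_hours` is a (members x days) matrix of per-day
# streaming; `retained_4mo` is a 0/1 retention label at four months.
rng = np.random.default_rng(3)
n_members, n_days = 5_000, 60
daily_hours = rng.gamma(shape=0.5, scale=1.0, size=(n_members, n_days))
# Synthetic labels: retention loosely driven by long-run streaming plus noise.
retained_4mo = (daily_hours.mean(axis=1) + rng.normal(0, 0.3, n_members)) > 0.45

for window in (7, 30, 60):  # 1 week, 1 month, 2 months
    hours_so_far = daily_hours[:, :window].sum(axis=1)
    r = np.corrcoef(hours_so_far, retained_4mo.astype(float))[0, 1]
    print(f"window = {window:2d} days: corr(hours, 4-month retention) = {r:.3f}")
```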
Key Takeaways
- Exercise rigor in selecting the population to sample; make it representative of:
  - The population you want to optimize for
  - The population that will receive the experience if launched
- Remain open-minded about changing the target population as business shifts occur
- Address bias on an ongoing basis
- Know and apply the time duration necessary for your OEC to stabilize
- Additional short-term metrics need sufficient duration to correlate well with your OEC