Content Recommendation on Y! sites. Deepak Agarwal, Stanford Info Seminar, 17th Feb 2012.




Similar presentations
Personalized Recommendation on Dynamic Content Using Predictive Bilinear Models Wei Chu, Seung-Taek Park WWW 2009 Audience Science Yahoo! Labs.

Google News Personalization: Scalable Online Collaborative Filtering
Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki
Temporal Query Log Profiling to Improve Web Search Ranking Alexander Kotov (UIUC) Pranam Kolari, Yi Chang (Yahoo!) Lei Duan (Microsoft)
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
Modelling Relevance and User Behaviour in Sponsored Search using Click-Data Adarsh Prasad, IIT Delhi Advisors: Dinesh Govindaraj SVN Vishwanathan* Group:
Optimizing search engines using clickthrough data
Large Scale Machine Learning for Content Recommendation and Computational Advertising Deepak Agarwal, Director, Machine Learning and Relevance Science,
1.Accuracy of Agree/Disagree relation classification. 2.Accuracy of user opinion prediction. 1.Task extraction performance on Bing web search log with.
- 1 - Intro to Content Optimization Yury Lifshits. Yahoo! Research Largely based on slides by Bee-Chung Chen, Deepak Agarwal & Pradheep Elango.
The Roles of Uncertainty and Randomness in Online Advertising Ragavendran Gopalakrishnan Eric Bax Raga Gopalakrishnan 2 nd Year Graduate Student (Computer.
Use of Kalman filters in time and frequency analysis John Davis 1st May 2011.
Catching the Drift: Learning Broad Matches from Clickthrough Data Sonal Gupta, Mikhail Bilenko, Matthew Richardson University of Texas at Austin, Microsoft.
Contextual Advertising by Combining Relevance with Click Feedback D. Chakrabarti D. Agarwal V. Josifovski.
1 Learning User Interaction Models for Predicting Web Search Result Preferences Eugene Agichtein Eric Brill Susan Dumais Robert Ragno Microsoft Research.
Chapter 3 Producing Data 1. During most of this semester we go about statistics as if we already have data to work with. This is okay, but a little misleading.
Bandits for Taxonomies: A Model-based Approach Sandeep Pandey Deepak Agarwal Deepayan Chakrabarti Vanja Josifovski.
Recommender Systems Aalap Kohojkar Yang Liu Zhan Shi March 31, 2008.
Evaluating Search Engine
Spring INTRODUCTION There exists a lot of methods used for identifying high risk locations or sites that experience more crashes than one would.
Mortal Multi-Armed Bandits Deepayan Chakrabarti, Yahoo! Research Ravi Kumar, Yahoo! Research Filip Radlinski, Microsoft Research Eli Upfal, Brown University.
1 Learning Entity Specific Models Stefan Niculescu Carnegie Mellon University November, 2003.
Collaborative Ordinal Regression Shipeng Yu Joint work with Kai Yu, Volker Tresp and Hans-Peter Kriegel University of Munich, Germany Siemens Corporate.
Data Mining CS 341, Spring 2007 Lecture 4: Data Mining Techniques (I)
Exploration & Exploitation in Adaptive Filtering Based on Bayesian Active Learning Yi Zhang, Jamie Callan Carnegie Mellon Univ. Wei Xu NEC Lab America.
A Search-based Method for Forecasting Ad Impression in Contextual Advertising Defense.
Walter Hop Web-shop Order Prediction Using Machine Learning Master’s Thesis Computational Economics.
B. RAMAMURTHY EAP#2: Data Mining, Statistical Analysis and Predictive Analytics for Automotive Domain CSE651C, B. Ramamurthy 1 6/28/2014.
Cao et al. ICML 2010 Presented by Danushka Bollegala.
Personalization in Local Search Personalization of Content Ranking in the Context of Local Search Philip O’Brien, Xiao Luo, Tony Abou-Assaleh, Weizheng.
Particle Filtering in Network Tomography
Fair Allocation with Succinct Representation Azarakhsh Malekian (NWU) Joint Work with Saeed Alaei, Ravi Kumar, Erik Vee UMDYahoo! Research.
ICML’11 Tutorial: Recommender Problems for Web Applications Deepak Agarwal and Bee-Chung Chen Yahoo! Research.
1 Information Filtering & Recommender Systems (Lecture for CS410 Text Info Systems) ChengXiang Zhai Department of Computer Science University of Illinois,
Adaptive News Access Daniel Billsus Presented by Chirayu Wongchokprasitti.
1 A Bayesian Method for Guessing the Extreme Values in a Data Set Mingxi Wu, Chris Jermaine University of Florida September 2007.
Understanding and Predicting Graded Search Satisfaction Tang Yuk Yu 1.
Improving Web Search Ranking by Incorporating User Behavior Information Eugene Agichtein Eric Brill Susan Dumais Microsoft Research.
Ramakrishnan Srikant Sugato Basu Ni Wang Daryl Pregibon 1.
Evaluation Methods and Challenges. 2 Deepak Agarwal & Bee-Chung ICML’11 Evaluation Methods Ideal method –Experimental Design: Run side-by-side.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
©2015 Apigee Corp. All Rights Reserved. Preserving signal in customer journeys Joy Thomas, Apigee Jagdish Chand, Visa.
ECE 8443 – Pattern Recognition ECE 8423 – Adaptive Signal Processing Objectives: Deterministic vs. Random Maximum A Posteriori Maximum Likelihood Minimum.
Chengjie Sun,Lei Lin, Yuan Chen, Bingquan Liu Harbin Institute of Technology School of Computer Science and Technology 1 19/11/ :09 PM.
- 1 - Recommender Problems for Content Optimization Deepak Agarwal Yahoo! Research MMDS, June 15 th, 2010 Stanford, CA.
Various topics Petter Mostad Overview Epidemiology Study types / data types Econometrics Time series data More about sampling –Estimation.
Implicit User Feedback Hongning Wang Explicit relevance feedback 2 Updated query Feedback Judgments: d 1 + d 2 - d 3 + … d k -... Query User judgment.
Greedy is not Enough: An Efficient Batch Mode Active Learning Algorithm Chen, Yi-wen( 陳憶文 ) Graduate Institute of Computer Science & Information Engineering.
Jun Li, Peng Zhang, Yanan Cao, Ping Liu, Li Guo Chinese Academy of Sciences State Grid Energy Institute, China Efficient Behavior Targeting Using SVM Ensemble.
BCS547 Neural Decoding.
Lecture 2: Statistical learning primer for biologists
Pairwise Preference Regression for Cold-start Recommendation Speaker: Yuanshuai Sun
Google News Personalization Big Data reading group November 12, 2007 Presented by Babu Pillai.
Personalization Services in CADAL Zhang yin Zhuang Yuting Wu Jiangqin College of Computer Science, Zhejiang University November 19,2006.
Identifying “Best Bet” Web Search Results by Mining Past User Behavior Author: Eugene Agichtein, Zijian Zheng (Microsoft Research) Source: KDD2006 Reporter:
1 Raghu Ramakrishnan Research Fellow Chief Scientist, Audience and Cloud Computing Yahoo! Purple Clouds:
Regression Based Latent Factor Models Deepak Agarwal Bee-Chung Chen Yahoo! Research KDD 2009, Paris 6/29/2009.
Parameter Estimation. Statistics Probability specified inferred Steam engine pump “prediction” “estimation”
Navigation Aided Retrieval Shashank Pandit & Christopher Olston Carnegie Mellon & Yahoo.
A Connectivity-Based Popularity Prediction Approach for Social Networks Huangmao Quan, Ana Milicic, Slobodan Vucetic, and Jie Wu Department of Computer.
Ch 1. Introduction Pattern Recognition and Machine Learning, C. M. Bishop, Updated by J.-H. Eom (2 nd round revision) Summarized by K.-I.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
Bayesian Optimization. Problem Formulation Goal  Discover the X that maximizes Y  Global optimization Active experimentation  We can choose which values.
A Case Study of Behavior-driven Conjoint Analysis on Yahoo
Adaptive, Personalized Diversity for Visual Discovery
Intro to Content Optimization
Bandits for Taxonomies: A Model-based Approach
Author: Kazunari Sugiyama, etc. (WWW2004)
Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.
Recommender Systems Copyright: Dietmar Jannah, Markus Zanker and Gerhard Friedrich (slides based on their IJCAI talk „Tutorial: Recommender Systems”)
Presentation transcript:

Content Recommendation on Y! sites. Deepak Agarwal, Stanford Info Seminar, 17th Feb 2012

–Recommend applications, recommend search queries, recommend news articles, recommend packages (image, title, summary, links to other Y! pages)
–Pick 4 out of a pool of K (K = 20 ~ 40, dynamic)
–Routes traffic to other pages

Objective
–Serve content items to users to maximize click-through rates (CTR)
–More clicks lead to more pageviews on the Yahoo! network
–We can also consider weighted versions of CTR, or multiple objectives (more on this later)

Rest of the talk
CTR estimation
–Serving the estimated most popular (EMP) item
–Personalization based on user features and past activities
Multi-objective optimization
–Recommendation to optimize multiple scores such as CTR, ad revenue, time spent, ...

Four years ago when we started...
–Editorial placement, no machine learning
–We built logistic regression based on user and item features: did not work
–Simple counting models: collect data every 5 minutes, count clicks and views. This worked, but with several nuances
(Figure: the Today module with its four article positions F1-F4)

The simple algorithm we began with
–Initialize the CTR of every new article to some high number: this ensures a new article has a chance of being shown
–Show the article with the highest estimated CTR (breaking ties at random) for each user visit in the next 5 minutes
–Re-compute the global article CTRs after 5 minutes and show the new most popular for the next 5 minutes
–Keep updating article popularity over time
Quite intuitive. It did not work! Performance was bad. Why?
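As a concrete picture of this baseline, here is a minimal sketch in Python. The optimistic prior counts, helper names, and data structures are assumptions for illustration, not the production code:

```python
import random
from collections import defaultdict

# Assumed optimistic initialization: pretend every new article already has
# 10 clicks out of 10 views, so its starting CTR estimate is 1.0.
clicks = defaultdict(lambda: 10)
views = defaultdict(lambda: 10)

def most_popular(article_pool):
    """Recompute cumulative CTRs (every 5 minutes) and return the current leader,
    breaking ties at random."""
    ctr = {a: clicks[a] / views[a] for a in article_pool}
    best = max(ctr.values())
    return random.choice([a for a, value in ctr.items() if value == best])

def record_event(article, clicked):
    """Accumulate the clicks and views observed while the leader was being served."""
    views[article] += 1
    clicks[article] += int(clicked)
```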

Bias in the data: article CTR decays over time
(Figure: a typical article CTR curve, decaying over time)
–We were computing CTR by accumulating clicks and views. Were we missing the decay dynamics?
–We fit a dynamic growth model using a Kalman filter: the new model tracked the decay very well, but performance was still bad
And the plot thickens, my dear Watson!

Explanation of the decay: repeat exposure. Repeated views of the same article → CTR decay.

Clues to solve the mystery
–Users seeing an article for the first time have a higher CTR; those already exposed have a lower one, yet we use the same CTR estimate for all of them
–Other sources of bias? How do we adjust for them?
A simple idea to remove bias
–Display articles at random to a small, randomly chosen population: call this the random bucket
–Randomization removes bias in the data (Charles Peirce, 1877; R. A. Fisher, 1935)

CTR of the same article with/without randomization
(Figure: CTR curves in the serving bucket and the random bucket, with decay and time-of-day effects annotated)

CTR of articles in the random bucket
–We now track unbiased CTR, but it is dynamic: simply counting clicks and views still won't work well.

New algorithm
–Create a small random bucket that selects one of the K existing articles uniformly at random for each user visit
–Learn unbiased article popularity from the random-bucket data by tracking it with a non-linear Kalman filter
–Serve the most popular article in the serving bucket
–Override rules: diversity, voice, ...
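A sketch of how the serving logic splits traffic; the bucket size and function names are assumptions, and only random-bucket events feed the popularity tracker:

```python
import random

RANDOM_BUCKET_FRACTION = 0.01  # assumed size of the random bucket

def serve_visit(article_pool, unbiased_ctr):
    """Assign one user visit to a bucket and choose the article to show.
    unbiased_ctr maps article -> current CTR estimate learned from random-bucket data."""
    if random.random() < RANDOM_BUCKET_FRACTION:
        # Random bucket: pick one of the K live articles uniformly at random.
        return random.choice(article_pool), "random"
    # Serving bucket: show the current most-popular article.
    return max(article_pool, key=lambda a: unbiased_ctr.get(a, 0.0)), "serving"
```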

Other advantages
–The random bucket ensures a continuous flow of data for all articles; we quickly discard bad articles and converge to the best one
–This saved the day and the project was a success: initial click-lift of 40% (Agarwal et al., NIPS 2008); after 3 years it is 200+% (fully deployed on the Yahoo! front page and elsewhere on Yahoo!), and we are still improving the system
–Improvements are due both to algorithms and to feedback to humans
–Solutions were "platformized" and rolled out to many Y! properties

Time-series model: Kalman filter
–Dynamic Gamma-Poisson: the click-rate evolves over time in a multiplicative fashion
–The estimated click-rate distribution at time t+1 is summarized by its prior mean and prior variance
–High-CTR items are more adaptive

Updating the parameters at time t+1
–Fit a Gamma distribution to match the prior mean and prior variance at time t
–Combine it with the Poisson likelihood of the clicks observed at time t to get the posterior mean and posterior variance at time t+1
–Combining a Poisson likelihood with a Gamma prior is easy (conjugate), which is why we fit a Gamma distribution to match the moments
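A minimal sketch of one tracking step, assuming a standard discount-factor formulation of the dynamic Gamma-Poisson model; the exact evolution equations in the paper may differ:

```python
def gamma_poisson_step(alpha, gamma, clicks, views, discount=0.95):
    """One 5-minute update of the dynamic Gamma-Poisson CTR tracker.

    State: CTR p ~ Gamma(alpha, gamma), i.e. mean alpha/gamma, variance alpha/gamma^2.
    Evolution (assumed discounting form): shrink both parameters, which keeps the
    prior mean but inflates the prior variance so the estimate can drift over time.
    Observation: clicks ~ Poisson(p * views) is conjugate with the Gamma prior."""
    alpha, gamma = discount * alpha, discount * gamma   # prior at t+1 (same mean, more variance)
    alpha, gamma = alpha + clicks, gamma + views        # posterior after the Poisson likelihood
    return alpha, gamma

# Example: start near CTR = 5% with an effective sample size of 1000 views.
alpha, gamma = 50.0, 1000.0
alpha, gamma = gamma_poisson_step(alpha, gamma, clicks=30, views=800)
print(alpha / gamma)  # updated CTR estimate
```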

More details
–Agarwal, Chen, Elango, Ramakrishnan, Motgi, Roy, Zachariah. Online Models for Content Optimization. NIPS 2008
–Agarwal, Chen, Elango. Spatio-Temporal Models for Estimating Click-through Rate. WWW 2009

Lessons learnt
–It is OK to start with simple models that learn a few things, but beware of the biases inherent in your data
–Example of things going wrong when learning article popularity: data collected from 5am-8am PST, model served from 10am-1pm PST; a bad idea if an article is popular on the East Coast but not on the West
–Randomization is a friend; use it when you can
–Update the models fast; this may reduce the bias, since user visit patterns close in time are similar
–Can we be more economical in our randomization?

Multi-armed bandits
–Consider a slot machine with two arms with unknown payoff probabilities p1 and p2 (say p1 > p2)
–The gambler has 1000 plays; what is the best way to experiment in order to maximize total expected reward?
–This is called the "bandit" problem and has been studied for a long time
–Optimal solution: play the arm that has the maximum potential of being good

Recommender problems: bandits?
–Two items: Item 1 CTR = 2/100; Item 2 CTR = 250/10000
–Greedy: show Item 2 to everyone; not a good idea, because Item 1's CTR estimate is noisy and Item 1 could potentially be better
–Invest some traffic in Item 1 for better overall performance on average
–This is also referred to as the explore/exploit problem: exploit what is known to be good, explore what is potentially good
(Figure: CTR posterior densities of Article 1 and Article 2; Article 1's is much wider)
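One way to see why the greedy choice is risky is to put posteriors on both CTRs and ask how often Item 1 is actually the better item. A small Monte Carlo sketch, assuming independent Beta posteriors with uniform priors:

```python
import random

def prob_item1_better(clicks1=2, views1=100, clicks2=250, views2=10000, draws=100_000):
    """Estimate P(true CTR of Item 1 > true CTR of Item 2) under
    Beta(1 + clicks, 1 + misses) posteriors for each item."""
    wins = 0
    for _ in range(draws):
        p1 = random.betavariate(1 + clicks1, 1 + views1 - clicks1)
        p2 = random.betavariate(1 + clicks2, 1 + views2 - clicks2)
        wins += p1 > p2
    return wins / draws

# The probability is far from negligible: Item 1's estimate is too noisy to write it off.
print(prob_item1_better())
```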

Bayes optimal solution for the next 5 minutes
–2 articles, 1 uncertain
–Uncertainty in CTR expressed via a pseudo number of views

More details on the Bayes optimal solution
–Agarwal, Chen, Elango. Explore-Exploit Schemes for Web Content Optimization. ICDM 2009 (Best Research Paper Award)

Recommender problems: bandits in a casino
–Items are the arms of bandits; ratings/CTRs are the unknown payoffs
–The goal is to converge quickly to the item with the best CTR
–But this assumes one size fits all (no personalization)
Personalization
–Each user is a separate bandit
–Hundreds of millions of bandits (a huge casino)
Rich literature (several tutorials on the topic)
–Clever/adaptive randomization
–Our random bucket is one solution (epsilon-greedy)
–For highly personalized settings, large content pools, or small traffic: UCB (mean + k * std) and Thompson sampling (a random draw from the posterior) are good practical solutions
–Many opportunities for novel research in this area
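A sketch of the two practical rules named above, for Bernoulli (click/no-click) payoffs; the exploration constant k and the Beta(1, 1) priors are assumptions:

```python
import math
import random

def ucb_pick(stats, k=2.0):
    """UCB: score each item by its empirical CTR plus k standard errors and pick the max.
    stats maps item -> (clicks, views)."""
    def score(item):
        clicks, views = stats[item]
        views = max(views, 1)
        mean = clicks / views
        std = math.sqrt(mean * (1.0 - mean) / views)
        return mean + k * std
    return max(stats, key=score)

def thompson_pick(stats):
    """Thompson sampling: draw a CTR from each item's Beta posterior, pick the largest draw."""
    def draw(item):
        clicks, views = stats[item]
        return random.betavariate(1 + clicks, 1 + views - clicks)
    return max(stats, key=draw)

stats = {"item1": (2, 100), "item2": (250, 10000)}
print(ucb_pick(stats), thompson_pick(stats))
```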

Personalization
–Recommend articles: image, title, summary, links to other pages
–For each user visit, pick 4 out of a pool of K
–Routes traffic to other pages

Data
–User i visits, with user features x_it (demographics, browse history, search history, ...)
–The algorithm selects article j, with item features x_j (keywords, content categories, ...)
–For the pair (i, j) we observe a response y_ij (a rating or click/no-click)

Types of user features
–Declared demographics and geo: we did not find them to be useful in the front-page application
–Browse behavior based on activity on the Y! network (x_it): previous visits to a property, searches, ad views, clicks, ...; this is useful for the front-page application
–Previous clicks on the module (u_it): extremely useful for heavy users; obtained via matrix factorization

Approach: online logistic regression with explore/exploit
–Build a per-item online logistic regression: for item j, the click probability is modeled as a logistic function of the user features, with item-specific coefficients estimated via online logistic regression
–Explore/exploit on top of this for personalized recommendation: epsilon-greedy and UCB perform well for the Y! front-page application
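A sketch of the per-item model and its online update (plain stochastic gradient descent; the learning rate, L2 penalty, and epsilon-greedy wrapper are assumptions):

```python
import numpy as np

def predict_ctr(w_j, x_it):
    """Predicted click probability of item j for a user with feature vector x_it."""
    return 1.0 / (1.0 + np.exp(-np.dot(w_j, x_it)))

def online_update(w_j, x_it, clicked, lr=0.05, l2=1e-4):
    """One online logistic-regression step for item j from a single (view, click) event."""
    grad = (predict_ctr(w_j, x_it) - clicked) * x_it + l2 * w_j
    return w_j - lr * grad

def epsilon_greedy(weights, x_it, epsilon=0.05):
    """Explore/exploit on top of the per-item models: usually serve the highest predicted CTR."""
    items = list(weights)
    if np.random.rand() < epsilon:
        return items[np.random.randint(len(items))]
    return max(items, key=lambda j: predict_ctr(weights[j], x_it))
```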

Bipartite graph completion problem
(Figure: an observed bipartite graph of users and articles with click/no-click edges, and the predicted CTR graph that completes it)

User profile to capture historical module behavior
–Each user i has a latent factor vector u_i and each item j has a latent factor vector v_j, together with user-popularity and item-popularity terms

Estimating granular latent factors via shrinkage
–If a user/item has high degree, good estimates of the factors are available from its own data; otherwise we need a back-off
–Shrinkage: we use user/item features through regressions, e.g. u_i = G x_i + correction, where G is a regression weight matrix and the user/item-specific correction term is learnt from data

Estimates with shrinkage
–For a new user/article, the factor estimates are based on features alone
–For an old user/article, the factor estimates are a linear combination of the feature regression and the user's own "ratings" (observed responses)
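A sketch of how the shrunk estimates are used at scoring time. The regression matrices G and D and the bilinear score follow the description on these slides; the dimensions, priors, and names are assumptions:

```python
import numpy as np

def user_factor(G, x_i, correction_i=None):
    """u_i = G @ x_i (+ user-specific correction): the regression on user features
    provides the estimate for cold users, and the correction learnt from the user's
    own clicks is added once enough data exists."""
    u_i = G @ x_i
    return u_i if correction_i is None else u_i + correction_i

def item_factor(D, x_j, correction_j=None):
    """v_j = D @ x_j (+ item-specific correction), the analogous shrinkage for items."""
    v_j = D @ x_j
    return v_j if correction_j is None else v_j + correction_j

def affinity(u_i, v_j):
    """Bilinear score u_i . v_j that feeds the click-probability model."""
    return float(u_i @ v_j)
```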

Estimating the regression function via EM
–Maximize the marginal likelihood of the data, with the latent factors integrated out
–The integral cannot be computed in closed form; it is approximated via Gibbs sampling (Monte Carlo EM)

Scaling to large data: Map-Reduce
–Randomly partition users in the map phase; run separate models in the reducers, one per partition
–Care is taken to initialize each partition's model with the same values, and constraints are put on the model parameters to keep the model identifiable in each partition
–Create ensembles by using different user partitions: the user-factor estimates across ensembles are uncorrelated, so averaging reduces variance

Data example
–1B events, 8M users, 6K articles
–Trained the factorization offline to produce the user feature u_i
–Baseline: online logistic regression without u_i
–Overall click lift: 9.7%; heavy users (> 10 clicks in the past): 26%; cold users (not seen in the past): 3%

Click-lift for heavy users

More details
–Agarwal and Chen. Regression-based Latent Factor Models. KDD 2009

MULTI-OBJECTIVES BEYOND CLICKS

Post-click utilities
–Clicks on front-page links influence the downstream supply distribution
(Figure: recommended and editorial content feed clicks to downstream pages, where the ad server sells premium display (guaranteed) and network plus (non-guaranteed) inventory; downstream engagement is measured, e.g. time spent)

Serving content on the front page: click shaping
–What do we want to optimize? The usual answer: maximize clicks (i.e., maximize downstream supply from the front page)
–But consider the following: Article 1 has CTR = 5% and utility per click = 5; Article 2 has CTR = 4.9% and utility per click = 10
–By promoting Article 2 we lose about 1 click per 1,000 visits but gain about 240 utils (50 clicks × 5 = 250 utils vs. 49 clicks × 10 = 490 utils)
–If we do this for a large number of visits, we lose some clicks but obtain significant gains in utility, e.g. lose 5% relative CTR, gain 20% in utility (revenue, engagement, etc.)

How are clicks being shaped?
(Figure: the supply distribution across properties changes before vs. after shaping)
–Shaping can happen with respect to multiple downstream metrics (engagement, revenue, ...)

Multi-objective optimization
–n articles A_1, ..., A_n; content across K properties (News, Finance, omg, ...); m user segments S_1, ..., S_m
–CTR of user segment i on article j: p_ij; time duration of i on j: d_ij
–p_ij and d_ij are known; the allocations x_ij are the decision variables

Multi-objective program → scalarization → linear program
–The multiple objectives (e.g., total clicks and total time spent) are scalarized into a single weighted objective, which together with the traffic-allocation constraints yields a linear program in the x_ij
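A sketch of the scalarized linear program using scipy; the weight alpha, the exact objective mix, and the constraint set are assumptions, and the real formulation is in the KDD 2011 paper:

```python
import numpy as np
from scipy.optimize import linprog

def click_shaping_plan(p, d, segment_visits, alpha=0.8):
    """Solve a scalarized click-shaping LP.
    p[i, j]: CTR of user segment i on article j
    d[i, j]: time spent per click of segment i on article j
    segment_visits[i]: visits from segment i; x[i, j]: visits of i allocated to j
    Objective: maximize alpha * total clicks + (1 - alpha) * total time spent,
    subject to each segment's visits being fully allocated and x >= 0."""
    m, n = p.shape
    c = -(alpha * p + (1.0 - alpha) * p * d).ravel()   # linprog minimizes, so negate
    A_eq = np.zeros((m, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1.0               # sum_j x[i, j] = segment_visits[i]
    result = linprog(c, A_eq=A_eq, b_eq=segment_visits, bounds=(0, None), method="highs")
    return result.x.reshape(m, n)

# Tiny example with one segment and the two articles from the click-shaping slide.
plan = click_shaping_plan(np.array([[0.05, 0.049]]), np.array([[5.0, 10.0]]), np.array([100.0]))
```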

Pareto-optimal solution (more in KDD 2011)

Other constraints and variations
–We also want to ensure that major properties do not lose too many clicks even if overall performance is better
–Put additional constraints in the linear program

More details
–Agarwal, Chen, Elango, Wang. Click Shaping to Optimize Multiple Objectives. KDD 2011

Can we do it with advertising revenue?
–Yes, but we need to be careful
–Interventions can cause undesirable long-term impact
–It requires communication between two complex distributed systems
–Display advertising at Y! is also sold as long-term guaranteed contracts: we intervene to change supply when a contract is at risk of under-delivering
–Research to be shared in the future

Summary
–Simple models that learn a few parameters are fine to begin with, BUT beware of bias in the data: small amounts of randomization plus fast model updates help
–Clever randomization using explore/exploit techniques
–Granular models are more effective and personalized: using previous module activity is particularly good for heavy users
–Considering multi-objective optimization is often important

Information discovery: content recommendation versus search
Search
–The user generally has an objective in mind (strong intent), e.g. booking a ticket to San Diego
–Recall is very important to finish the task; retrieving documents relevant to the query matters
Other ways of information discovery
–The user wants to be informed about important news, or to learn about the latest in pop music
–Intent is weak
–A good user experience depends on the quality of the recommendations

Other examples: stronger context

Fundamental issue: a goodness score
–Develop a score S(user, item, context): the goodness of an item for a user in a given context
One option (mimic search)
–Treat (user, context) as the query and the item as the document; rank items from a content pool using a relevance measure, e.g. a bag of words based on the user's topical interests and a bag of words for the item based on landing-page characteristics and other meta-data
–For content recommendation the query is complex; we want a better and more direct measure of user experience (relevance)

CTR as the goodness score
–Scoring items based on the click-rates (CTR) of item links is a better surrogate for user satisfaction
–CTR can be enhanced by incorporating other aspects that measure the value of a click: How much advertising revenue does the publisher obtain? How much time did the user spend reading the article? What are the chances of the user sharing the article?

Ranking items
–Given a CTR estimation strategy, how do we rank items?
Constraints for a good long-term user experience
–Editorial oversight: editors/journalists select items and sources that are of high quality
–Voice/brand: the typical content associated with a site
–Some degree of relevance: do not show Hollywood celebrity gossip alongside a serious news article
–Degree of personalization: typical user interest, session activity
Approach: recommend items to maximize CTR, subject to these constraints

Current research: the 3M approach
Multi-context
–User interaction data from multiple contexts (front page, My Yahoo!, Search, Y! News, ...); how to combine them? (KDD 2011)
Multi-response
–Several signals (clicks, shares, tweets, comments, like/dislike); how to predict all of them while exploiting their correlations? Paper under preparation
Multi-objective
–Short-term objectives (proxies) to optimize that achieve long-term goals (this is not exactly mainstream machine learning, but it is an important consideration)

Whole-page optimization
–Several modules on one page, each with its own content pool: Today module (4 slots, pool K1), News (8 slots, pool K2), Trending (10 slots, pool K3)
–User covariate vector x_it includes declared and inferred attributes (e.g. Age = old, Finance = T, Sports = F)
–Goal: display content to maximize CTR over a long time horizon

Collaborators
–Bee-Chung Chen (Yahoo! Research, CA)
–Liang Zhang (Yahoo! Labs, CA)
–Raghu Ramakrishnan (Yahoo! Fellow and VP)
–Xuanhui Wang (Yahoo! Labs)
–Rajiv Khanna (Yahoo! Labs, India)
–Pradheep Elango (Yahoo! Labs, CA)
–Engineering & Product teams (CA)

Thank you!

Bayesian scheme: 2 intervals, 2 articles
–Only 2 intervals left, with #visits N_0 and N_1
–Article 1: prior CTR p_0 ~ Gamma(α, γ)
–Article 2: CTRs q_0 and q_1 are known, Var(q_0) = Var(q_1) = 0
–Assume E(p_0) < q_0 (else the solution is trivial)
–Design parameter: x, the fraction of interval-0 visits allocated to Article 1
–Let c | p_0 ~ Poisson(p_0 x N_0) be the clicks on Article 1 in interval 0; the prior is updated to the posterior Gamma(α + c, γ + x N_0)
–In the second interval, allocate all visits to the better article, i.e. to Article 1 iff its posterior mean E[p_1 | c, x] > q_1

Optimization
–Expected total number of clicks = E[#clicks if we always show the certain item] + Gain(x, q_0, q_1), where Gain(x, q_0, q_1) is the gain from experimentation
–x_opt = argmax_x Gain(x, q_0, q_1)
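A Monte Carlo sketch of the gain curve, assuming the uncertain article's CTR stays the same across the two intervals; a closed-form expression also exists, as in the ICDM 2009 paper:

```python
import numpy as np

def expected_gain(x, alpha, gamma, q0, q1, N0, N1, sims=20000, seed=0):
    """Gain(x, q0, q1): extra expected clicks from giving a fraction x of interval-0
    visits to the uncertain article (prior p0 ~ Gamma(alpha, gamma)), then serving
    whichever article has the higher mean in interval 1.
    Baseline: always serve the certain article (CTR q0, then q1)."""
    rng = np.random.default_rng(seed)
    p0 = rng.gamma(alpha, 1.0 / gamma, size=sims)        # true CTRs of the uncertain article
    c = rng.poisson(p0 * x * N0)                          # clicks it collects in interval 0
    post_mean = (alpha + c) / (gamma + x * N0)            # Gamma-Poisson posterior mean
    interval0 = x * N0 * p0 + (1.0 - x) * N0 * q0
    interval1 = N1 * np.where(post_mean > q1, p0, q1)     # serve the article with the higher mean
    baseline = N0 * q0 + N1 * q1
    return float(np.mean(interval0 + interval1)) - baseline

x_grid = np.linspace(0.0, 1.0, 21)
gains = [expected_gain(x, alpha=1.0, gamma=50.0, q0=0.03, q1=0.03, N0=10_000, N1=10_000) for x in x_grid]
x_opt = x_grid[int(np.argmax(gains))]                     # x_opt = argmax_x Gain(x, q0, q1)
```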

Generalization to K articles
–Objective function optimized via Lagrange relaxation (Whittle)

Test on live traffic
–15% explore (samples to find the best article); 85% serve the "estimated" best (which risks false convergence)