Collaborative Ordinal Regression
  Shipeng Yu
  Joint work with Kai Yu, Volker Tresp and Hans-Peter Kriegel
  University of Munich, Germany
  Siemens Corporate Technology
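2 Motivations
  [Figure: ratings on a scale from "Very Dislike" to "Very Like" for the movies Superman, The Pianist, Star Wars, The Matrix, The Godfather and American Beauty, with an unknown rating ("?"). Ordinal Regression maps movie features (genre, directors, actors, descriptions, …) to ratings.]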
3 Motivations (Cont.)
  [Figure: the same movies and features, now with ratings that contain many unknown entries ("?"). Collaborative Ordinal Regression maps features to these ratings.]
4 Outline
  Motivations
  Ranking Problem
  Bayesian Framework for Ordinal Regression
  Collaborative Ordinal Regression
  Learning and Inference
  Experiments
  Conclusion and Extensions
5 Ranking Problem
  Goal: Assign ranks to objects $x_1, x_2, \ldots, x_n \in R^d$
  Different from classification/regression problems
    Binary classification: Has only 2 labels
    Multi-class classification: Ignores the ordering property
    Regression: Only deals with real outputs
  Ordinal Regression: each object receives one of r ordered labels, $1 < 2 < \cdots < r$ (lowest rank to highest rank)
          1   2   …   r
    x_1   ×   ✓   …   ×
    x_2   ✓   ×   …   ×
    x_n   ×   ×   …   ✓
  Preference Learning: the objects themselves are ordered, e.g. $x_n \succ \cdots \succ x_1 \succ x_2 \succ \cdots$
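6 Ordinal Regression
  Goal: Assign ordered labels to objects
  Applications
    User preference prediction
    Web ranking for search engines
    …
  [Figure: a function f maps the inputs $x_1, x_2, \ldots, x_n \in R^d$ to values $f(x)$ on a line marked with thresholds $b_0, b_1, b_2, \ldots, b_{r-1}, b_r$; the resulting labels $y_1, y_2, \ldots, y_n$ take values in $1, 2, \ldots, r$.]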
7 One-task vs Multi-task
  Multi-task ranking is common in real-world problems
    Collaborative filtering: preference learning for multiple users
    Web ranking: ranking of web pages for different queries
  Question: How to learn related ranking tasks jointly?
    Different ranking functions $f_1, f_2, \ldots, f_m$ are correlated
    Each function ranks only part of the data
  [Figure: inputs $x_1, x_2, \ldots, x_n \in R^d$ mapped by the functions $f_1(x), f_2(x), \ldots, f_m(x)$.]
9 Bayesian Ordinal Regression
  Conditional model on ranking outputs
    $P(y \mid X, f, \theta) = P(y \mid f(x_1), \ldots, f(x_n), \theta) = P(y \mid f, \theta)$
  Ranking likelihood (ordinal regression likelihood): Conditional on the latent function
    $P(y \mid f, \theta) = \prod_i P(y_i \mid f(x_i), \theta)$
  Prior: Gaussian Process prior for the latent function
    $f \sim \mathcal{N}(f; h, K)$
  Marginal ranking likelihood: Integrate out the latent function values
    $P(y \mid X, \theta, h, K) = \int P(y \mid f, \theta)\, P(f \mid h, K)\, df$
10 Bayesian Ordinal Regression (1)
  Need to define the ranking likelihood $P(y_i \mid f(x_i), \theta)$
  Example Model (1): GP Regression (GPR)
    Assume a Gaussian form
      $P(y_i \mid f(x_i), \theta) \propto \mathcal{N}(y_i; f(x_i), \sigma^2)$
    Regression on the ranking label directly
11 Bayesian Ordinal Regression (2)
  Need to define the ranking likelihood $P(y_i \mid f(x_i), \theta)$
  Example Model (2): GP Ordinal Regression (GPOR) (Chu & Ghahramani, 2005)
    A probit ranking likelihood
      $P(y_i \mid f(x_i), \theta) = \Phi\left(\frac{b_{y_i} - f(x_i)}{\sigma}\right) - \Phi\left(\frac{b_{y_i - 1} - f(x_i)}{\sigma}\right)$
    Assign labels based on the surrounding area
  [Figure: the thresholds $b_1, b_2, b_3, \ldots$]
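The probit ranking likelihood above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the example thresholds, and the convention $b_0 = -\infty$, $b_r = +\infty$ (which makes the r label probabilities sum to one) are assumptions.

```python
import math


def std_normal_cdf(z):
    """Standard normal CDF Phi(z), computed with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_rank_likelihood(y, f_x, thresholds, sigma=1.0):
    """P(y | f(x)) = Phi((b_y - f(x)) / sigma) - Phi((b_{y-1} - f(x)) / sigma).

    thresholds is [b_0, b_1, ..., b_r]; labels y run from 1 to r.
    Assumption: b_0 = -inf and b_r = +inf.
    """
    upper = (thresholds[y] - f_x) / sigma
    lower = (thresholds[y - 1] - f_x) / sigma
    return std_normal_cdf(upper) - std_normal_cdf(lower)


# Hypothetical thresholds for r = 4 ordered labels.
THRESHOLDS = [-math.inf, -1.0, 0.0, 1.0, math.inf]

# Label probabilities for a latent value f(x) = 0.3, which lies between b_2 and b_3.
probs = [probit_rank_likelihood(y, 0.3, THRESHOLDS) for y in range(1, 5)]
```

With these thresholds the most probable label is 3, the interval that contains the latent value; a larger sigma spreads the probability mass over neighbouring labels.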
13 Multi-task Setting?
  Naïve approach 1: Learn a GP model for each task
    No sharing of information between tasks
  Naïve approach 2: Fit one parametric kernel jointly
    The parametric kernel is too restrictive to fit all tasks
  The collaborative effects
    Common preferences: Functions share similar regression labels on some items
    Similar variabilities: Functions tend to have the same predictability on similar items
14 Collaborative Ordinal Regression
  Hierarchical GP model for multi-task ordinal regression
    Mean function: models common preferences
    Covariance matrix: models similar variabilities
  Both the mean function and the (non-stationary) covariance matrix are learned from data
  [Figure: the GP prior $(h, K)$ generates the latent functions $f_1, f_2, \ldots, f_m$, which generate the rank labels $y_1, y_2, \ldots, y_m$ through the ordinal regression likelihood. Example labels, with some entries unobserved:
          y_1  y_2  …  y_m
    x_1    2    3   …   5
    x_2    1    1   …
    x_n    3    4   …   5 ]
15 COR: The Model
  Hierarchical Bayes model on functions
    All the latent functions are sampled from the same GP prior
      $f_j \sim \mathcal{N}(f_j; h, K)$
    Allow different parameter settings for different tasks
      $P(D \mid X, \Theta, h, K) = \prod_{j=1}^{m} P(y_j \mid X, \theta_j, h, K) = \prod_{j=1}^{m} \int P(y_j \mid f_j, \theta_j)\, P(f_j \mid h, K)\, df_j$
    We may only observe part of the rank labels for each function
16 COR: The Key Points
  The GP prior connects all ordinal regression tasks
    Model the first and second sufficient statistics
  The lower-level features are incorporated naturally
    More general than pure collaborative filtering
  We don’t fix a parametric form for the kernel
    Instead we assign the conjugate prior
      $P(h, K) = \mathcal{N}(h; h_0, \frac{1}{\pi} K)\, \mathcal{IW}(K; \tau, K_0)$
  We can make predictions for new input data and new tasks
17 Toy Problem (GPR Model)
  [Figure panels: Mean function; Covariance matrix; Mean rank labels; New task prediction with base kernel (RBF); New task prediction with learned kernel]
19 Learning
  Variational lower bound
    $\log P(D \mid X, \Theta, h, K) \ge \sum_{j=1}^{m} \int Q(f_j) \log \frac{P(y_j \mid f_j, \theta_j)\, P(f_j \mid h, K)}{Q(f_j)}\, df_j$
  EM Learning
    E-step: Approximate each posterior $Q(f_j)$ as a Gaussian, $Q(f_j) = \mathcal{N}(f_j; \hat{f}_j, \hat{K}_j)$
      Estimate the mean vector and covariance matrix using EP
    M-step: Fix $Q(f_j)$ and maximize w.r.t. $(h, K)$ and $\theta_j$
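The lower bound on this slide follows from Jensen's inequality applied to each task's marginal likelihood. A short derivation in the slide's notation, for any density $Q(f_j)$:

```latex
\begin{align*}
\log P(y_j \mid X, \theta_j, h, K)
  &= \log \int Q(f_j)\,
     \frac{P(y_j \mid f_j, \theta_j)\, P(f_j \mid h, K)}{Q(f_j)}\, df_j \\
  &\ge \int Q(f_j)\,
     \log \frac{P(y_j \mid f_j, \theta_j)\, P(f_j \mid h, K)}{Q(f_j)}\, df_j .
\end{align*}
```

Summing over the tasks $j = 1, \ldots, m$ gives the bound on $\log P(D \mid X, \Theta, h, K)$. The inequality is tight when $Q(f_j)$ equals the true posterior of $f_j$, which is why the E-step approximates that posterior.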
20 E-step
  The true posterior distribution factorizes:
    $Q(f) \propto \prod_i P(y_i \mid f(x_i), \theta)\, P(f \mid h, K)$
    Approximate each likelihood term with a Gaussian factor $t_k(X)$
  EP procedures
    Deletion: Delete the factor $t_k(X)$ from the approximated Gaussian
    Moment matching: Match moments by adding the true likelihood
    Update: Update the factor $t_k(X)$
  Can be done analytically for the example models
    For the GPR model the EP step is exact
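For the GPR likelihood the posterior is itself Gaussian, so the E-step has a closed form. The sketch below uses the standard Gaussian conjugacy result rather than anything stated on the slide; the function name, the `observed` mask for partially observed labels, and the example numbers are assumptions.

```python
import numpy as np


def gpr_posterior(y, observed, h, K, sigma2):
    """Posterior N(f_hat, K_hat) of one task's latent values under the GPR likelihood.

    y        : length-n array of rank labels (unobserved entries are ignored)
    observed : length-n boolean mask, True where a label was observed
    h, K     : GP prior mean (n,) and covariance (n, n)
    sigma2   : Gaussian noise variance of the likelihood
    """
    K_inv = np.linalg.inv(K)
    obs = observed.astype(float)
    # Posterior precision = prior precision + one 1/sigma2 term per observed label.
    K_hat = np.linalg.inv(K_inv + np.diag(obs / sigma2))
    y_masked = np.where(observed, y, 0.0)
    f_hat = K_hat @ (K_inv @ h + obs * y_masked / sigma2)
    return f_hat, K_hat


# Hypothetical example: three items, the second one unrated by this user.
K = np.array([[1.0, 0.5, 0.2],
              [0.5, 1.0, 0.5],
              [0.2, 0.5, 1.0]])
h = np.array([3.0, 3.0, 3.0])
y = np.array([5.0, 0.0, 1.0])
observed = np.array([True, False, True])
f_hat, K_hat = gpr_posterior(y, observed, h, K, sigma2=0.25)
```

The unrated item still receives a posterior mean and variance, because the prior covariance ties it to the rated items.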
21 M-step
  Update GP prior:
    $\hat{h} = \frac{1}{\pi + m} \left( \sum_{j=1}^{m} \hat{f}_j + \pi h_0 \right)$
    $\hat{K} = \frac{1}{\tau + m} \left( \pi (\hat{h} - h_0)(\hat{h} - h_0)^\top + \tau K_0 + \sum_{j=1}^{m} \left[ (\hat{f}_j - \hat{h})(\hat{f}_j - \hat{h})^\top + \hat{K}_j \right] \right)$
    Does not depend on the form of the ranking likelihood
    The conjugate prior corresponds to a smooth term
  Update likelihood parameter $\theta_j$
    Do it separately for each task
    Has the same update equation as the single-task case
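The two GP prior updates on this slide translate directly into NumPy. This is a minimal sketch: the function name, argument names and example values are hypothetical, and `pi` and `tau` denote the prior hyperparameters $\pi$ and $\tau$ (not the constant 3.14…).

```python
import numpy as np


def update_gp_prior(f_hats, K_hats, h0, K0, pi, tau):
    """M-step update of the shared GP prior (h, K).

    f_hats : list of m posterior mean vectors, each of shape (n,)
    K_hats : list of m posterior covariance matrices, each of shape (n, n)
    h0, K0 : prior mean (n,) and prior scale matrix (n, n)
    pi, tau: hyperparameters of the conjugate prior on (h, K)
    """
    m = len(f_hats)
    h_hat = (np.sum(f_hats, axis=0) + pi * h0) / (pi + m)

    d0 = h_hat - h0
    scatter = pi * np.outer(d0, d0) + tau * K0
    for f_j, K_j in zip(f_hats, K_hats):
        d = f_j - h_hat
        scatter = scatter + np.outer(d, d) + K_j
    K_hat = scatter / (tau + m)
    return h_hat, K_hat


# Hypothetical example with m = 2 tasks and n = 2 items.
f_hats = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
K_hats = [0.1 * np.eye(2), 0.1 * np.eye(2)]
h_hat, K_hat = update_gp_prior(f_hats, K_hats,
                               h0=np.zeros(2), K0=np.eye(2), pi=1.0, tau=1.0)
```

The terms `pi * h0` and `tau * K0` pull the estimates towards the prior, which matters most when the number of tasks m is small.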
22 Inference
  Ordinal Regression
    $P(y_*^j \mid x_*, D, X, \hat{\theta}_j, \hat{h}, \hat{K}) = \int P(y_*^j \mid f_*^j, \hat{\theta}_j)\, P(f_*^j \mid x_*, D, X, \hat{h}, \hat{K})\, df_*^j$
  Non-stationary kernel on test data is unknown!
  Solution: work in the dual space (Yu et al. 2005)
    Posterior: $f_j \sim \mathcal{N}(f_j; \hat{f}_j, \hat{K}_j)$
    By the constraint $f_j = K \alpha_j$, the posterior is $\alpha_j \sim \mathcal{N}(\alpha_j; K^{-1} \hat{f}_j, K^{-1} \hat{K}_j K^{-1})$
    For test data we have $f_*^j = k_*^\top \alpha_j \sim \mathcal{N}(f_*^j; k_*^\top K^{-1} \hat{f}_j, k_*^\top K^{-1} \hat{K}_j K^{-1} k_*)$
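The dual-space predictive mean and variance for a test point can be sketched as below. The slide does not say how $K$ and $k_*$ are evaluated; this sketch assumes both come from a base kernel (an RBF kernel here, as a stand-in). The function names, the jitter term and the example values are also assumptions.

```python
import numpy as np


def rbf_kernel(A, B, length_scale=1.0):
    """Assumed base kernel: RBF between the rows of A (p, d) and B (q, d)."""
    sq = (np.sum(A ** 2, axis=1)[:, None]
          + np.sum(B ** 2, axis=1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-0.5 * np.maximum(sq, 0.0) / length_scale ** 2)


def predict_latent(x_star, X, f_hat, K_hat, kernel=rbf_kernel, jitter=1e-8):
    """Mean and variance of f_* = k_*^T alpha_j for one task j.

    f_hat, K_hat: posterior mean (n,) and covariance (n, n) of f_j on the training inputs X.
    """
    K = kernel(X, X) + jitter * np.eye(len(X))   # jitter only for numerical stability
    k_star = kernel(X, x_star[None, :])[:, 0]
    w = np.linalg.solve(K, k_star)               # K^{-1} k_*  (K is symmetric)
    mean = w @ f_hat                             # k_*^T K^{-1} f_hat
    var = w @ K_hat @ w                          # k_*^T K^{-1} K_hat K^{-1} k_*
    return mean, var


# Hypothetical example: three 1-D training inputs and one new test input.
X = np.array([[0.0], [1.0], [2.0]])
f_hat = np.array([1.0, 2.5, 4.0])
K_hat = np.array([[0.30, 0.05, 0.00],
                  [0.05, 0.20, 0.05],
                  [0.00, 0.05, 0.30]])
mean, var = predict_latent(np.array([1.5]), X, f_hat, K_hat)
```

At a training input the prediction reduces to the corresponding entries of the posterior mean and covariance, which is a quick sanity check on an implementation. The predictive label distribution is then obtained by integrating the ranking likelihood against this Gaussian.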
24 Experiments
  Predict user ratings in movie data
    MovieLens: 591 movies, 943 users
      19 features from the “Genre” part of each movie (binary)
    EachMovie: 1,075 movies, 72,916 users
      23,753 features from online database (TF-IDF)
  Experimental Settings
    Pick the 100 users with the most ratings as “tasks”
    Randomly choose 10, 20, 50 ratings for each user for training
    Base kernel: cosine similarity
25 Comparison Metrics
  Ordinal Regression Evaluation
    Mean absolute error (MAE): $\mathrm{MAE}(\hat{R}) = \frac{1}{t} \sum_{i=1}^{t} |\hat{R}(i) - R(i)|$
    Mean 0-1 error (MZOE): $\mathrm{MZOE}(\hat{R}) = \frac{1}{t} \sum_{i=1}^{t} 1_{\hat{R}(i) \ne R(i)}$
    Use Macro & Micro average over multiple tasks
  Ranking Evaluation
    Normalized Discounted Cumulative Gain (NDCG): $\mathrm{NDCG}(\hat{R}) \propto \sum_{k=1}^{t} \frac{2^{r(k)} - 1}{\log(1 + k)}$
    Only count the top 10 ranked items
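The three metrics can be sketched in plain Python. The slide gives NDCG only up to a proportionality constant; normalizing by the DCG of the ideal ordering, as done here, is the usual convention and an assumption about the setup. Function names and example ratings are hypothetical.

```python
import math


def mae(pred, true):
    """Mean absolute error between predicted and true rank labels."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def mzoe(pred, true):
    """Mean zero-one error: fraction of items with a wrong label."""
    return sum(p != t for p, t in zip(pred, true)) / len(true)


def dcg_at_k(ratings_in_rank_order, k=10):
    """Sum of (2^r(pos) - 1) / log(1 + pos) over the top-k positions."""
    top = ratings_in_rank_order[:k]
    return sum((2 ** r - 1) / math.log(1 + pos)
               for pos, r in enumerate(top, start=1))


def ndcg_at_k(scores, true, k=10):
    """NDCG@k: DCG of the predicted ordering divided by the ideal DCG (assumed normalizer)."""
    order = sorted(range(len(true)), key=lambda i: scores[i], reverse=True)
    ranked = [true[i] for i in order]
    best = dcg_at_k(sorted(true, reverse=True), k)
    return dcg_at_k(ranked, k) / best if best > 0 else 0.0


# Hypothetical ratings for one user on four test movies.
true_ratings = [5, 3, 1, 4]
pred_ratings = [4, 3, 2, 4]
user_mae = mae(pred_ratings, true_ratings)
user_mzoe = mzoe(pred_ratings, true_ratings)
user_ndcg = ndcg_at_k(pred_ratings, true_ratings)
```

In this example two of the four labels are wrong, yet the induced ordering of the movies is still ideal. That contrast is why the slide reports ranking metrics alongside MAE and MZOE.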
26 Results - MovieLens
  N: Number of training items for each user
  MMMF: Maximum Margin Matrix Factorization (Srebro et al 2005)
    State-of-the-art collaborative filtering model
27 Results - EachMovie
  N: Number of training items for each user
  MMMF: Maximum Margin Matrix Factorization (Srebro et al 2005)
    State-of-the-art collaborative filtering model
28 New Ranking Functions
  Test on the rest of the users for MovieLens
  Use different kernels
  The more users we use for training, the better the kernel we obtain!
29 Observations
  Collaborative models are always better than individual models
  We can learn a good non-stationary kernel from users
  GPR & CGPR are fast in training and robust in testing
    Since there is no approximation
  GPOR & CGPOR are slow and sometimes overfit
    Due to the numerical M-step
  We can use other ranking likelihoods $P(y_i \mid f(x_i), \theta)$
    Then we may need to do numerical integration in the EP step
31 Conclusion
  A Bayesian framework for multi-task ordinal regression
  An efficient EM-EP learning algorithm
  COR is better than individual OR algorithms
  COR is better than pure collaborative filtering
  Experiments show very encouraging results
32 Extensions
  The framework is applicable to preference learning
    Collaborative version of GP preference learning (Chu & Ghahramani, 2005)
    A probabilistic version of RankNet (Burges et al. 2005)
      $P(y_i \succ y_j \mid f(x_i) - f(x_j)) \propto \frac{\exp(f(x_i) - f(x_j))}{1 + \exp(f(x_i) - f(x_j))}$
  GP mixture model for multi-task learning
    Assign a Gaussian mixture model to each latent function
    Prediction uses a linear combination of learned kernels
    Connection to Dirichlet Processes
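The RankNet-style preference likelihood is a logistic function of the difference between two latent values. A minimal, numerically stable sketch follows; the function name is hypothetical, and the two-branch form is an implementation choice rather than something from the slide.

```python
import math


def preference_probability(f_i, f_j):
    """exp(d) / (1 + exp(d)) with d = f(x_i) - f(x_j), computed without overflow."""
    d = f_i - f_j
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


# Item i has the higher latent value, so it should be preferred with probability > 0.5.
p_ij = preference_probability(1.2, 0.4)
p_ji = preference_probability(0.4, 1.2)
```

The two probabilities sum to one, so the expression is already normalized over the two possible orderings of a pair.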
