Collaborative Ordinal Regression
Shipeng Yu
Joint work with Kai Yu, Volker Tresp and Hans-Peter Kriegel
University of Munich, Germany / Siemens Corporate Technology

Motivations
[Slide figure: movies (Superman, The Pianist, Star Wars, The Matrix, The Godfather, American Beauty), each described by features (genre, directors, actors, descriptions, ...) and rated on an ordinal scale from "very dislike" to "very like", with some ratings unknown. Ordinal regression learns the mapping from features to ratings.]

3 Motivations (Cont.)
[Slide figure: the same movies and features, now with one ratings column per user; many entries are unknown ("?"). Collaborative ordinal regression learns all users' rating functions jointly.]

4 Outline: Motivations, Ranking Problem, Bayesian Framework for Ordinal Regression, Collaborative Ordinal Regression, Learning and Inference, Experiments, Conclusion and Extensions

5 Ranking Problem
Goal: Assign ranks to objects $x_1, x_2, \dots, x_n \in \mathbb{R}^d$
Different from classification/regression problems:
- Binary classification: has only 2 labels
- Multi-class classification: ignores the ordering property
- Regression: only deals with real-valued outputs
Ordinal regression: assign each object a label from the ordered set $1 < 2 < \cdots < r$ (1 = lowest rank, $r$ = highest rank)
Preference learning: learn an ordering $x_{i_1} \succ x_{i_2} \succ \cdots \succ x_{i_n}$ over the objects

6 Ordinal Regression
Goal: Assign ordered labels to objects
Applications:
- User preference prediction
- Web ranking for search engines
- ...
[Slide figure: a latent function $f: \mathbb{R}^d \to \mathbb{R}$ maps the inputs $x_1, \dots, x_n$ to the real line; thresholds $b_0 < b_1 < \cdots < b_r$ partition the line into $r$ intervals, and the label $y_i$ is the index of the interval that contains $f(x_i)$.]
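To make the threshold mechanism concrete, here is a minimal sketch in Python; the helper name and the threshold values are illustrative, not from the paper:

```python
import numpy as np

def ordinal_label(f_x, b):
    """Map a latent value f(x) to an ordinal label y in {1, ..., r}:
    y is assigned when b_{y-1} < f(x) <= b_y, with b_0 = -inf, b_r = +inf."""
    return int(np.searchsorted(b[1:-1], f_x) + 1)

b = np.array([-np.inf, -1.0, 0.0, 1.0, np.inf])  # r = 4 ordered ranks
print(ordinal_label(0.3, b))                     # -> 3, since b_2 < 0.3 <= b_3
```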

7 One-task vs Multi-task
Common in real-world problems:
- Collaborative filtering: preference learning for multiple users
- Web ranking: ranking of web pages for different queries
Question: How can we learn related ranking tasks $f_1, f_2, \dots, f_m$ jointly?
- The different ranking functions are correlated
- Each function has ranked only part of the data
[Slide figure: several latent functions $f_1(x), f_2(x), \dots, f_m(x)$ over the same inputs $x_1, \dots, x_n \in \mathbb{R}^d$.]

8 Outline: Motivations, Ranking Problem, Bayesian Framework for Ordinal Regression, Collaborative Ordinal Regression, Learning and Inference, Experiments, Conclusion and Extensions

9 Bayesian Ordinal Regression
Conditional model on the ranking outputs:
- Ranking likelihood, conditioned on the latent function: $P(\mathbf{y} \mid X, f, \theta) = P(\mathbf{y} \mid f(x_1), \dots, f(x_n), \theta) = P(\mathbf{y} \mid \mathbf{f}, \theta) = \prod_i P(y_i \mid f(x_i), \theta)$
- Prior: a Gaussian process prior on the latent function values, $\mathbf{f} \sim \mathcal{N}(\mathbf{f}; \mathbf{h}, K)$
- Marginal ranking likelihood (the ordinal regression likelihood), integrating out the latent function values: $P(\mathbf{y} \mid X, \theta, \mathbf{h}, K) = \int P(\mathbf{y} \mid \mathbf{f}, \theta)\, P(\mathbf{f} \mid \mathbf{h}, K)\, d\mathbf{f}$

10 Bayesian Ordinal Regression (1)
Need to define the ranking likelihood $P(y_i \mid f(x_i), \theta)$
Example model (1): GP Regression (GPR)
- Assume a Gaussian form: $P(y_i \mid f(x_i), \theta) \propto \mathcal{N}(y_i; f(x_i), \sigma^2)$
- Regression on the ranking label directly
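As a sketch of this generative view, the snippet below draws latent function values from a GP prior and emits noisy real-valued rank labels under the Gaussian likelihood; the RBF base kernel, the data sizes and the noise level are assumptions chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# n = 8 items with d = 2 features; an RBF base kernel (illustrative choice)
X = rng.standard_normal((8, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-0.5 * sq) + 1e-8 * np.eye(len(X))  # jitter for numerical stability
h = np.zeros(len(X))                           # prior mean function

f = rng.multivariate_normal(h, K)              # f ~ N(h, K)
sigma = 0.1
y = f + sigma * rng.standard_normal(len(f))    # y_i ~ N(f(x_i), sigma^2)
```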

11 Bayesian Ordinal Regression (2)
Need to define the ranking likelihood $P(y_i \mid f(x_i), \theta)$
Example model (2): GP Ordinal Regression (GPOR) (Chu & Ghahramani, 2005)
- A probit ranking likelihood: $P(y_i \mid f(x_i), \theta) = \Phi\!\left(\frac{b_{y_i} - f(x_i)}{\sigma}\right) - \Phi\!\left(\frac{b_{y_i-1} - f(x_i)}{\sigma}\right)$
- The label probability is the Gaussian mass that falls between the surrounding thresholds $b_{y_i-1}$ and $b_{y_i}$
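A direct transcription of this likelihood, assuming the standard normal CDF from SciPy; the example thresholds and the function name are mine:

```python
import numpy as np
from scipy.stats import norm

def gpor_likelihood(y, f_x, b, sigma):
    """GPOR probit likelihood (Chu & Ghahramani, 2005):
    P(y | f(x)) = Phi((b_y - f(x)) / sigma) - Phi((b_{y-1} - f(x)) / sigma)."""
    return norm.cdf((b[y] - f_x) / sigma) - norm.cdf((b[y - 1] - f_x) / sigma)

b = np.array([-np.inf, -1.0, 0.0, 1.0, np.inf])       # b_0 < b_1 < ... < b_r
probs = [gpor_likelihood(y, 0.3, b, sigma=0.5) for y in range(1, 5)]
print(probs, sum(probs))                              # the r probabilities sum to 1
```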

12 Outline: Motivations, Ranking Problem, Bayesian Framework for Ordinal Regression, Collaborative Ordinal Regression, Learning and Inference, Experiments, Conclusion and Extensions

13 Multi-task Setting?
Naïve approach 1: Learn a separate GP model for each task
- No sharing of information between the tasks
Naïve approach 2: Fit one parametric kernel jointly
- A parametric kernel is too restrictive to fit all the tasks
The collaborative effects:
- Common preferences: functions share similar regression labels on some items
- Similar variabilities: functions tend to have the same predictability on similar items

14 Collaborative Ordinal Regression
Hierarchical GP model for multi-task ordinal regression, with a shared prior $(\mathbf{h}, K)$:
- Mean function $\mathbf{h}$: models the common preferences
- Covariance matrix $K$: models the similar variabilities
Both the mean function and the (non-stationary) covariance matrix are learned from data
[Slide figure: the GP prior $(\mathbf{h}, K)$ generates latent functions $f_1, f_2, \dots, f_m$, which generate the rating columns $\mathbf{y}_1, \mathbf{y}_2, \dots, \mathbf{y}_m$ of a partially observed item-by-user rating table through the ordinal regression likelihood.]

15 COR: The Model
Hierarchical Bayes model on the functions:
- All the latent functions are sampled from the same GP prior: $\mathbf{f}_j \sim \mathcal{N}(\mathbf{f}_j; \mathbf{h}, K)$
- Different tasks are allowed different likelihood parameter settings $\theta_j$
- We may observe only part of the rank labels for each function
$P(\mathcal{D} \mid X, \Theta, \mathbf{h}, K) = \prod_{j=1}^m P(\mathbf{y}_j \mid X, \theta_j, \mathbf{h}, K) = \prod_{j=1}^m \int P(\mathbf{y}_j \mid \mathbf{f}_j, \theta_j)\, P(\mathbf{f}_j \mid \mathbf{h}, K)\, d\mathbf{f}_j$
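The sharing of the prior is easy to see in code: every task function is one draw from the same $\mathcal{N}(\mathbf{h}, K)$. A toy sketch (the covariance and the sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 8, 4                                # items and tasks (illustrative sizes)
h = np.zeros(n)                            # shared mean function
K = np.eye(n) + 0.5 * np.ones((n, n))      # a toy shared covariance matrix

# All task functions f_1, ..., f_m are drawn from the same GP prior N(h, K).
F = rng.multivariate_normal(h, K, size=m)  # shape (m, n); row j is f_j
```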

16 COR: The Key Points
- The GP prior connects all the ordinal regression tasks: it models the first and second sufficient statistics
- The lower-level features are incorporated naturally, which makes the model more general than pure collaborative filtering
- We don't fix a parametric form for the kernel; instead we assign the conjugate prior $P(\mathbf{h}, K) = \mathcal{N}(\mathbf{h}; \mathbf{h}_0, \tfrac{1}{\pi} K)\, \mathcal{IW}(K; \tau, K_0)$
- We can make predictions for new input data and for new tasks
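For intuition, the conjugate Normal-Inverse-Wishart prior can be sampled with SciPy. Caveat: the slide's $\mathcal{IW}(K; \tau, K_0)$ notation does not pin down a parameterization, so the mapping to SciPy's (df, scale) arguments below is an assumption:

```python
import numpy as np
from scipy.stats import invwishart, multivariate_normal

n, pi, tau = 5, 1.0, 10.0   # pi, tau: prior strength parameters (illustrative)
h0 = np.zeros(n)            # base mean h_0
K0 = np.eye(n)              # base kernel K_0 evaluated on the n items

# K ~ IW(tau, K0) (assumed df/scale mapping), then h | K ~ N(h0, K / pi)
K = invwishart.rvs(df=tau + n, scale=tau * K0, random_state=0)
h = multivariate_normal.rvs(mean=h0, cov=K / pi, random_state=0)
```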

17 Toy Problem (GPR Model)
[Slide figure panels: the learned covariance matrix; the learned mean function against the mean rank labels; prediction for a new task with the base (RBF) kernel; prediction for a new task with the learned kernel.]

18 Outline: Motivations, Ranking Problem, Bayesian Framework for Ordinal Regression, Collaborative Ordinal Regression, Learning and Inference, Experiments, Conclusion and Extensions

19 Learning
Variational lower bound:
$\log P(\mathcal{D} \mid X, \Theta, \mathbf{h}, K) \ge \sum_{j=1}^m \int Q(\mathbf{f}_j) \log \frac{P(\mathbf{y}_j \mid \mathbf{f}_j, \theta_j)\, P(\mathbf{f}_j \mid \mathbf{h}, K)}{Q(\mathbf{f}_j)}\, d\mathbf{f}_j$
EM learning:
- E-step: Approximate each posterior $Q(\mathbf{f}_j)$ by a Gaussian $\mathcal{N}(\mathbf{f}_j; \hat{\mathbf{f}}_j, \hat{K}_j)$, estimating the mean vector and covariance matrix with EP
- M-step: Fix the $Q(\mathbf{f}_j)$ and maximize the bound w.r.t. $(\mathbf{h}, K)$ and the $\theta_j$

20 E-step
The true posterior distribution factorizes: $Q(\mathbf{f}) \propto \prod_i P(y_i \mid f(x_i), \theta)\, P(\mathbf{f} \mid \mathbf{h}, K)$
EP approximates each likelihood factor $t_k$ with a Gaussian factor $\tilde{t}_k$ and iterates:
- Deletion: Delete the factor $\tilde{t}_k$ from the approximated Gaussian
- Moment matching: Add the true likelihood factor back in and match moments
- Update: Update the factor $\tilde{t}_k$
These steps can be done analytically for the example models; for the GPR model the EP step is exact
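Because the GPR likelihood is itself Gaussian, the E-step posterior has the standard GP regression closed form and no EP iteration is needed there. A sketch under that assumption:

```python
import numpy as np

def gpr_posterior(y, h, K, sigma):
    """Exact Gaussian posterior N(f; f_hat, K_hat) under the GPR likelihood
    y_i ~ N(f(x_i), sigma^2); for this model the EP E-step is exact."""
    A = K + sigma**2 * np.eye(len(y))          # K + sigma^2 I
    f_hat = h + K @ np.linalg.solve(A, y - h)  # posterior mean
    K_hat = K - K @ np.linalg.solve(A, K)      # posterior covariance
    return f_hat, K_hat

# toy usage on a 3-item kernel
K = np.array([[1.0, 0.5, 0.2], [0.5, 1.0, 0.5], [0.2, 0.5, 1.0]])
f_hat, K_hat = gpr_posterior(np.array([1.0, 2.0, 3.0]), np.zeros(3), K, 0.5)
```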

21 M-step
Update of the GP prior (does not depend on the form of the ranking likelihood; the conjugate prior contributes a smoothing term):
$\hat{\mathbf{h}} = \frac{1}{\pi + m}\left(\sum_{j=1}^m \hat{\mathbf{f}}_j + \pi \mathbf{h}_0\right)$
$\hat{K} = \frac{1}{\tau + m}\left(\pi(\hat{\mathbf{h}} - \mathbf{h}_0)(\hat{\mathbf{h}} - \mathbf{h}_0)^\top + \tau K_0 + \sum_{j=1}^m \left[(\hat{\mathbf{f}}_j - \hat{\mathbf{h}})(\hat{\mathbf{f}}_j - \hat{\mathbf{h}})^\top + \hat{K}_j\right]\right)$
Update of the likelihood parameters $\theta_j$: done separately for each task, with the same update equation as in the single-task case
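The two prior updates are closed-form, so they translate directly to code; this is a line-by-line transcription of the equations above (function name mine):

```python
import numpy as np

def m_step_prior(f_hats, K_hats, h0, K0, pi, tau):
    """Update the shared GP prior (h, K) from the per-task posterior
    moments (f_hat_j, K_hat_j), following the slide's equations."""
    m = len(f_hats)
    h = (sum(f_hats) + pi * h0) / (pi + m)
    d = h - h0
    S = sum(np.outer(fj - h, fj - h) + Kj for fj, Kj in zip(f_hats, K_hats))
    K = (pi * np.outer(d, d) + tau * K0 + S) / (tau + m)
    return h, K
```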

22 Inference
Problem: in ordinal regression the non-stationary kernel on test data is unknown. Solution: work in the dual space (Yu et al. 2005)
- Posterior: $\mathbf{f}_j \sim \mathcal{N}(\mathbf{f}_j; \hat{\mathbf{f}}_j, \hat{K}_j)$
- By the constraint $\mathbf{f}_j = K\boldsymbol{\alpha}_j$, the posterior of the dual weights is $\boldsymbol{\alpha}_j \sim \mathcal{N}(\boldsymbol{\alpha}_j; K^{-1}\hat{\mathbf{f}}_j, K^{-1}\hat{K}_j K^{-1})$
- For test data we have $f_j^* = \mathbf{k}^{*\top}\boldsymbol{\alpha}_j \sim \mathcal{N}(f_j^*; \mathbf{k}^{*\top}K^{-1}\hat{\mathbf{f}}_j, \mathbf{k}^{*\top}K^{-1}\hat{K}_j K^{-1}\mathbf{k}^*)$
- Predictive distribution: $P(y_j^* \mid x^*, \mathcal{D}, X, \hat{\theta}_j, \hat{\mathbf{h}}, \hat{K}) = \int P(y_j^* \mid f_j^*, \hat{\theta}_j)\, P(f_j^* \mid x^*, \mathcal{D}, X, \hat{\mathbf{h}}, \hat{K})\, df_j^*$
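The dual-space predictive moments above reduce to a single linear solve; a sketch (function name mine, assuming K is invertible on the training items):

```python
import numpy as np

def predict_latent(k_star, K, f_hat, K_hat):
    """Predictive moments of f* for a new item via the dual weights
    alpha = K^{-1} f: mean = k*^T K^{-1} f_hat,
    var = k*^T K^{-1} K_hat K^{-1} k*."""
    v = np.linalg.solve(K, k_star)   # K^{-1} k*, using the symmetry of K
    return v @ f_hat, v @ K_hat @ v
```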

23 Outline: Motivations, Ranking Problem, Bayesian Framework for Ordinal Regression, Collaborative Ordinal Regression, Learning and Inference, Experiments, Conclusion and Extensions

24 Experiments
Predict user ratings on movie data:
- MovieLens: 591 movies, 943 users; 19 binary features from the "Genre" part of each movie
- EachMovie: 1,075 movies, 72,916 users; 23,753 features from an online database (TF-IDF)
Experimental settings:
- Pick the 100 users with the most ratings as "tasks"
- Randomly choose 10, 20 or 50 ratings per user for training
- Base kernel: cosine similarity

25 Comparison Metrics
Ordinal regression evaluation (macro & micro averages over the multiple tasks):
- Mean absolute error: $\mathrm{MAE}(\hat{R}) = \frac{1}{t}\sum_{i=1}^t |\hat{R}(i) - R(i)|$
- Mean 0-1 error: $\mathrm{MZOE}(\hat{R}) = \frac{1}{t}\sum_{i=1}^t \mathbf{1}_{\hat{R}(i) \neq R(i)}$
Ranking evaluation:
- Normalized discounted cumulative gain, counting only the top 10 ranked items: $\mathrm{NDCG}(\hat{R}) \propto \sum_{k=1}^t \frac{2^{r(k)} - 1}{\log(1 + k)}$
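The three metrics as written on the slide, in a short sketch; the slide gives NDCG only up to a constant, so the normalization by the ideal ordering below is the usual convention and an assumption here (natural log, top 10 items):

```python
import numpy as np

def mae(r_hat, r):
    return np.mean(np.abs(r_hat - r))   # mean absolute error

def mzoe(r_hat, r):
    return np.mean(r_hat != r)          # mean 0-1 error

def ndcg_at_10(scores, r):
    """Sum (2^{r(k)} - 1) / log(1 + k) over the 10 items ranked highest by
    the predicted scores, divided by the same sum under the ideal ordering."""
    def dcg(order):
        gains = 2.0 ** r[order[:10]] - 1.0
        return np.sum(gains / np.log(1.0 + np.arange(1, len(gains) + 1)))
    return dcg(np.argsort(-scores)) / dcg(np.argsort(-r))

r = np.array([5, 3, 4, 1, 2])                  # true ratings
scores = np.array([4.8, 2.9, 4.1, 1.2, 2.0])   # predicted scores
print(mae(np.round(scores), r), mzoe(np.round(scores), r), ndcg_at_10(scores, r))
```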

26 Results - MovieLens
[Slide shows the results table.]
- N: number of training items for each user
- MMMF: Maximum Margin Matrix Factorization (Srebro et al. 2005), a state-of-the-art collaborative filtering model

27 Results - EachMovie
[Slide shows the results table.]
- N: number of training items for each user
- MMMF: Maximum Margin Matrix Factorization (Srebro et al. 2005), a state-of-the-art collaborative filtering model

28 New Ranking Functions
- Test on the remaining users of MovieLens, using different kernels
- The more users we use for training, the better the kernel we obtain!

29 Observations
- Collaborative models are always better than individual models
- We can learn a good non-stationary kernel from the users
- GPR & CGPR are fast in training and robust in testing, since there is no approximation
- GPOR & CGPOR are slower and sometimes overfit, due to the numerical M-step
- We can use other ranking likelihoods $P(y_i \mid f(x_i), \theta)$; then we may need numerical integration in the EP step

30 Outline: Motivations, Ranking Problem, Bayesian Framework for Ordinal Regression, Collaborative Ordinal Regression, Learning and Inference, Experiments, Conclusion and Extensions

31 Conclusion
- A Bayesian framework for multi-task ordinal regression
- An efficient EM-EP learning algorithm
- COR is better than individual OR algorithms
- COR is better than pure collaborative filtering
- Experiments show very encouraging results

32 Extensions
The framework is applicable to preference learning:
- A collaborative version of GP preference learning (Chu & Ghahramani, 2005)
- A probabilistic version of RankNet (Burges et al. 2005), with the pairwise likelihood $P(y_i \succ y_j \mid f(x_i) - f(x_j)) \propto \frac{\exp(f(x_i) - f(x_j))}{1 + \exp(f(x_i) - f(x_j))}$
GP mixture model for multi-task learning:
- Assign a Gaussian mixture model to each latent function
- Prediction uses a linear combination of the learned kernels
- Connection to Dirichlet processes
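The pairwise likelihood above is just the logistic sigmoid of the score difference; a one-line sketch (function name mine):

```python
import numpy as np

def pref_likelihood(f_i, f_j):
    """P(i preferred to j) = exp(f_i - f_j) / (1 + exp(f_i - f_j)),
    i.e. the logistic sigmoid of the score difference (as in RankNet)."""
    return 1.0 / (1.0 + np.exp(-(f_i - f_j)))

print(pref_likelihood(1.2, 0.7))  # ~0.62: item i mildly preferred over j
```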

Thanks! Questions?