Fast Maximum Margin Matrix Factorization for Collaborative Prediction
Jason Rennie (MIT), Nati Srebro (University of Toronto)
Collaborative Prediction
Based on a partially observed users × movies rating matrix, predict the unobserved entries: "Will user i like movie j?"
Problems to Address
–Underlying representation of preferences: norm-constrained matrix factorization (MMMF)
–Discrete, ordered labels: threshold-based ordinal regression
–Scaling up MMMF to large problems: factorized objective, gradient descent
–Ratings may not be missing at random
Linear Factor Model
[Figure: each rating is modeled as a combination of movie feature values (e.g. comic value, dramatic value, violence) weighted by the user's preference weights.]
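To make the picture concrete, here is a minimal sketch (feature names and numbers are invented, not taken from the talk) of how one rating is scored in a linear factor model:

```python
import numpy as np

# Illustrative sketch: a user's predicted score for a movie is the dot
# product of the user's preference weights with the movie's feature values.
movie_features = np.array([0.8, 0.1, 0.3])    # e.g. comic, dramatic, violence
user_weights   = np.array([1.5, -0.5, 0.2])   # how much this user values each

predicted_score = user_weights @ movie_features
print(predicted_score)   # 1.21
```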
Ordinal Regression
[Figure: a single user's preference weight vector applied to the movie feature vectors gives real-valued preference scores, which are thresholded into discrete ratings.]
Matrix Factorization
[Figure: stacking all users' preference weights and all movies' feature vectors, the full matrix of preference scores is the product of the weight matrix and the feature matrix.]
Matrix Factorization
[Figure: X = UV', where U holds the user preference weights, V the movie feature vectors, and Y is the rating matrix obtained by thresholding X.]
Max-Margin Ordinal Regression [Shashua & Levin, NIPS 2002]
Absolute Difference
Shashua & Levin's loss bounds the misclassification error. In ordinal regression we instead want to minimize the absolute difference between the predicted and true labels.
All-Thresholds Loss [Srebro et al., NIPS 2004] [Chu & Keerthi, ICML 2005]
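As a hedged sketch of the construction (the exact margin and threshold conventions here are my assumptions, not copied from the papers), the all-thresholds loss charges one hinge term per threshold, so a prediction far from the true rating pays for every threshold it lands on the wrong side of:

```python
def hinge(u):
    # Standard hinge: max(0, 1 - u).
    return max(0.0, 1.0 - u)

def all_thresholds_loss(z, y, thetas):
    """All-thresholds ordinal regression loss (sketch).

    z      : real-valued preference score
    y      : true rating in {1, ..., len(thetas) + 1}
    thetas : increasing thresholds theta_1 < ... < theta_{R-1}
    """
    loss = 0.0
    for r, theta in enumerate(thetas, start=1):
        if r < y:
            loss += hinge(z - theta)   # this threshold should lie below the score
        else:
            loss += hinge(theta - z)   # this threshold should lie above the score
    return loss

print(all_thresholds_loss(z=2.3, y=3, thetas=[0.0, 1.0, 2.0, 3.0]))   # 1.6
```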
All-Thresholds Loss
Experiments comparing:
–Least squares regression
–Multi-class classification
–Shashua & Levin's Max-Margin OR
–All-Thresholds OR
All-Thresholds Ordinal Regression achieved:
–Lowest misclassification error
–Lowest absolute difference error
[Rennie & Srebro, IJCAI Wkshp 2005]
Learning Weights & Features
[Figure: both the users' preference weights and the movies' feature vectors are unknown; they must be learned jointly from the observed ratings.]
Low Rank Matrix Factorization
X = UV' with rank k, X ≈ Y.
–Sum-squared loss, fully observed Y: use the SVD to find the global optimum.
–Classification error loss, partially observed Y: non-convex, no explicit solution.
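For the easy case (sum-squared loss, fully observed Y), a brief sketch of the SVD route: by the Eckart–Young theorem, the truncated SVD gives the best rank-k approximation.

```python
import numpy as np

def best_rank_k(Y, k):
    # Truncated SVD: the best rank-k approximation of Y under squared loss.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

Y = np.random.randn(20, 15)
X = best_rank_k(Y, k=3)
print(np.linalg.matrix_rank(X))   # 3
```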
Norm Constrained Factorization [Fazel et al., 2001]
X = UV' with low norm:
||X||_tr = min_{X=UV'} (||U||_Fro² + ||V||_Fro²) / 2,   where ||U||_Fro² = ∑_{i,j} U_ij²
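Numerically, the trace norm is the sum of singular values, and a balanced factorization built from the SVD attains the minimum in the definition above. A small sketch (assuming NumPy; not code from the talk):

```python
import numpy as np

def trace_norm(X):
    # Trace (nuclear) norm: sum of singular values.
    return np.linalg.svd(X, compute_uv=False).sum()

X = np.random.randn(8, 6)
U0, s, Vt = np.linalg.svd(X, full_matrices=False)
U = U0 @ np.diag(np.sqrt(s))      # balanced factorization X = U V'
V = Vt.T @ np.diag(np.sqrt(s))

print(np.allclose(X, U @ V.T))                                        # True
print(np.isclose(trace_norm(X), (np.sum(U**2) + np.sum(V**2)) / 2))   # True
```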
MMMF Objective
Original objective [Srebro et al., NIPS 2004]:
min_X ||X||_tr + c · loss(X, Y)    (all-thresholds loss)
Factorized objective:
min_{U,V} (||U||_Fro² + ||V||_Fro²) / 2 + c · loss(UV', Y)    (all-thresholds loss)
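A hedged sketch of evaluating the factorized objective, reusing the all_thresholds_loss helper sketched earlier and a single shared threshold vector for simplicity (function and variable names are mine, not the authors'):

```python
import numpy as np

def factorized_objective(U, V, thetas, observed, c):
    # observed: dict mapping (user i, movie j) -> rating y.
    reg = 0.5 * (np.sum(U ** 2) + np.sum(V ** 2))   # stands in for the trace norm of X = UV'
    X = U @ V.T
    data_loss = sum(all_thresholds_loss(X[i, j], y, thetas)
                    for (i, j), y in observed.items())
    return reg + c * data_loss
```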
Smooth Hinge
[Figure: function value and gradient of the hinge vs. the smooth hinge; the smooth hinge agrees with the hinge away from the margin but has a continuous gradient.]
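One concrete form of the smooth hinge: linear for z ≤ 0, quadratic on the margin region 0 < z < 1, and zero for z ≥ 1 (the exact constants here should be checked against the paper):

```python
import numpy as np

def smooth_hinge(z):
    # Linear below 0, quadratic on (0, 1), zero above 1.
    z = np.asarray(z, dtype=float)
    return np.where(z <= 0, 0.5 - z,
           np.where(z < 1, 0.5 * (1.0 - z) ** 2, 0.0))

def smooth_hinge_grad(z):
    # Continuous everywhere, unlike the gradient of the plain hinge.
    z = np.asarray(z, dtype=float)
    return np.where(z <= 0, -1.0,
           np.where(z < 1, z - 1.0, 0.0))
```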
Collaborative Prediction Results
Datasets (size, sparsity): EachMovie 36656×1648, 96% sparse; MovieLens 6040×3952, 96% sparse.
[Table: weak and strong error on each dataset for URP, Attitude, and MMMF; URP and Attitude results from Marlin, 2004.]
Local Minima?
Factorized objective: min_{U,V} (||U||_Fro² + ||V||_Fro²) / 2 + c · loss(UV', Y)
–Optimize X with an SDP (convex: no local minima)
–Optimize U, V with gradient descent (non-convex: local minima possible)
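A minimal gradient-descent sketch on the factorized objective, using a squared loss on observed entries instead of the all-thresholds loss so the gradient stays short (this simplification and all names are mine, not the authors'):

```python
import numpy as np

def mmmf_gd(Y, mask, k=5, c=1.0, lr=0.01, steps=500, seed=0):
    # Gradient descent on 0.5*(||U||^2 + ||V||^2) + c * 0.5 * sum_observed (UV' - Y)^2.
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    for _ in range(steps):
        R = mask * (U @ V.T - Y)      # residuals on observed entries only
        grad_U = U + c * (R @ V)
        grad_V = V + c * (R.T @ U)
        U -= lr * grad_U
        V -= lr * grad_V
    return U, V
```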
Local Minima?
[Figure: difference between the SDP solution and the gradient-descent solution, measured both entry-wise in X and in objective value. Data: 100×100 MovieLens subset, 65% sparse.]
Summary
–We scaled MMMF to large problems by optimizing the factorized objective.
–Empirical tests indicate that local-minima issues are rare or absent.
–Results on large-scale data show substantial improvements over the state of the art.
–D'Aspremont & Srebro: large-scale SDP optimization methods; train on 1.5 million binary labels in 20 hours.