Principled Regularization for Probabilistic Matrix Factorization. Robert Bell, Suhrid Balakrishnan, AT&T Labs-Research. Duke Workshop on Sensing and Analysis of High-Dimensional Data.

Presentation transcript:

Principled Regularization for Probabilistic Matrix Factorization
Robert Bell, Suhrid Balakrishnan (AT&T Labs-Research)
Duke Workshop on Sensing and Analysis of High-Dimensional Data, July 26-28, 2011

Probabilistic Matrix Factorization (PMF)
Approximate a large n-by-m matrix R by
– M = P′Q
– P and Q each have k rows, k << n, m
– m_ui = p_u′q_i
– R may be sparsely populated
Prime tool in Netflix Prize
– 99% of ratings were missing
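As a concrete illustration, a minimal numpy sketch of this parameterization; the sizes and the observed triples below are illustrative, not from the talk.

```python
import numpy as np

# Sketch of the PMF parameterization on the slide:
# R (n x m, sparsely observed) is approximated by M = P'Q,
# where P is k x n and Q is k x m, with k << n, m.
n, m, k = 1000, 400, 2                   # illustrative sizes
rng = np.random.default_rng(0)

P = rng.normal(scale=0.1, size=(k, n))   # user factors, one column per user
Q = rng.normal(scale=0.1, size=(k, m))   # item factors, one column per item

def predict(u, i):
    """m_ui = p_u' q_i: inner product of user and item factor vectors."""
    return P[:, u] @ Q[:, i]

# Sparse observations: (user, item, rating) triples standing in for the
# small fraction of entries that are actually observed.
observed = [(0, 3, 4.0), (0, 7, 2.5), (5, 3, 3.0)]
errors = [r - predict(u, i) for u, i, r in observed]
```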

Regularization for PMF
Needed to avoid overfitting
– Even after limiting the rank of M
– Critical for sparse, imbalanced data
Penalized least squares
– Minimize Σ_{observed (u,i)} (r_ui − p_u′q_i)² + λ (||P||_F² + ||Q||_F²)
– or Σ_{observed (u,i)} (r_ui − p_u′q_i)² + λ_P ||P||_F² + λ_Q ||Q||_F²
– λ's selected by cross validation
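The penalty formulas did not survive extraction, so the two objectives above are reconstructed as the standard L2-penalized PMF loss. The sketch below computes both forms; the function and argument names are illustrative.

```python
import numpy as np

def pmf_loss(P, Q, observed, lam_P, lam_Q):
    """Penalized least squares for PMF.

    observed: iterable of (u, i, r_ui) for the sparsely observed entries.
    Using lam_P == lam_Q gives the single-lambda objective; distinct
    values give the two-lambda variant.
    """
    sq_err = sum((r - P[:, u] @ Q[:, i]) ** 2 for u, i, r in observed)
    penalty = lam_P * np.sum(P ** 2) + lam_Q * np.sum(Q ** 2)
    return sq_err + penalty
```

In the talk's setup, candidate λ values would then be compared by squared error on held-out (validation) entries rather than on the training entries themselves.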

Research Questions
Should we use separate λ_P and λ_Q?

Research Questions
Should we use separate λ_P and λ_Q?
Should we use k separate λ's for each dimension of P and Q?

Matrix Completion with Noise (Candes and Plan, Proc IEEE, 2010)
Rank reduction without explicit factors
– No pre-specification of k, rank(M)
Regularization applied directly to M
– Trace norm, a.k.a. nuclear norm
– Sum of the singular values of M
Minimize ||M||_* subject to a bound on the squared error over the observed entries
"Equivalent" to L2 regularization for P, Q
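For concreteness, a small numpy sketch of the trace norm and of singular-value soft-thresholding, the proximal step commonly used inside trace-norm-penalized matrix-completion solvers; the threshold tau is an illustrative parameter, not a value from the talk.

```python
import numpy as np

def trace_norm(M):
    """Trace (nuclear) norm: the sum of the singular values of M."""
    return np.linalg.svd(M, compute_uv=False).sum()

def svd_soft_threshold(M, tau):
    """Shrink the singular values of M by tau (floored at zero).

    This is the proximal operator of the trace norm, the basic building
    block of trace-norm matrix-completion algorithms.
    """
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```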

Research Questions
Should we use separate λ_P and λ_Q?
Should we use k separate λ's for each dimension of P and Q?
Should we use the trace norm for regularization?

Bayesian Matrix Factorization (BPMF) (Salakhutdinov and Mnih, ICML 2008)
Let r_ui ~ N(p_u′q_i, σ²)
No PMF-type regularization
p_u ~ N(μ_P, Λ_P⁻¹) and q_i ~ N(μ_Q, Λ_Q⁻¹)
Priors for σ², μ_P, μ_Q, Λ_P, Λ_Q
Fit by Gibbs sampling
Substantial reduction in prediction error relative to PMF with L2 regularization
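As a sketch of how the Gibbs sampler uses these priors: the conditional posterior of each p_u given Q, the hyperparameters, and that user's ratings is Gaussian, and one sweep redraws every p_u and q_i (and the hyperparameters) from such conditionals. The function below shows only the per-user draw; the hyperparameter updates are omitted and all names are illustrative.

```python
import numpy as np

def sample_user_factor(Q, ratings_u, mu_P, Lambda_P, sigma2, rng):
    """One Gibbs draw of a user factor p_u in a BPMF-style model (sketch).

    ratings_u: list of (item index i, rating r_ui) observed for this user.
    Prior: p_u ~ N(mu_P, Lambda_P^{-1}); likelihood: r_ui ~ N(p_u' q_i, sigma2).
    The conditional posterior of p_u is Gaussian with the precision and
    mean computed below.
    """
    precision = Lambda_P.copy()      # prior precision
    lin = Lambda_P @ mu_P            # prior contribution to the mean
    for i, r in ratings_u:
        q = Q[:, i]
        precision += np.outer(q, q) / sigma2
        lin += r * q / sigma2
    cov = np.linalg.inv(precision)
    return rng.multivariate_normal(cov @ lin, cov)
```

A full sweep would also resample σ², μ_P, Λ_P, μ_Q, Λ_Q from their own conditionals, which is where the automatic, data-driven regularization comes from.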

Research Questions
Should we use separate λ_P and λ_Q?
Should we use k separate reg. parameters for each dimension of P and Q?
Should we use the trace norm for regularization?
Does BPMF "regularize" appropriately?

Matrix Factorization with Biases
Let m_ui = μ + a_u + b_i + p_u′q_i
Regularization similar to before
– Minimize Σ_{observed (u,i)} (r_ui − m_ui)² + λ (Σ_u a_u² + Σ_i b_i² + ||P||_F² + ||Q||_F²)
– or the same sum of squares with separate penalties λ_a Σ_u a_u² + λ_b Σ_i b_i² + λ_P ||P||_F² + λ_Q ||Q||_F²
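A sketch of the biased objective in both forms; as above, the exact slide formulas are reconstructed rather than transcribed, and the names are illustrative.

```python
import numpy as np

def biased_pmf_loss(mu, a, b, P, Q, observed, lam_a, lam_b, lam_P, lam_Q):
    """Penalized least squares with biases: m_ui = mu + a_u + b_i + p_u'q_i.

    Setting lam_a == lam_b == lam_P == lam_Q recovers the single-lambda
    objective; distinct values give the separate-lambda variant.
    """
    sq_err = sum((r - (mu + a[u] + b[i] + P[:, u] @ Q[:, i])) ** 2
                 for u, i, r in observed)
    penalty = (lam_a * np.sum(a ** 2) + lam_b * np.sum(b ** 2)
               + lam_P * np.sum(P ** 2) + lam_Q * np.sum(Q ** 2))
    return sq_err + penalty
```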

Research Questions
Should we use separate λ_P and λ_Q?
Should we use k separate reg. parameters for each dimension of P and Q?
Should we use the trace norm for regularization?
Does BPMF "regularize" appropriately?
Should we use separate λ's for the biases?

Some Things this Talk Will Not Cover
Various extensions of PMF
– Combining explicit and implicit feedback
– Time-varying factors
– Non-negative matrix factorization
– L1 regularization
– λ's depending on user or item sample sizes
Efficiency of optimization algorithms
– Use Newton's method, each coordinate separately
– Iterate to convergence

No Need for Separate λ_P and λ_Q
M = (cP)′(c⁻¹Q) is invariant for c ≠ 0
For initial P and Q
– Solve for c to minimize λ_P ||cP||_F² + λ_Q ||c⁻¹Q||_F²
– c = (λ_Q ||Q||_F² / (λ_P ||P||_F²))^(1/4)
– Gives λ_P ||cP||_F² = λ_Q ||c⁻¹Q||_F² = (λ_P λ_Q)^(1/2) ||P||_F ||Q||_F
Sufficient to let λ_P = λ_Q = λ_PQ
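Worked out, under the assumption that the penalty is λ_P||P||_F² + λ_Q||Q||_F², the rescaling argument is:

```latex
% M is unchanged when (P, Q) is replaced by (cP, c^{-1}Q), so only the
% penalty depends on c.  Minimizing it over c:
\min_{c \neq 0}\; \lambda_P \lVert cP \rVert_F^2 + \lambda_Q \lVert c^{-1}Q \rVert_F^2
   = \min_{c}\; c^2 \lambda_P \lVert P \rVert_F^2 + c^{-2} \lambda_Q \lVert Q \rVert_F^2 .
% Setting the derivative with respect to c to zero gives
c^4 = \frac{\lambda_Q \lVert Q \rVert_F^2}{\lambda_P \lVert P \rVert_F^2},
\qquad
\lambda_P \lVert cP \rVert_F^2 = \lambda_Q \lVert c^{-1}Q \rVert_F^2
   = \sqrt{\lambda_P \lambda_Q}\, \lVert P \rVert_F \lVert Q \rVert_F .
% The minimized penalty depends on (\lambda_P, \lambda_Q) only through the
% product \lambda_P \lambda_Q, so nothing is lost by taking
% \lambda_P = \lambda_Q = \lambda_{PQ}.
```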

Bayesian Motivation for L2 Regularization
Simplest case: only one item
– R is n-by-1
– R_u1 = a_1 + ε_u1, a_1 ~ N(0, σ_a²), ε_u1 ~ N(0, σ²)
Posterior mean (or MAP) of a_1 satisfies
– minimize Σ_u (R_u1 − a_1)² + λ_a a_1²
– λ_a = (σ² / σ_a²)
– Best λ_a is inversely proportional to σ_a²
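A worked version of the one-item argument, assuming the model stated above:

```latex
% One item: R_{u1} = a_1 + \varepsilon_{u1}, with a_1 \sim N(0, \sigma_a^2)
% and \varepsilon_{u1} \sim N(0, \sigma^2), u = 1, \dots, n.
% The posterior mean (and MAP) of a_1 minimizes the penalized sum of squares
\sum_{u=1}^{n} (R_{u1} - a_1)^2 + \lambda_a a_1^2,
\qquad \lambda_a = \frac{\sigma^2}{\sigma_a^2},
% which yields the shrinkage estimator
\hat{a}_1 = \frac{n}{n + \lambda_a}\, \bar{R}_{1} .
% Hence the best \lambda_a is inversely proportional to \sigma_a^2:
% parameters with larger prior variance should be shrunk less.
```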

Implications for Regularization of PMF
Allow λ_a ≠ λ_b
– If σ_a² ≠ σ_b²
Allow λ_a ≠ λ_b ≠ λ_PQ
Allow λ_PQ1 ≠ λ_PQ2 ≠ … ≠ λ_PQk?
– Trace norm does not
– BPMF appears to

Simulation Experiment Structure
n = 2,500 users, m = 400 items
250,000 observed ratings
– 150,000 in Training (to estimate a, b, P, Q)
– 50,000 in Validation (to tune λ's)
– 50,000 in Test (to estimate MSE)
Substantial imbalance in ratings
– 8 to 134 ratings per user in Training data
– 33 to 988 ratings per item in Training data

Simulation Model
r_ui = a_u + b_i + p_u1 q_i1 + p_u2 q_i2 + ε_ui
Elements of a, b, P, Q, and ε
– Independent normals with mean 0
– Var(a_u) = 0.09
– Var(b_i) = 0.16
– Var(p_u1 q_i1) = 0.04
– Var(p_u2 q_i2) = 0.01
– Var(ε_ui) =
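A sketch of a generator matching this simulation model; the noise variance did not survive in the transcript, so it is left as a required argument, and all names are illustrative.

```python
import numpy as np

def simulate(noise_var, n=2500, m=400, seed=0):
    """Draw a full n x m table of means plus noise from the slide's model:
    r_ui = a_u + b_i + p_u1*q_i1 + p_u2*q_i2 + eps_ui, all terms independent
    zero-mean normals.  noise_var is a placeholder for Var(eps_ui), which is
    not given in the transcript.  Since Var(p_uk q_ik) = Var(p_uk) Var(q_ik)
    for independent zero-mean factors, the factor scales below split the
    target variances 0.04 and 0.01 evenly between P and Q.
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(0, np.sqrt(0.09), n)                    # Var(a_u) = 0.09
    b = rng.normal(0, np.sqrt(0.16), m)                    # Var(b_i) = 0.16
    scales = [0.04 ** 0.25, 0.01 ** 0.25]                  # per-factor std devs
    p = rng.normal(0, scales, (n, 2))
    q = rng.normal(0, scales, (m, 2))
    eps = rng.normal(0, np.sqrt(noise_var), (n, m))
    return a[:, None] + b[None, :] + p @ q.T + eps
```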

Evaluation
Test MSE for estimation of m_ui = E(r_ui)
– MSE = average of (m̂_ui − m_ui)² over the Test set
Limitations
– Not real data
– Only one replication
– No standard errors

PMF Results for k = 0

Restrictions on λ's     | Values of λ_a, λ_b | MSE for m̂
Grand mean; no (a, b)   | NA                 | .2979
λ_a = λ_b = 0           |                    |
λ_a = λ_b               |                    | .0546
Separate λ_a, λ_b       | 9.26,              |

PMF Results for k = 1

Restrictions on λ's          | Values of λ_a, λ_b, λ_PQ1 | MSE for m̂
Separate λ_a, λ_b            | 9.26,                     |
λ_a = λ_b = λ_PQ             |                           |
Separate λ_a, λ_b, λ_PQ1     | 8.50, 10.13, 13.44        | .0439

PMF Results for k = 2

Restrictions on λ's                 | Values of λ_a, λ_b, λ_PQ1, λ_PQ2 | MSE for m̂
Separate λ_a, λ_b, λ_PQ1            | 8.50, 10.13, 13.44, NA           | .0439
λ_a, λ_b, λ_PQ1 = λ_PQ2             | 8.44, 9.94, 19.84,               |
Separate λ_a, λ_b, λ_PQ1, λ_PQ2     | 8.43, 10.24, 13.38,              |

Results for Matrix Completion
Performs poorly on raw ratings
– MSE = .0693
– Not designed to estimate biases
Fit to residuals from PMF with k = 0
– MSE = .0477
– "Recovered" rank was 1
– Worse than MSE's from PMF: .0428 to

Results for BPMF
Raw ratings
– MSE = .0498, using k = 3
– Early stopping
– Not designed to estimate biases
Fit to residuals from PMF with k = 0
– MSE = .0433, using k = 2
– Near .0428, for best PMF w/ biases

Summary
No need for separate λ_P and λ_Q
Theory suggests using separate λ's for distinct sets of exchangeable parameters
– Biases vs. factors
– For individual factors
Tentative simulation results support the need for separate λ's across factors
– BPMF does so automatically
– PMF requires a way to do efficient tuning