Learning Recommender Systems with Adaptive Regularization


Presentation transcript:

Learning Recommender Systems with Adaptive Regularization. Steffen Rendle, WSDM 2012. Presenter: Haiqin Yang. Date: Mar. 21, 2012

Outline Introduction Factorization Machine Factorization Machine with Adaptive Regularization Evaluation Conclusion More stories

Collaborative Filtering Predict unobserved entries of a partially observed rating matrix.
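A minimal worked example of the low-rank model behind this slide (notation is mine, not from the slide): with a rank-k factorization, user u gets a factor vector w_u and item i a factor vector h_i, and an unobserved entry is predicted by their inner product:

    \hat{r}_{ui} = \langle \mathbf{w}_u, \mathbf{h}_i \rangle = \sum_{f=1}^{k} w_{u,f} \, h_{i,f}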

Overfitting Most state-of-the-art recommender methods (e.g., low-rank approximation) have a large number of model parameters and are thus prone to overfitting.

Solution to Overfitting Typically, L2 regularization is applied to prevent overfitting, e.g. in maximum-margin matrix factorization and probabilistic matrix factorization.

Regularization Parameters A generalized formulation of the regularized learning problem. Its success depends largely on the choice of the value(s) for the regularization parameter λ: if λ is chosen too small, the model overfits; if λ is chosen too large, the model underfits. Question: how to choose λ efficiently?
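A reconstruction of the generalized formulation the slide refers to (the slide's own equation image did not survive the transcript): the regularized optimum over the training data S, with a regularization value \lambda_\theta attached to each model parameter (or parameter group):

    \mathrm{OPT}(S, \lambda) := \operatorname{argmin}_{\Theta} \; \sum_{(\mathbf{x}, y) \in S} l\big(\hat{y}(\mathbf{x} \,|\, \Theta), y\big) \;+\; \sum_{\theta \in \Theta} \lambda_\theta \, \theta^2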

How to select parameters?
Validation Set Based Methods – search for optimal values using a withheld validation set:
– Grid search by cross-validation (a minimal baseline is sketched after this slide)
– Informed search: too complicated
– Regularization path: not available for all cases
  – The Entire Regularization Path for the Support Vector Machine, JMLR 2004
  – Piecewise linear regularized solution paths, Annals of Statistics 2007
  – These papers characterize the families of regularized problems whose solution paths are piecewise linear: the loss is piecewise quadratic as a function of the parameters, and the regularizer is piecewise linear as a function of the parameters.
Hierarchical Bayesian Methods – use a hierarchical model with hyperpriors on the prior distributions, typically optimized with Markov Chain Monte Carlo (MCMC).
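A minimal sketch of the grid-search baseline listed above, assuming a plain ridge-regression model and a held-out validation split (model, names, and data are illustrative, not from the paper):

    import numpy as np

    def fit_ridge(X, y, lam):
        # Closed-form ridge regression: w = (X^T X + lam * I)^{-1} X^T y
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

    def grid_search_lambda(X_tr, y_tr, X_va, y_va, grid):
        # Pick the lambda with the lowest squared error on the withheld validation set.
        best_lam, best_err = None, np.inf
        for lam in grid:
            w = fit_ridge(X_tr, y_tr, lam)
            err = np.mean((X_va @ w - y_va) ** 2)
            if err < best_err:
                best_lam, best_err = lam, err
        return best_lam, best_err

    # Toy usage: synthetic data and a logarithmic grid of candidate lambda values.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))
    y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)
    print(grid_search_lambda(X[:150], y[:150], X[150:], y[150:], np.logspace(-4, 2, 13)))

Grid search requires one full training run per candidate value (and per fold), which is exactly the cost the paper's adaptive method avoids.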

Factorization Machine (FM) From Matrix Factorization (MF) to the Factorization Machine: model equation and parameters (the slide's equations are restated below).
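For reference, the second-order FM model as defined in Rendle's "Factorization Machines" (ICDM 2010), restated here because the slide's equation images did not survive the transcript:

    \hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{p} w_i x_i + \sum_{i=1}^{p} \sum_{j=i+1}^{p} \langle \mathbf{v}_i, \mathbf{v}_j \rangle \, x_i x_j,
    \qquad \langle \mathbf{v}_i, \mathbf{v}_j \rangle = \sum_{f=1}^{k} v_{i,f} \, v_{j,f}

with parameters w_0 \in \mathbb{R}, \mathbf{w} \in \mathbb{R}^p, V \in \mathbb{R}^{p \times k}.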

FM vs. Other Factorization Models FM generalizes MF, SVD++, Pairwise Interaction Tensor Factorization (PITF), and Factorized Personalized Markov Chains (FPMC); see "Factorization Machines", ICDM 2010. An example: MF (worked out below).
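A sketch of the MF example the slide begins ("Let ..."): encode one user u and one item i as a one-hot feature vector, so x_u = x_i = 1 and all other entries are 0. Plugging this into the FM model above yields biased matrix factorization:

    \hat{y}(\mathbf{x}) = w_0 + w_u + w_i + \langle \mathbf{v}_u, \mathbf{v}_i \rangle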

Optimization and Algorithm Optimization target: the square loss over the training data, minimized by (stochastic) gradient descent (loss and gradients sketched below).
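A reconstruction of the optimization target and the per-example SGD step (squared loss with per-parameter L2 terms; \theta ranges over w_0, w_i, v_{i,f}, and \eta is the learning rate; the partial derivatives are the standard FM gradients):

    \min_{\Theta} \sum_{(\mathbf{x}, y) \in S} \big(\hat{y}(\mathbf{x} \,|\, \Theta) - y\big)^2 + \sum_{\theta} \lambda_\theta \, \theta^2

    \theta \leftarrow \theta - \eta \Big( 2\big(\hat{y}(\mathbf{x} \,|\, \Theta) - y\big) \frac{\partial \hat{y}}{\partial \theta} + 2 \lambda_\theta \theta \Big),
    \qquad \frac{\partial \hat{y}}{\partial w_0} = 1, \;\; \frac{\partial \hat{y}}{\partial w_i} = x_i, \;\; \frac{\partial \hat{y}}{\partial v_{i,f}} = x_i \sum_{j=1}^{p} v_{j,f} x_j - v_{i,f} x_i^2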

Adaptive Regularization Split the data into a training set and a validation set. Find the regularization values λ* that lead to the lowest error on the validation set, alternating optimization between the model parameters and λ. Problem: naively, the validation error on the right-hand side is independent of λ (it only depends on the already-fixed model parameters).

Adaptive Regularization Hint: the next model parameters do depend on λ. Recall the SGD update, expand the model at those future parameters, and take the validation error of the future parameters as the objective; this yields an update rule for λ (sketched below).
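A sketch of the look-ahead idea in generic notation (the slide's own equations were images, so the symbols here are mine): let \theta'(\lambda) denote a parameter after one further training SGD step taken with regularization \lambda; \lambda is then moved downhill on the validation loss of those future parameters:

    \theta'(\lambda) = \theta - \eta \Big( \frac{\partial l(\hat{y}(\mathbf{x} \,|\, \Theta), y)}{\partial \theta} + 2 \lambda \theta \Big), \quad (\mathbf{x}, y) \text{ from the training set}

    \lambda \leftarrow \lambda - \nu \, \frac{\partial}{\partial \lambda} \, l\big(\hat{y}(\mathbf{x}' \,|\, \Theta'(\lambda)), y'\big), \quad (\mathbf{x}', y') \text{ from the validation set}

    \text{where } \frac{\partial \theta'}{\partial \lambda} = -2 \eta \theta, \text{ so the chain rule gives the required gradient of the validation loss with respect to } \lambda.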

Adaptive Regularization Update rule for λ and the required gradients (a runnable sketch of the full alternating procedure follows).
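A minimal, self-contained sketch of the alternating SGDA loop. To keep it short it uses a plain linear model with squared loss instead of a full FM, a single shared lambda, and a nonnegativity clamp on lambda; the variable names and the exact interleaving of the two steps are my assumptions, not the paper's reference implementation.

    import numpy as np

    def sgda_linear(X_tr, y_tr, X_va, y_va, k_steps=2000, eta=0.01, nu=0.01, seed=0):
        # Alternating SGD: one step on the model w (training data, current lam),
        # then one step on lam (validation data, using the look-ahead parameters).
        rng = np.random.default_rng(seed)
        w = np.zeros(X_tr.shape[1])
        lam = 0.1
        for _ in range(k_steps):
            # (1) Model step on a random training example with the current lam.
            i = rng.integers(len(y_tr))
            x, y = X_tr[i], y_tr[i]
            w = w - eta * (2 * (x @ w - y) * x + 2 * lam * w)

            # (2) Lambda step: the *next* w depends on lam, so differentiate the
            # validation loss of the look-ahead parameters
            #   w' = w - eta * (grad_loss + 2 * lam * w),   d w'/d lam = -2 * eta * w.
            j = rng.integers(len(y_tr))
            xt, yt = X_tr[j], y_tr[j]
            w_next = w - eta * (2 * (xt @ w - yt) * xt + 2 * lam * w)
            v = rng.integers(len(y_va))
            xv, yv = X_va[v], y_va[v]
            dloss_dlam = 2 * (xv @ w_next - yv) * (xv @ (-2 * eta * w))
            lam = max(0.0, lam - nu * dloss_dlam)   # keep lambda nonnegative (my choice)
        return w, lam

    # Toy usage on synthetic data.
    rng = np.random.default_rng(1)
    X = rng.normal(size=(300, 8))
    y = X @ rng.normal(size=8) + 0.2 * rng.normal(size=300)
    w, lam = sgda_linear(X[:200], y[:200], X[200:], y[200:])
    print("learned lambda:", lam)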

Evaluation Datasets: MovieLens 1M, Netflix. Methods: Stochastic Gradient Descent (SGD), SGD with Adaptive Regularization (SGDA).

Accuracy vs. Latent Dimensions

Convergence

Evolution of λ Flexible regularization is better than a single regularization value shared across all dimensions.

Size of the Validation Set S_v The larger the validation set, the closer its error is to the test error; but too large a validation set reduces the training-set size, yielding poor performance.

Conclusion An adaptive regularization method based on the Factorization Machine; systematic experiments demonstrate the model's performance.

More Stories Reformulate the problem to create a new model: the Factorization Machine
– Factorization machines, ICDM 2010
– Fast context-aware recommendations with factorization machines, SIGIR 2011
– Learning recommender systems with adaptive regularization, WSDM 2012
– Bayesian factorization machines, NIPS 2011 Workshop

More Stories Modify existing techniques for new models: predictor-corrector path following.
Predictor step: from (\alpha_0, \sigma_0), predict where \alpha(\sigma_0 + h) should be using a first-order expansion, i.e., take \sigma_1 = \sigma_0 + h and \alpha_1 = \alpha_0 + h \frac{\partial \alpha}{\partial \sigma}(\sigma_0) (note that h can be chosen positive or negative, depending on the direction we want to follow).
Corrector steps: (\alpha_1, \sigma_1) might not satisfy J(\alpha_1, \sigma_1) = 0, i.e., the tangent prediction might (and generally will) leave the curve \alpha(\sigma). To return to the curve, Newton's method is used to solve the nonlinear system of equations (in \alpha) J(\alpha, \sigma_1) = 0, starting from \alpha = \alpha_1. If h is small enough, the Newton steps converge quadratically to a solution \alpha_2 of J(\alpha, \sigma_1) = 0.
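For completeness, the Newton corrector iteration mentioned above spelled out (a standard Newton step for solving J(\alpha, \sigma_1) = 0 in \alpha; not taken from the slide):

    \alpha^{(t+1)} = \alpha^{(t)} - \Big[ \frac{\partial J}{\partial \alpha}\big(\alpha^{(t)}, \sigma_1\big) \Big]^{-1} J\big(\alpha^{(t)}, \sigma_1\big), \qquad \alpha^{(0)} = \alpha_1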

Q & A