
1 Learning Recommender Systems with Adaptive Regularization
Steffen Rendle WSDM 2012 Presenter: Haiqin Yang Date: Mar

2 Outline
Introduction
Factorization Machine
Factorization Machine with Adaptive Regularization
Evaluation
Conclusion
More stories

3 Collaborative Filtering
Predict unobserved entries based on the partially observed rating matrix
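A tiny illustrative sketch (hypothetical numbers, untrained factors) of the setting: a partially observed rating matrix and a low-rank model whose product supplies predictions for the missing cells.

```python
import numpy as np

# Hypothetical 4-user x 5-item rating matrix; 0 marks an unobserved entry.
R = np.array([[5, 3, 0, 1, 0],
              [4, 0, 0, 1, 0],
              [1, 1, 0, 5, 4],
              [0, 1, 5, 4, 0]], dtype=float)
observed = R > 0

# Rank-2 latent factors (random here; learning them is the topic of the later slides).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 2))   # user factors
H = rng.normal(scale=0.1, size=(5, 2))   # item factors

# The low-rank model scores every cell, including the unobserved ones.
R_hat = W @ H.T
print(R_hat[~observed])                  # predictions for the missing entries
```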

4 Overfitting Most state-of-the-art recommender methods have a large number of model parameters and are thus prone to overfitting. Low-rank approximation

5 Solution to Overfitting
Typically, L2 regularization is applied to prevent overfitting, e.g.:
Maximum-margin matrix factorization
Probabilistic matrix factorization

6 Regularization Parameters
A generalized formulation: the success depends largely on the choice of the value(s) for λ. If λ is chosen too small, the model overfits; if λ is chosen too large, it underfits. Question: how to choose λ efficiently?
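A hedged reconstruction of the generalized formulation the slide refers to: a loss over the training observations plus an L2 penalty with one λ per parameter (or per group of parameters); the exact grouping is a design choice.

```latex
\operatorname*{arg\,min}_{\Theta}\;
  \sum_{(\mathbf{x},y)\in S} l\big(\hat{y}(\mathbf{x}\,|\,\Theta),\, y\big)
  \;+\; \sum_{\theta \in \Theta} \lambda_{\theta}\,\theta^{2}
```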

7 How to select parameters?
Validation Set Based Methods – search for optimal values using a withheld validation set
Grid search by cross validation
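A minimal sketch of grid search with a withheld validation set; `train` and `rmse` are placeholder functions, not part of the paper's code.

```python
import numpy as np

def grid_search(train, rmse, S_T, S_V, grid=(0.0, 0.001, 0.01, 0.1, 1.0)):
    """Try each candidate regularization value; keep the one with the lowest validation RMSE."""
    best_lam, best_err = None, np.inf
    for lam in grid:
        model = train(S_T, lam)      # one full training run per candidate -> expensive
        err = rmse(model, S_V)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err
```

With several regularization groups (and, later, per-dimension values), such a grid grows combinatorially, which is exactly why the slides ask how to choose λ efficiently.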

8 How to select parameters?
Validation Set Based Methods – search for optimal values using a withheld validation set
Grid search by cross validation
Informed search: too complicated
Regularization path: not available for all cases
The Entire Regularization Path for the Support Vector Machine, JMLR 2004
Piecewise Linear Regularized Solution Paths, Annals of Statistics 2007
The authors characterize the “families” of regularized problems that have the piecewise-linear path property: the loss function is piecewise quadratic as a function of the parameters, and the regularizer is piecewise linear as a function of the parameters.

9 How to select parameters?
Validation Set Based Methods – search for optimal values using a withheld validation set
Grid search by cross validation
Informed search: too complicated
Regularization path: not available for all cases
Hierarchical Bayesian Methods – use a hierarchical model with hyperpriors on the prior distributions
Typically optimized with Markov Chain Monte Carlo (MCMC)

10 Factorization Machine (FM)
Matrix Factorization (MF) vs. Factorization Machine
Model and parameters (the slide's formulas are images; a reconstruction follows below):
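The second-order FM model from the ICDM 2010 paper, which this slide summarizes, is:

```latex
\hat{y}(\mathbf{x}) \;=\; w_0 \;+\; \sum_{i=1}^{p} w_i\, x_i
  \;+\; \sum_{i=1}^{p}\sum_{j=i+1}^{p} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j,
\qquad
\Theta = \big\{\, w_0 \in \mathbb{R},\; \mathbf{w} \in \mathbb{R}^{p},\; V \in \mathbb{R}^{p \times k} \,\big\}
```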

11 FM vs. Other Factorization Models
Generalization of: MF, SVD++, Pairwise Interaction Tensor Factorization (PITF), Factorized Personalized Markov Chains (FPMC)
See “Factorization Machines”, ICDM 2010
An example: MF (worked out below)
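A sketch of the MF example: encode each (user u, item i) pair as a feature vector x with exactly two non-zero (indicator) entries; the FM prediction then collapses to biased matrix factorization.

```latex
x_u = x_i = 1,\quad x_j = 0 \ \text{otherwise}
\;\;\Longrightarrow\;\;
\hat{y}(\mathbf{x}) \;=\; w_0 + w_u + w_i + \langle \mathbf{v}_u, \mathbf{v}_i \rangle
```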

12 Optimization and Algorithm
Optimization target: square loss, minimized by gradient descent (a sketch follows below)
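A minimal sketch of one SGD epoch for the MF special case above, with square loss and per-group L2 regularization; the function name, index convention, and learning-rate handling are assumptions, not the paper's reference implementation.

```python
import numpy as np

def sgd_epoch(data, w0, w, V, lam_w, lam_v, lr=0.01):
    """One SGD pass over (row, col, y) triples for the MF special case of FM.

    w (1-D array) and V (2-D array, rows = latent vectors) are indexed so that
    user rows and item rows occupy disjoint index ranges. Sketch only.
    """
    for u, i, y in data:
        y_hat = w0 + w[u] + w[i] + V[u] @ V[i]
        e = y_hat - y                     # gradient of the square loss w.r.t. y_hat (up to a factor 2)
        w0   -= lr * e
        w[u] -= lr * (e + lam_w * w[u])
        w[i] -= lr * (e + lam_w * w[i])
        vu = V[u].copy()                  # keep the old value for the symmetric update
        V[u] -= lr * (e * V[i] + lam_v * V[u])
        V[i] -= lr * (e * vu   + lam_v * V[i])
    return w0, w, V

# Tiny usage example with made-up indices: users 0-1, items 2-3 (disjoint index ranges).
rng = np.random.default_rng(0)
w, V = np.zeros(4), rng.normal(scale=0.1, size=(4, 2))
w0, w, V = sgd_epoch([(0, 2, 5.0), (1, 3, 3.0)], 0.0, w, V, lam_w=0.01, lam_v=0.01)
```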

13 Adaptive Regularization
Split the data into a training set and a validation set
Find the regularization values λ* that lead to the lowest error on the validation set
Alternating optimization
Problem: the right-hand side is independent of λ
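Reading the slide as math (a reconstruction, following the paper's notation loosely): the naive alternating step would pick λ by minimizing the validation error of the current model, but that expression does not contain λ at all, so there is nothing to optimize.

```latex
\lambda^{*} \;=\; \operatorname*{arg\,min}_{\lambda}
  \sum_{(\mathbf{x}',y')\in S_V} l\big(\hat{y}(\mathbf{x}'\,|\,\Theta^{t}),\, y'\big)
\qquad \text{(the right-hand side does not depend on } \lambda \text{)}
```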

14 Adaptive Regularization
Hint: the next parameters depend on λ
Recall – Expansion – Objective – Update rule (reconstructed below)
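The trick, reconstructed from the slide labels: look one SGD step ahead. The next parameters do depend on λ through the regularized update, so λ can be chosen to minimize the validation error of those future parameters.

```latex
\theta^{t+1}(\lambda) \;=\; \theta^{t} - \alpha\Big(
   \tfrac{\partial}{\partial\theta}\, l\big(\hat{y}(\mathbf{x}\,|\,\Theta^{t}),\, y\big)
   + 2\,\lambda_{\theta}\,\theta^{t}\Big),
\qquad
\lambda^{*} \;=\; \operatorname*{arg\,min}_{\lambda}
  \sum_{(\mathbf{x}',y')\in S_V} l\big(\hat{y}(\mathbf{x}'\,|\,\Theta^{t+1}(\lambda)),\, y'\big)
```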

15 Adaptive Regularization
Update rule and gradients (reconstructed below)
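The λ gradient then follows from the chain rule applied to the update above (a sketch; the exact per-group bookkeeping is spelled out in the paper):

```latex
\frac{\partial}{\partial\lambda_{\theta}}\,
  l\big(\hat{y}(\mathbf{x}'\,|\,\Theta^{t+1}),\, y'\big)
  \;=\;
  \frac{\partial l}{\partial \hat{y}}\cdot
  \frac{\partial \hat{y}(\mathbf{x}'\,|\,\Theta^{t+1})}{\partial \theta^{t+1}}\cdot
  \frac{\partial \theta^{t+1}}{\partial \lambda_{\theta}},
\qquad
\frac{\partial \theta^{t+1}}{\partial \lambda_{\theta}} \;=\; -2\,\alpha\,\theta^{t}
```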

16 Evaluation
Datasets: MovieLens 1M, Netflix
Methods: Stochastic Gradient Descent (SGD), SGD with Adaptive Regularization (SGDA)

17 Accuracy vs. Latent Dimensions

18 Convergence

19 Evolution of λ Flexible regularization is better than one regularization value shared by all dimensions

20 Size of the Validation Set S_V
The larger the validation set, the closer its error is to the test error. Too large a validation set, however, reduces the training set size and yields poor performance.

21 Conclusion An adaptive regularization method based on the Factorization Machine; systematic experiments to demonstrate the model's performance

22 More Stories Reformulate the problem to create a new model: Factorization Machine
Factorization Machines, ICDM 2010
Fast Context-aware Recommendations with Factorization Machines, SIGIR 2011
Learning Recommender Systems with Adaptive Regularization, WSDM 2012
Bayesian Factorization Machines, NIPS 2011 Workshop

23 More Stories Modify existing techniques for new models
Predictor-Corrector
Predictor step: from $(\alpha_0, \sigma_0)$, predict where $\alpha(\sigma_0 + h)$ should be using a first-order expansion, i.e., take $\sigma_1 = \sigma_0 + h$ and $\alpha_1 = \alpha_0 + h\,\frac{\partial\alpha}{\partial\sigma}(\sigma_0)$ (note that $h$ can be chosen positive or negative, depending on the direction we want to follow).
Corrector steps: $(\alpha_1, \sigma_1)$ might not satisfy $J(\alpha_1, \sigma_1) = 0$, i.e., the tangent prediction might (and generally will) leave the curve $\alpha(\sigma)$. To return to the curve, Newton's method is used to solve the nonlinear system of equations (in $\alpha$) $J(\alpha, \sigma_1) = 0$, starting from $\alpha = \alpha_1$. If $h$ is small enough, the Newton steps converge quadratically to a solution $\alpha_2$ of $J(\alpha, \sigma_1) = 0$.
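A toy numerical illustration of the predictor-corrector idea on a made-up scalar equation $J(\alpha, \sigma) = \alpha^3 + \sigma\alpha - 1 = 0$ (nothing to do with the paper's setting): the tangent step predicts, Newton iterations correct.

```python
def J(alpha, sigma):
    """Toy optimality condition; its root alpha(sigma) is the 'solution path'."""
    return alpha**3 + sigma * alpha - 1.0

def dJ_dalpha(alpha, sigma):
    return 3.0 * alpha**2 + sigma

def dJ_dsigma(alpha, sigma):
    return alpha

def follow_path(alpha0, sigma0, h, n_steps):
    """Predictor-corrector continuation of the root of J(alpha, sigma) = 0."""
    alpha, sigma = alpha0, sigma0
    path = [(sigma, alpha)]
    for _ in range(n_steps):
        # Predictor: first-order (tangent) step; dalpha/dsigma = -J_sigma / J_alpha.
        dalpha = -dJ_dsigma(alpha, sigma) / dJ_dalpha(alpha, sigma)
        sigma += h
        alpha += h * dalpha
        # Corrector: Newton iterations on alpha bring the point back onto the curve.
        for _ in range(5):
            alpha -= J(alpha, sigma) / dJ_dalpha(alpha, sigma)
        path.append((sigma, alpha))
    return path

# Start on the curve at sigma = 0 (alpha = 1 solves alpha^3 = 1) and follow it.
print(follow_path(alpha0=1.0, sigma0=0.0, h=0.5, n_steps=4))
```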

24 Q & A

