
1 Learning Recommender Systems with Adaptive Regularization
Steffen Rendle WSDM 2012 Presenter: Haiqin Yang Date: Mar

2 Outline
Introduction
Factorization Machine
Factorization Machine with Adaptive Regularization
Evaluation
Conclusion
More stories

3 Collaborative Filtering
Predict unobserved entries based on the partially observed rating matrix
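A tiny illustrative sketch (hypothetical numbers, untrained factors) of the setting: a partially observed rating matrix and a low-rank model whose product supplies predictions for the missing cells.

```python
import numpy as np

# Hypothetical 4-user x 5-item rating matrix; 0 marks an unobserved entry.
R = np.array([[5, 3, 0, 1, 0],
              [4, 0, 0, 1, 0],
              [1, 1, 0, 5, 4],
              [0, 1, 5, 4, 0]], dtype=float)
observed = R > 0

# Rank-2 latent factors (random here; learning them is the topic of the later slides).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4, 2))   # user factors
H = rng.normal(scale=0.1, size=(5, 2))   # item factors

# The low-rank model scores every cell, including the unobserved ones.
R_hat = W @ H.T
print(R_hat[~observed])                  # predictions for the missing entries
```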

4 Overfitting Most state-of-the-art recommender methods have a large number of model parameters and are thus prone to overfitting. Low-rank approximation

5 Solution to Overfitting
Typically, L2 regularization is applied to prevent overfitting, e.g.:
Maximum-margin matrix factorization
Probabilistic matrix factorization

6 Regularization Parameters
A generalized formulation: the success depends largely on the choice of the value(s) for λ. If λ is chosen too small, the model overfits; if λ is chosen too large, it underfits. Question: how to choose λ efficiently?
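A hedged reconstruction of the generalized formulation the slide refers to: a loss over the training observations plus an L2 penalty with one λ per parameter (or per group of parameters); the exact grouping is a design choice.

```latex
\operatorname*{arg\,min}_{\Theta}\;
  \sum_{(\mathbf{x},y)\in S} l\big(\hat{y}(\mathbf{x}\,|\,\Theta),\, y\big)
  \;+\; \sum_{\theta \in \Theta} \lambda_{\theta}\,\theta^{2}
```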

7 How to select parameters?
Validation Set Based Methods – search for optimal values using a withheld validation set
Grid search by cross validation
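A minimal sketch of grid search with a withheld validation set; `train` and `rmse` are placeholder functions, not part of the paper's code.

```python
import numpy as np

def grid_search(train, rmse, S_T, S_V, grid=(0.0, 0.001, 0.01, 0.1, 1.0)):
    """Try each candidate regularization value; keep the one with the lowest validation RMSE."""
    best_lam, best_err = None, np.inf
    for lam in grid:
        model = train(S_T, lam)      # one full training run per candidate -> expensive
        err = rmse(model, S_V)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err
```

With several regularization groups (and, later, per-dimension values), such a grid grows combinatorially, which is exactly why the slides ask how to choose λ efficiently.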

8 How to select parameters?
Validation Set Based Methods – search for optimal values using a withheld validation set
Grid search by cross validation
Informed search: too complicated
Regularization path: not available for all cases
The Entire Regularization Path for the Support Vector Machine, JMLR 2004
Piecewise Linear Regularized Solution Paths, Annals of Statistics 2007
The authors characterize the “families” of regularized problems that have the piecewise-linear path property: the loss function is piecewise quadratic as a function of the parameters, and the regularizer is piecewise linear as a function of the parameters.

9 How to select parameters?
Validation Set Based Methods – search for optimal values using a withheld validation set
Grid search by cross validation
Informed search: too complicated
Regularization path: not available for all cases
Hierarchical Bayesian Methods – use a hierarchical model with hyperpriors on the prior distributions
Typically optimized with Markov Chain Monte Carlo (MCMC)

10 Factorization Machine (FM)
Matrix Factorization (MF) vs. Factorization Machine
Model and parameters (the slide's formulas are images; a reconstruction follows below):
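The second-order FM model from the ICDM 2010 paper, which this slide summarizes, is:

```latex
\hat{y}(\mathbf{x}) \;=\; w_0 \;+\; \sum_{i=1}^{p} w_i\, x_i
  \;+\; \sum_{i=1}^{p}\sum_{j=i+1}^{p} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j,
\qquad
\Theta = \big\{\, w_0 \in \mathbb{R},\; \mathbf{w} \in \mathbb{R}^{p},\; V \in \mathbb{R}^{p \times k} \,\big\}
```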

11 FM vs. Other Factorization Models
Generalization of: MF, SVD++, Pairwise Interaction Tensor Factorization (PITF), Factorized Personalized Markov Chains (FPMC)
See “Factorization Machines”, ICDM 2010
An example: MF (worked out below)
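A sketch of the MF example: encode each (user u, item i) pair as a feature vector x with exactly two non-zero (indicator) entries; the FM prediction then collapses to biased matrix factorization.

```latex
x_u = x_i = 1,\quad x_j = 0 \ \text{otherwise}
\;\;\Longrightarrow\;\;
\hat{y}(\mathbf{x}) \;=\; w_0 + w_u + w_i + \langle \mathbf{v}_u, \mathbf{v}_i \rangle
```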

12 Optimization and Algorithm
Optimization target: square loss, minimized by gradient descent (a sketch follows below)
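A minimal sketch of one SGD epoch for the MF special case above, with square loss and per-group L2 regularization; the function name, index convention, and learning-rate handling are assumptions, not the paper's reference implementation.

```python
import numpy as np

def sgd_epoch(data, w0, w, V, lam_w, lam_v, lr=0.01):
    """One SGD pass over (row, col, y) triples for the MF special case of FM.

    w (1-D array) and V (2-D array, rows = latent vectors) are indexed so that
    user rows and item rows occupy disjoint index ranges. Sketch only.
    """
    for u, i, y in data:
        y_hat = w0 + w[u] + w[i] + V[u] @ V[i]
        e = y_hat - y                     # gradient of the square loss w.r.t. y_hat (up to a factor 2)
        w0   -= lr * e
        w[u] -= lr * (e + lam_w * w[u])
        w[i] -= lr * (e + lam_w * w[i])
        vu = V[u].copy()                  # keep the old value for the symmetric update
        V[u] -= lr * (e * V[i] + lam_v * V[u])
        V[i] -= lr * (e * vu   + lam_v * V[i])
    return w0, w, V

# Tiny usage example with made-up indices: users 0-1, items 2-3 (disjoint index ranges).
rng = np.random.default_rng(0)
w, V = np.zeros(4), rng.normal(scale=0.1, size=(4, 2))
w0, w, V = sgd_epoch([(0, 2, 5.0), (1, 3, 3.0)], 0.0, w, V, lam_w=0.01, lam_v=0.01)
```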

13 Adaptive Regularization
Split the data into a training set and a validation set
Find the regularization values λ* that lead to the lowest error on the validation set
Alternating optimization
Problem: the right-hand side is independent of λ
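Reading the slide as math (a reconstruction, following the paper's notation loosely): the naive alternating step would pick λ by minimizing the validation error of the current model, but that expression does not contain λ at all, so there is nothing to optimize.

```latex
\lambda^{*} \;=\; \operatorname*{arg\,min}_{\lambda}
  \sum_{(\mathbf{x}',y')\in S_V} l\big(\hat{y}(\mathbf{x}'\,|\,\Theta^{t}),\, y'\big)
\qquad \text{(the right-hand side does not depend on } \lambda \text{)}
```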

14 Adaptive Regularization
Hint: the next parameters depend on λ
Recall – Expansion – Objective – Update rule (reconstructed below)
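The trick, reconstructed from the slide labels: look one SGD step ahead. The next parameters do depend on λ through the regularized update, so λ can be chosen to minimize the validation error of those future parameters.

```latex
\theta^{t+1}(\lambda) \;=\; \theta^{t} - \alpha\Big(
   \tfrac{\partial}{\partial\theta}\, l\big(\hat{y}(\mathbf{x}\,|\,\Theta^{t}),\, y\big)
   + 2\,\lambda_{\theta}\,\theta^{t}\Big),
\qquad
\lambda^{*} \;=\; \operatorname*{arg\,min}_{\lambda}
  \sum_{(\mathbf{x}',y')\in S_V} l\big(\hat{y}(\mathbf{x}'\,|\,\Theta^{t+1}(\lambda)),\, y'\big)
```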

15 Adaptive Regularization
Update rule and gradients (reconstructed below)
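The λ gradient then follows from the chain rule applied to the update above (a sketch; the exact per-group bookkeeping is spelled out in the paper):

```latex
\frac{\partial}{\partial\lambda_{\theta}}\,
  l\big(\hat{y}(\mathbf{x}'\,|\,\Theta^{t+1}),\, y'\big)
  \;=\;
  \frac{\partial l}{\partial \hat{y}}\cdot
  \frac{\partial \hat{y}(\mathbf{x}'\,|\,\Theta^{t+1})}{\partial \theta^{t+1}}\cdot
  \frac{\partial \theta^{t+1}}{\partial \lambda_{\theta}},
\qquad
\frac{\partial \theta^{t+1}}{\partial \lambda_{\theta}} \;=\; -2\,\alpha\,\theta^{t}
```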

16 Evaluation
Datasets: MovieLens 1M, Netflix
Methods: Stochastic Gradient Descent (SGD), SGD with Adaptive Regularization (SGDA)

17 Accuracy vs. Latent Dimensions

18 Convergence

19 Evolution of λ Flexible regularization is better than one regularization value shared by all dimensions

20 Size of the Validation Set S_V
The larger the validation set, the closer its error is to the test error. Too large a validation set, however, reduces the training set size and yields poor performance.

21 Conclusion An adaptive regularization method based on the Factorization Machine; systematic experiments to demonstrate the model's performance

22 More Stories Reformulate the problem to create a new model: Factorization Machine
Factorization Machines, ICDM 2010
Fast Context-aware Recommendations with Factorization Machines, SIGIR 2011
Learning Recommender Systems with Adaptive Regularization, WSDM 2012
Bayesian Factorization Machines, NIPS 2011 Workshop

23 More Stories Modify existing techniques for new models
Predictor-Corrector
Predictor step: from $(\alpha_0, \sigma_0)$, predict where $\alpha(\sigma_0 + h)$ should be using a first-order expansion, i.e., take $\sigma_1 = \sigma_0 + h$ and $\alpha_1 = \alpha_0 + h\,\frac{\partial\alpha}{\partial\sigma}(\sigma_0)$ (note that $h$ can be chosen positive or negative, depending on the direction we want to follow).
Corrector steps: $(\alpha_1, \sigma_1)$ might not satisfy $J(\alpha_1, \sigma_1) = 0$, i.e., the tangent prediction might (and generally will) leave the curve $\alpha(\sigma)$. To return to the curve, Newton's method is used to solve the nonlinear system of equations (in $\alpha$) $J(\alpha, \sigma_1) = 0$, starting from $\alpha = \alpha_1$. If $h$ is small enough, the Newton steps converge quadratically to a solution $\alpha_2$ of $J(\alpha, \sigma_1) = 0$.
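A toy numerical illustration of the predictor-corrector idea on a made-up scalar equation $J(\alpha, \sigma) = \alpha^3 + \sigma\alpha - 1 = 0$ (nothing to do with the paper's setting): the tangent step predicts, Newton iterations correct.

```python
def J(alpha, sigma):
    """Toy optimality condition; its root alpha(sigma) is the 'solution path'."""
    return alpha**3 + sigma * alpha - 1.0

def dJ_dalpha(alpha, sigma):
    return 3.0 * alpha**2 + sigma

def dJ_dsigma(alpha, sigma):
    return alpha

def follow_path(alpha0, sigma0, h, n_steps):
    """Predictor-corrector continuation of the root of J(alpha, sigma) = 0."""
    alpha, sigma = alpha0, sigma0
    path = [(sigma, alpha)]
    for _ in range(n_steps):
        # Predictor: first-order (tangent) step; dalpha/dsigma = -J_sigma / J_alpha.
        dalpha = -dJ_dsigma(alpha, sigma) / dJ_dalpha(alpha, sigma)
        sigma += h
        alpha += h * dalpha
        # Corrector: Newton iterations on alpha bring the point back onto the curve.
        for _ in range(5):
            alpha -= J(alpha, sigma) / dJ_dalpha(alpha, sigma)
        path.append((sigma, alpha))
    return path

# Start on the curve at sigma = 0 (alpha = 1 solves alpha^3 = 1) and follow it.
print(follow_path(alpha0=1.0, sigma0=0.0, h=0.5, n_steps=4))
```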

24 Q & A

