1
Learning Recommender Systems with Adaptive Regularization
Steffen Rendle WSDM 2012 Presenter: Haiqin Yang Date: Mar
2
Outline
Introduction
Factorization Machine
Factorization Machine with Adaptive Regularization
Evaluation
Conclusion
More Stories
3
Collaborative Filtering
Predict unobserved entries based on a partially observed matrix
4
Overfitting
Most state-of-the-art recommender methods have a large number of model parameters and are thus prone to overfitting.
Low-rank approximation (a sketch of the idea follows)
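As a minimal sketch of what low-rank approximation means here (the notation is mine, not from the slides): the rating matrix is approximated by the product of two thin factor matrices, so each predicted entry is an inner product of a user factor and an item factor.

R \approx W H^{\top}, \qquad W \in \mathbb{R}^{|U| \times k},\; H \in \mathbb{R}^{|I| \times k}, \qquad \hat{r}_{ui} = \langle \mathbf{w}_u, \mathbf{h}_i \rangle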
5
Solution to Overfitting
Typically, L2 regularization is applied to prevent overfitting, e.g.:
Maximum-margin matrix factorization
Probabilistic matrix factorization
6
Regularization Parameters
A generalized formulation (sketched below)
The success depends largely on the choice of the value(s) for λ
If λ is chosen too small, the model overfits; if λ is chosen too large, the model underfits.
Question: How to choose λ efficiently?
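A sketch of the generalized L2-regularized objective assumed here, following the formulation in the paper (S is the training data, Θ the model parameters, and each parameter may have its own regularization constant λ_θ):

\operatorname{OPT}(S, \lambda) = \operatorname*{argmin}_{\Theta} \sum_{(\mathbf{x}, y) \in S} \bigl( \hat{y}(\mathbf{x} \mid \Theta) - y \bigr)^2 + \sum_{\theta \in \Theta} \lambda_{\theta}\, \theta^2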
7
How to select parameters?
Validation-set-based methods – search for optimal values using a withheld validation set
Grid search by cross validation
8
How to select parameters?
Validation-set-based methods – search for optimal values using a withheld validation set
Grid search by cross validation (see the sketch after this slide)
Informed search: too complicated
Regularization path: not common for all cases
The Entire Regularization Path for the Support Vector Machine, JMLR 2004
Piecewise Linear Regularized Solution Paths, Annals of Statistics 2007
The authors characterize the "families" of regularized problems that have the piecewise-linear path property, i.e.:
The loss function is piecewise quadratic as a function of the parameters
The regularization term is piecewise linear as a function of the parameters
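A minimal sketch of grid search over the regularization constant on a withheld validation set (the helpers fit_model and validation_error are placeholders, not from the paper):

def grid_search_lambda(train, valid, fit_model, validation_error,
                       grid=(0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0)):
    # Fit one model per candidate lambda and keep the one with the lowest
    # validation error. Simple but expensive: every grid point is a full training run.
    best_lam, best_err, best_model = None, float("inf"), None
    for lam in grid:
        model = fit_model(train, lam)          # full training run for this lambda
        err = validation_error(model, valid)   # e.g. RMSE on the withheld set
        if err < best_err:
            best_lam, best_err, best_model = lam, err, model
    return best_lam, best_model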
9
How to select parameters?
Validation-set-based methods – search for optimal values using a withheld validation set
Grid search by cross validation
Informed search: too complicated
Regularization path: not common for all cases
Hierarchical Bayesian methods – use a hierarchical model with hyperpriors on the prior distribution
Typically optimized with Markov Chain Monte Carlo (MCMC)
10
Factorization Machine (FM)
Matrix Factorization (MF)
Factorization Machine
Model: \hat{y}(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle \mathbf{v}_i, \mathbf{v}_j \rangle\, x_i x_j
Parameters: w_0 \in \mathbb{R},\; \mathbf{w} \in \mathbb{R}^{n},\; V \in \mathbb{R}^{n \times k}
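A small sketch of the second-order FM prediction (the function and variable names are mine; it uses the O(nk) reformulation of the pairwise term from the ICDM 2010 paper):

import numpy as np

def fm_predict(x, w0, w, V):
    # Second-order factorization machine prediction for one feature vector x.
    # x: (n,) features, w0: scalar bias, w: (n,) linear weights, V: (n, k) latent factors.
    linear = w0 + w @ x
    # Pairwise term via the O(n*k) identity:
    # sum_{i<j} <v_i, v_j> x_i x_j = 0.5 * sum_f [ (sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2 ]
    s = V.T @ x                      # (k,) per-factor sums
    s_sq = (V ** 2).T @ (x ** 2)     # (k,) per-factor sums of squares
    return linear + 0.5 * float(np.sum(s * s - s_sq))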
11
FM vs. Other Factorization Models
FM generalizes:
MF
SVD++
Pairwise Interaction Tensor Factorization (PITF)
Factorized Personalized Markov Chains (FPMC)
See "Factorization Machines", ICDM 2010
An example: MF – let the feature vector contain exactly two indicator variables, one marking the user and one marking the item; the FM model then reduces to (biased) matrix factorization (a small encoding sketch follows).
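To make the MF example concrete, here is a hypothetical one-hot encoding of a (user, item) pair; with only these two active features, fm_predict from the sketch above collapses to biased MF:

import numpy as np

def encode_user_item(u, i, n_users, n_items):
    # One-hot encode a (user, item) pair as an FM feature vector:
    # one indicator block for users, one for items.
    x = np.zeros(n_users + n_items)
    x[u] = 1.0
    x[n_users + i] = 1.0
    return x

# With only these two active features, fm_predict(x, w0, w, V) equals
# w0 + w[u] + w[n_users + i] + <V[u], V[n_users + i]>, i.e. biased MF.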
12
Optimization and Algorithm
Optimization target: squared loss plus L2 regularization
Gradient descent (a per-example sketch follows this slide)
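A sketch of one stochastic gradient step for the FM above under regularized squared loss (the learning rate eta and the single shared constant lam are illustrative simplifications; the paper regularizes parameter groups separately):

import numpy as np

def sgd_step(x, y, w0, w, V, lam, eta):
    # One SGD step on a single (x, y) example under squared loss + L2 penalty lam.
    # Reuses fm_predict from the earlier sketch; eta is the learning rate.
    err = fm_predict(x, w0, w, V) - y          # residual
    s = V.T @ x                                # cached per-factor sums
    # d y_hat / d V[i, f] = x[i] * s[f] - V[i, f] * x[i]^2
    grad_V = np.outer(x, s) - V * (x ** 2)[:, None]
    w0 = w0 - eta * (2 * err + 2 * lam * w0)
    w = w - eta * (2 * err * x + 2 * lam * w)
    V = V - eta * (2 * err * grad_V + 2 * lam * V)
    return w0, w, V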
13
Adaptive Regularization
Split the data into a training set S_T and a validation set S_V
Find the regularization values λ* that lead to the lowest error on the validation set
Alternating optimization: update the model parameters on S_T and the regularization values on S_V
Problem: the right-hand side is independent of λ
14
Adaptive Regularization
Hint: the next parameters Θ^(t+1) (after one more SGD step on the training data) do depend on λ
Recall the SGD update for Θ and expand it
Objective: minimize the validation error of the future parameters Θ^(t+1) with respect to λ
Update rule: a gradient step on λ
15
Adaptive Regularization
Update rule: a gradient step on λ toward lower validation error of Θ^(t+1)
Gradients: obtained by the chain rule through Θ^(t+1), which depends on λ (a simplified sketch follows)
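A heavily simplified sketch of the alternating scheme: a single λ shared by all parameters, a generic model given by predict/grad callables, and NumPy parameter vectors. The names and signature are illustrative; the paper's exact updates differ in details such as per-group λ values.

def sgda_step(theta, lam, train_xy, val_xy, predict, grad, eta, alpha):
    # One round of SGD with adaptive regularization (simplified sketch).
    # theta: NumPy parameter vector; lam: shared L2 constant.
    # predict(x, theta) -> y_hat; grad(x, theta) -> d y_hat / d theta (flat vector).
    # eta: step size for theta, alpha: step size for lam.
    x_t, y_t = train_xy
    x_v, y_v = val_xy

    # 1) Ordinary SGD step on a training example (regularized squared loss).
    err_t = predict(x_t, theta) - y_t
    theta_next = theta - eta * (2 * err_t * grad(x_t, theta) + 2 * lam * theta)

    # 2) theta_next depends on lambda: d theta_next / d lam = -2 * eta * theta.
    #    Chain rule through the *validation* error of theta_next gives the lambda step.
    err_v = predict(x_v, theta_next) - y_v
    dlam = 2 * err_v * (grad(x_v, theta_next) @ (-2 * eta * theta))
    lam = max(0.0, lam - alpha * dlam)   # keep the regularization value non-negative

    return theta_next, lam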
16
Evaluation
Datasets: MovieLens 1M, Netflix
Methods:
Stochastic Gradient Descent (SGD)
SGD with Adaptive Regularization (SGDA)
17
Accuracy vs. Latent Dimensions
18
Convergence
19
Evolution of the regularization values
Flexible regularization is better than one regularization value for all dimensions
20
Size of Validation Set Sv
The larger the validation set, the closer its error is to the test error
A too-large validation set reduces the training-set size, yielding poor performance
21
Conclusion
An adaptive regularization method based on the Factorization Machine
Systematic experiments to demonstrate the model's performance
22
More Stories
Reformulate the problem to create a new model: Factorization Machine
Factorization Machines, ICDM 2010
Fast Context-Aware Recommendations with Factorization Machines, SIGIR 2011
Learning Recommender Systems with Adaptive Regularization, WSDM 2012
Bayesian Factorization Machines, NIPS 2011 Workshop
23
More Stories Modify existing techniques for new models
Predictor-Corrector
Predictor step: from (\alpha_0, \sigma_0), predict where \alpha(\sigma_0 + h) should be using a first-order expansion, i.e., take \sigma_1 = \sigma_0 + h and \alpha_1 = \alpha_0 + h \frac{\partial \alpha}{\partial \sigma}(\sigma_0) (note that h can be chosen positive or negative, depending on the direction we want to follow).
Corrector steps: (\alpha_1, \sigma_1) might not satisfy J(\alpha_1, \sigma_1) = 0, i.e., the tangent prediction might (and generally will) leave the curve \alpha(\sigma). To return to the curve, Newton's method is used to solve the nonlinear system of equations (in \alpha) J(\alpha, \sigma_1) = 0, starting from \alpha = \alpha_1. If h is small enough, the Newton steps converge quadratically to a solution \alpha_2 of J(\alpha, \sigma_1) = 0.
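A generic sketch of one predictor-corrector continuation step, assuming J(alpha, sigma) returns a residual vector the same length as alpha and alpha0 already solves J(., sigma0) = 0; the Jacobians are numerical, and scipy's fsolve stands in for the Newton-type corrector (all names are illustrative):

import numpy as np
from scipy.optimize import approx_fprime, fsolve

def continuation_step(J, alpha0, sigma0, h, eps=1e-6):
    # One predictor-corrector step along the solution curve J(alpha, sigma) = 0.
    alpha0 = np.atleast_1d(np.asarray(alpha0, dtype=float))
    n = alpha0.size
    # Numerical Jacobians of J with respect to alpha and sigma.
    J_alpha = np.array([approx_fprime(alpha0, lambda a, i=i: J(a, sigma0)[i], eps)
                        for i in range(n)])
    J_sigma = (np.asarray(J(alpha0, sigma0 + eps)) - np.asarray(J(alpha0, sigma0))) / eps
    # Predictor: implicit function theorem gives d(alpha)/d(sigma) = -J_alpha^{-1} J_sigma.
    sigma1 = sigma0 + h
    alpha1 = alpha0 + h * (-np.linalg.solve(J_alpha, J_sigma))
    # Corrector: Newton-type solve of J(alpha, sigma1) = 0, starting from the prediction.
    alpha2 = fsolve(lambda a: np.asarray(J(a, sigma1)), alpha1)
    return alpha2, sigma1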
24
Q & A