Lecture 5: Squared Error Matrix Factorization. Instructor: Max Welling.


Remember: User Ratings. The data is a movie-by-user rating matrix: roughly 17,770 movies, roughly 240,000 users, and a total of roughly 400,000,000 nonzero entries (the matrix is about 99% sparse). We are going to "squeeze out" the noisy, uninformative information from this matrix. In doing so, the algorithm will learn to retain only the most valuable information for predicting the ratings in the data matrix. This can then be used to predict more reliably the entries that were not rated.

Squeezing out Information. We "squeeze" the movie-by-user rating matrix (again roughly 17,770 movies by 240,000 users) into the product of two much smaller matrices: a movies-by-K matrix times a K-by-users matrix. "K" is the number of factors, or topics.
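The factorization diagram itself is an image and does not survive in the transcript. As a sketch in standard notation, with R the movie-by-user rating matrix, A the movies-by-K matrix, and B the K-by-users matrix (consistent with the A, B and R_mu notation used on later slides), the squeeze is a low-rank approximation:

R \approx A\,B, \qquad R_{mu} \approx \sum_{k=1}^{K} A_{mk} B_{ku}, \qquad K \ll \text{(number of movies or users)}.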

Interpretation. Before, each movie was characterized by a signature over roughly 240,000 user ratings. Now, each movie is represented by a signature of K "movie-genre" values (or topics, or factors). Movies that are similar are expected to have similar values for their movie genres, e.g. star-wars = [10, 3, -4, -1, 0.01]. Likewise, users were previously characterized by their ratings over all movies. Now, each user is represented by his/her values over the movie genres. Users that are similar have similar values for their movie genres.

Dual Interpretation. Interestingly, we can also interpret the factors/topics as user communities. Each user is then represented by his/her "participation" in K communities, and each movie is represented by the ratings these communities give to that movie. Pick whichever interpretation you like! I'll call them topics or factors from now on.

The Algorithm. In an equation, the squeezing boils down to approximating the rating matrix by the product of the two factor matrices A and B. We want to find A and B such that we can reproduce the observed ratings as well as possible, given that we only have K topics. This reconstruction of the ratings will not be perfect (otherwise we would over-fit). To do so, we minimize the squared reconstruction error (a Frobenius norm over the observed entries).
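The equation images on this slide are not part of the transcript. Writing E for the error (my symbol choice), a sketch of the squared-error objective in the notation above, summing only over the observed movie-user pairs, is:

E(A, B) \;=\; \sum_{(m,u)\ \mathrm{observed}} \Big( R_{mu} - \sum_{k=1}^{K} A_{mk} B_{ku} \Big)^{2}.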

Prediction. We train A and B to reproduce the observed ratings only, and we ignore all the non-rated movie-user pairs. But here comes the magic: after learning A and B, the product A times B will have filled-in values for the non-rated movie-user pairs! It has implicitly used the information from similar users and similar movies to generate ratings for entries that weren't there before. This is what "learning" is: we can predict things we haven't seen before by looking at old data.
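As a minimal NumPy sketch (assuming A has shape movies-by-K and B has shape K-by-users and have already been learned; the function name is mine, not from the lecture), predicting any entry, rated or not, is just an inner product:

import numpy as np

def predict(A, B, m, u):
    # Reconstructed rating for movie m and user u: sum_k A[m, k] * B[k, u]
    return A[m, :] @ B[:, u]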

The Algorithm. We want to minimize the error, so let's take the gradients again. But wait, how do we ignore the non-observed entries? We simply restrict the sums in the error and its gradients to the observed movie-user pairs. We could now follow these gradients, but for Netflix this will actually be very slow. How can we make this practical?
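The gradient images are likewise missing from the transcript. Under the squared-error objective sketched above, the exact (batch) gradients, with the sums restricted to observed pairs, would be:

\frac{\partial E}{\partial A_{mk}} = -2 \sum_{u:\,(m,u)\ \mathrm{obs}} \Big( R_{mu} - \sum_{k'} A_{mk'} B_{k'u} \Big) B_{ku},
\qquad
\frac{\partial E}{\partial B_{ku}} = -2 \sum_{m:\,(m,u)\ \mathrm{obs}} \Big( R_{mu} - \sum_{k'} A_{mk'} B_{k'u} \Big) A_{mk},

with a descent step A_{mk} \leftarrow A_{mk} - \eta\, \partial E / \partial A_{mk} (and similarly for B), where \eta is a small step size (the symbol \eta is my choice, not from the slide).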

Stochastic Gradient Descent. First we pick a single observed movie-user pair at random: R_mu. Then we ignore the sums over u and m in the exact gradients and do an update based on this single pair. The trick is that although we don't follow the exact gradient, on average we do move in the correct direction.
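A minimal NumPy sketch of one such stochastic update (the step size eta and the variable names are mine; the slide's exact update image is not in the transcript):

import numpy as np

def sgd_update(A, B, m, u, r_mu, eta=0.01):
    # Error on the single randomly chosen observed pair (m, u).
    err = r_mu - A[m, :] @ B[:, u]
    # Move both factor vectors along the negative stochastic gradient
    # (the constant factor of 2 is absorbed into eta).
    a_old = A[m, :].copy()
    A[m, :] += eta * err * B[:, u]
    B[:, u] += eta * err * a_old

Repeatedly sampling observed (m, u, R_mu) triples at random and applying this update implements the stochastic gradient descent described above.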

Stochastic Gradient Descent. (The slide's figure contrasts stochastic updates with full updates averaged over all data items.) Stochastic gradient descent does not converge to the minimum, but "dances" around it. To get to the minimum, one needs to decrease the step size as one gets closer to the minimum. Alternatively, one can obtain a few samples of the parameters and average the predictions over them (similar to bagging).
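The slide does not specify a schedule for decreasing the step size; one common choice (an assumption on my part, not from the lecture) is to let it decay with the iteration count t, for example \eta_t = \eta_0 / (1 + t/\tau), so that the updates settle down as they approach the minimum.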

Weight-Decay. Often it is good to make sure that the values in A and B do not grow too big. We can make that happen by adding weight-decay terms that keep them small. Small weights are often better for generalization, but how much decay you need depends on how many topics, K, you choose. The simplest weight decay shrinks the entries of A and B a little at every update; a more aggressive variant replaces this simple shrinkage with a stronger one.
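A sketch of the simplest (L2) version, with \lambda an assumed symbol for the decay strength: the objective gains penalty terms

E(A, B) = \sum_{(m,u)\ \mathrm{observed}} \Big( R_{mu} - \sum_{k} A_{mk} B_{ku} \Big)^{2} + \lambda \sum_{m,k} A_{mk}^{2} + \lambda \sum_{k,u} B_{ku}^{2},

so that the stochastic update sketched earlier becomes, for example, A[m, :] += eta * (err * B[:, u] - lam * A[m, :]) and similarly for B[:, u], where lam is the assumed decay parameter.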

Incremental Updates. One idea is to train one factor at a time. This is fast, and one can monitor performance on held-out test data as each factor is added. A new factor is trained on the residuals of the model built up so far. If you fit slowly (i.e. don't go all the way to convergence, and use weight decay / shrinkage), results are better.
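In notation not taken from the slide, the residual idea can be sketched as follows: after factors 1, ..., k-1 have been trained, factor k is fit to the residual ratings

R^{(k)}_{mu} \;=\; R_{mu} \;-\; \sum_{j=1}^{k-1} A_{mj} B_{ju} \qquad \text{for the observed pairs } (m, u),

using the same stochastic updates as before, but now only on the two vectors A_{\cdot k} and B_{k \cdot}.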

Pre-Processing. We are almost ready for implementation. But we can get a head start if we first remove some "obvious" structure from the data, so the algorithm doesn't have to search for it. In particular, you can subtract the user and movie means, where the means are computed while ignoring the missing entries (N_mu on the slide denotes the total number of observed movie-user pairs). Note that you can then predict missing entries by simply using these means.
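A minimal NumPy sketch of one reasonable version of this preprocessing (the slide's exact formula is not in the transcript, and the array names movies, users, ratings for the observed triples are assumptions): compute the means over observed entries only and use them as a baseline prediction.

import numpy as np

def fit_means(movies, users, ratings, n_movies, n_users):
    # Global mean over the observed ratings only.
    mu = ratings.mean()
    # Per-movie and per-user means of the centered ratings; movies/users
    # with no observed ratings keep a mean of 0.
    cnt_m = np.maximum(np.bincount(movies, minlength=n_movies), 1)
    cnt_u = np.maximum(np.bincount(users, minlength=n_users), 1)
    movie_mean = np.bincount(movies, weights=ratings - mu, minlength=n_movies) / cnt_m
    user_mean = np.bincount(users, weights=ratings - mu, minlength=n_users) / cnt_u
    return mu, movie_mean, user_mean

def baseline_predict(mu, movie_mean, user_mean, m, u):
    # Prediction for a missing entry (m, u) from the means alone.
    return mu + movie_mean[m] + user_mean[u]

The matrix factorization is then trained on the ratings with this baseline subtracted.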

Homework. Subtract the movie and user means from the Netflix data. Check theoretically and experimentally that both the user and movie means are then 0. Make a prediction for the quiz data and submit it to Netflix (your RMSE should be around 0.99). Implement the stochastic matrix factorization and submit to Netflix (you should get around 0.93).