CSE 491/891 Lecture 25 (Mahout)
Outline
So far, we have looked at:
- Hadoop (for writing native programs to analyze big data on distributed machines)
- Pig and Hive (for querying and summarizing big data)
In the next two lectures, we will use Mahout for more complex analysis (clustering, classification, etc.)
What is Mahout?
An open-source data analysis library with implementations for:
- Clustering
- Classification
- Collaborative filtering
Written in Java; built to take advantage of Hadoop for large-scale analysis problems
Available on AWS EMR, but you will need to wait about 10 minutes for it to launch
List of Mahout Programs
Mahout Programs
Mahout Classification
Several implementations are available in Mahout:
- Logistic regression: a linear classifier
  - mahout trainlogistic: build a model from training data
  - mahout runlogistic: apply the model to test data
- Naïve Bayes classifier: a probabilistic classifier implemented for text classification
  - mahout trainclassifier: build a model from training data
  - mahout testclassifier: apply the model to test data
Logistic Regression
Example: training data with one predictor (Age) and a binary class label (Buy):

Age | 10 | 15 | 18 | 19 | 24 | 29 | 30 | 31 | 40 | 44 | 55 | 64
Buy | -1 | -1 | -1 | -1 | -1 | -1 | +1 | +1 | +1 | +1 | +1 | +1
Logistic Regression
Fit a linear regression model to the same data: y = w·x + c, i.e., Buy = w·Age + c
Problem: the predicted value ranges from -∞ to +∞, even though Buy only takes the values -1 and +1
Logistic Regression
Classification rule based on the fitted line:
Buy = +1 if w·Age + c > 0, -1 otherwise
The decision boundary is at Age = -c/w = 31.6
Logistic Regression
Instead of modeling y as a function of x, logistic regression learns a model for p(y|x):
log[ p(y=1|x) / p(y=0|x) ] = w·x + b
which is equivalent to
p(y=1|x) = 1 / (1 + e^(-(w·x+b)))
a logistic function whose values range between 0 and 1
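The second form follows from the first by exponentiating the log-odds and solving for the probability (writing p for p(y=1|x)):

```latex
\log\frac{p}{1-p} = wx + b
\;\Longrightarrow\;
\frac{p}{1-p} = e^{wx+b}
\;\Longrightarrow\;
p = \frac{e^{wx+b}}{1 + e^{wx+b}} = \frac{1}{1 + e^{-(wx+b)}}
```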
Logistic Regression
Fitted logistic regression model for the example data:
P(Buy|Age) = 1 / (1 + e^(-(0.1085·Age - 3.274)))
The decision boundary is where P(Buy|Age) = 0.5, i.e., where 0.1085·Age - 3.274 = 0, giving Age = 3.274/0.1085 ≈ 30.18
Logistic Regression (Summary)
A linear classifier that predicts the probability that a data instance x belongs to some class y:
p(y|x) = 1 / (1 + e^(-(wᵀx + c)))
The parameters (w, c) are estimated from the training data using an online algorithm known as stochastic gradient descent, which repeatedly updates the weights as
w_new = w_old + η · (y - p(y|x)) · x
where η is the learning rate.
Mahout Logistic Regression
Example: diabetes.csv
- Classes: positive (+1) or negative (-1)
- You can store the input data in a local directory (instead of HDFS)
- The first line of the file contains the names of the attributes
Mahout Logistic Regression
To train a classifier: mahout trainlogistic <options>
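A sketch of what such a training run might look like. The predictor names (glucose, bmi, age) and the target column name (class) are assumptions for illustration; use the attribute names from the first line of your diabetes.csv:

```shell
mahout trainlogistic \
  --input diabetes.csv \
  --output diabetes.model \
  --target class --categories 2 \
  --predictors glucose bmi age --types numeric \
  --features 10 --passes 100 --rate 50
```

Here --passes controls how many times SGD sweeps over the data and --rate sets the initial learning rate.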
Mahout Logistic Regression
Mahout Logistic Regression
Training:
Mahout Logistic Regression
To apply the classifier to a test set: mahout runlogistic <options>
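A sketch of applying a trained model (the input file and model path are illustrative):

```shell
mahout runlogistic \
  --input diabetes.csv \
  --model diabetes.model \
  --auc --confusion
```

The --auc flag prints the area under the ROC curve and --confusion prints the confusion matrix.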
Mahout Logistic Regression
Example: applying the model back to the training set gives the confusion matrix

            predicted +   predicted -
actual +        115           151
actual -        153           345

Accuracy = (115 + 345) / 764 ≈ 0.60. Can we improve the results?
Mahout Logistic Regression
One option: change lambda, the regularization parameter that controls the complexity of the model to avoid overfitting
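For example, trainlogistic accepts the regularization strength through its --lambda option (the value and the column names below are illustrative, not tuned):

```shell
# retrain with explicit regularization; other options as in a normal run
mahout trainlogistic \
  --input diabetes.csv --output diabetes.model \
  --target class --categories 2 \
  --predictors glucose bmi age --types numeric \
  --lambda 0.0001
```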
Mahout Logistic Regression
Another option: standardize the predictor attributes (diabetes_s.csv):
x → (x - μ) / σ
where μ and σ are the mean and standard deviation of the attribute.
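As a sketch of what this preprocessing step does (the file ages.txt and its contents are made up for illustration; this runs outside of Mahout):

```shell
# Two-pass z-score standardization of a one-column numeric file:
# pass 1 accumulates mean/variance, pass 2 rewrites each value as (x - mean)/sd.
# Note: this uses the population standard deviation.
printf '10\n15\n18\n19\n24\n29\n30\n31\n40\n44\n55\n64\n' > ages.txt
awk 'NR==FNR {s += $1; ss += $1*$1; n++; next}
     FNR==1  {m = s/n; sd = sqrt(ss/n - m*m)}
     {printf "%.4f\n", ($1 - m)/sd}' ages.txt ages.txt > ages_std.txt
```

After standardization every attribute is on a comparable scale, which helps SGD converge.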
Mahout Logistic Regression
Example: applying the model to the standardized training set gives

            predicted +   predicted -
actual +        184           118
actual -         84           382

Accuracy = (184 + 382) / 768 ≈ 0.74
Mahout Collaborative Filtering
Collaborative filtering methods are used to rank the preference of users on various items
Mahout provides several ways to implement collaborative filtering (see Lecture 15):
- Nearest-neighbor similarity
- Matrix factorization
Example: a ratings matrix of users (John, Mary, Lee, Joe) by movies (Mission Impossible, Over the Hedge, Back to the Future, Harry Potter), where most entries are missing; e.g., John's row is 5, 3, 4, ?, and Lee and Joe each have only a single known rating
Technique: Matrix Factorization
Given: ratings matrix R (users × items)
Goal: decompose R into a product of matrices U and Mᵀ (the superscript T denotes the matrix transpose operation) that best approximates R:
R ≈ U Mᵀ
where U is the user feature matrix (users × features) and M is the item feature matrix (items × features). The product U Mᵀ fills in predictions for the missing entries of R.
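In this factorization, each user u has a feature vector (row u of U) and each item i a feature vector (row i of M), and a missing rating is predicted by their inner product:

```latex
R \approx U M^{T}, \qquad
\hat{r}_{ui} = \mathbf{u}_u^{\top} \mathbf{m}_i = \sum_{k=1}^{f} U_{uk}\, M_{ik}
```

where f is the number of latent features (the --numFeatures option in Mahout).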
Example: Last.fm Playlist
- Raw data can be downloaded from m-360K.html
- The original data contains (user, artist, #plays), i.e., the number of times a user plays a song by the artist
- The data contains 359K users, 300K artists, and 17M ratings
- The task is to predict whether a user likes to play songs by a particular artist
Raw Data
The raw data file has 4 tab-separated columns:
userID  artistID  artist-name  plays
where userID and artistID are identifier strings assigned by MusicBrainz (an open source music encyclopedia)
Problems:
- Not all artists have an artistID, so we use the artist name instead
- Some users do not have a profile because they are not registered, so we ignore them
Example for userID 0029ef14ff1a743eb44f62b8d87f90f7f44098f0
Data Preprocessing
- Convert the number of plays into ordinal "ratings" from 1 (seldom plays) to 5 (often plays)
- Transform the play counts for each user into a Z-score by subtracting the user's mean and dividing by the user's standard deviation, then assign ratings as follows:

  Z-score          Rating
  Z ≤ -2             1
  -2 < Z ≤ -1        2
  -1 < Z < 1         3
  1 ≤ Z < 2          4
  Z ≥ 2              5

- Some users have played songs by different artists only once, so their standard deviation is 0 (such users are ignored)
- The result is a tab-separated file (music.ratings) with columns: userID  artistID  rating
(The per-rating counts #(user, artist) listed on the slide include 326, 218524, and 884949.)
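The Z-score-to-rating binning can be sketched as a small shell function (the function name is ours, not part of Mahout):

```shell
# Map a per-user z-score of play counts to a 1-5 rating,
# using the bin edges from the preprocessing step above.
zscore_to_rating() {
  awk -v z="$1" 'BEGIN {
    if      (z <= -2) r = 1
    else if (z <= -1) r = 2
    else if (z <  1)  r = 3
    else if (z <  2)  r = 4
    else              r = 5
    print r
  }'
}
zscore_to_rating -2.5   # rating 1
zscore_to_rating 0.3    # rating 3
```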
Workflow for Mahout Collaborative Filtering
1. Upload the data from the local directory to HDFS
2. mahout splitDataset: split the data into a training set and a test set
3. mahout parallelALS: factorize the ratings matrix
4. mahout evaluateFactorization / recommendfactorized: evaluate the factorization and produce recommendations
Mahout's Recommendation Algorithm
Use Mahout's latent factor approach
Step 1: Load the data to lastFM/input on HDFS
Step 2: Split the data set into a training and a probing (testing) set using mahout splitDataset
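A sketch of Steps 1-2 (the split percentages and the local file name are illustrative; splitDataset writes trainingSet/ and probeSet/ under its output path):

```shell
hadoop fs -put music.ratings lastFM/input
mahout splitDataset \
  --input lastFM/input \
  --output lastFM/data \
  --trainingPercentage 0.9 \
  --probePercentage 0.1
```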
Mahout's Recommendation Algorithm
Step 3: Factorize the matrix using mahout parallelALS
- --input: input directory for the training data on HDFS
- --output: path where the output should be saved (the output contains the user feature matrix U and the item feature matrix M)
- --lambda (double): regularization parameter to avoid overfitting
- --numFeatures: number of latent factors
- --numIterations: number of iterations used to factorize the matrix
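For example (the parameter values here are illustrative, not tuned):

```shell
mahout parallelALS \
  --input lastFM/data/trainingSet \
  --output lastFM/als \
  --lambda 0.065 \
  --numFeatures 20 \
  --numIterations 10
```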
Evaluation (RMSE)
The root mean square error (RMSE) measures the average size of the prediction error. If the RMSE is around 2, an actual rating of 4 could on average be predicted anywhere between 2 and 6, i.e., from 2 up to the top of the 1-5 scale.
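The metric itself is easy to sketch on a made-up file of (actual, predicted) pairs:

```shell
# RMSE = sqrt(mean of squared differences); pairs.txt is invented test data.
printf '4\t3\n5\t5\n2\t4\n' > pairs.txt
awk -F'\t' '{d = $1 - $2; s += d*d; n++}
            END {printf "%.4f\n", sqrt(s/n)}' pairs.txt   # prints 1.2910
```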
Mahout's Recommendation Algorithm
Step 4: Compute the root mean square error (RMSE) of the predictions on the test (probe) set using mahout evaluateFactorization, then view the resulting error
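A sketch of Step 4 (assuming parallelALS wrote its U/ and M/ feature matrices under lastFM/als; the error is written to rmse.txt under the evaluation output path):

```shell
mahout evaluateFactorization \
  --input lastFM/data/probeSet \
  --userFeatures lastFM/als/U \
  --itemFeatures lastFM/als/M \
  --output lastFM/rmse
hadoop fs -cat lastFM/rmse/rmse.txt
```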
Mahout's Recommendation Algorithm
Step 5: Get the predicted rankings of the items in the test (probe) set using mahout recommendfactorized
The output in lastFM/output/pred can be compared against the true values in lastFM/data/probeSet (using the RMSE from Step 4)
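A sketch of Step 5. The --input here is a file of user IDs to score; its path (users.txt) and the recommendation count are illustrative assumptions:

```shell
mahout recommendfactorized \
  --input lastFM/input/users.txt \
  --userFeatures lastFM/als/U \
  --itemFeatures lastFM/als/M \
  --numRecommendations 10 \
  --maxRating 5 \
  --output lastFM/output/pred
```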
Mahout's Recommendation Algorithm
Predicted ratings for each user …