CSE 491/891 Lecture 25 (Mahout)
Outline
- So far, we have looked at Hadoop (for writing native programs to analyze big data on distributed machines) and Pig and Hive (for querying and summarizing big data)
- In the next two lectures, we will use Mahout for more complex analysis (clustering, classification, etc.)
What is Mahout?
- An open-source data analysis library with implementations for
  - Clustering
  - Classification
  - Collaborative filtering
- Written in Java; built to take advantage of Hadoop for large-scale analysis problems
- Available on AWS EMR, but you will need to wait about 10 minutes for the cluster to launch
List of Mahout Programs
Mahout Classification
- Several implementations are available in Mahout
- Logistic regression: a linear classifier
  - mahout trainlogistic - to build a model from training data
  - mahout runlogistic - to apply the model to test data
- Naïve Bayes classifier: a probabilistic classifier implemented for text classification
  - mahout trainclassifier - to build a model from training data
  - mahout testclassifier - to apply the model to test data
Logistic Regression
Example: predict Buy (-1 or +1) from Age
  Age: 10, 15, 18, 19, 24, 29, 30, 31, 40, 44, 55, 64
  Buy: -1 for the younger ages, +1 for the older ages
Logistic Regression
- Fit a linear regression model: y = wx + c
  Buy = 0.0366 × Age − 1.1572
- Problem: the predicted value ranges from −∞ to +∞, so it cannot be interpreted as a probability
Logistic Regression
- Classification rule:
  Buy = +1 if 0.0366 × Age − 1.1572 > 0, and −1 otherwise
- Decision boundary is at Age = 1.1572 / 0.0366 ≈ 31.6
Logistic Regression
- Instead of modeling y as a function of x, logistic regression learns a model for p(y|x)
- Logistic regression:
  log [ p(y=1|x) / p(y=0|x) ] = wx + b
  p(y=1|x) = 1 / (1 + e^−(wx+b))
- This is the logistic function, whose values range between 0 and 1
Logistic Regression
- Fitted model:
  P(Buy|Age) = 1 / (1 + e^−(0.1085 × Age − 3.2751))
- Decision boundary is at P(Buy|Age) = 0.5, i.e., when Age = 3.2751 / 0.1085 = 30.18
Logistic Regression (Summary)
- A linear classifier that predicts the probability that a data instance x belongs to some class y:
  p(y|x) = 1 / (1 + e^−(wᵀx + c))
- Parameters (w, c) are estimated from the training data using an online algorithm known as stochastic gradient descent:
  w_new = w_old + η × error(y, x)
  where η is the learning rate
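The update rule above can be illustrated with a small, self-contained sketch in plain Python (this is not Mahout's implementation, and the toy Age/Buy labels below are assumed for illustration only):

```python
import math

def train_logistic_sgd(xs, ys, lr=0.1, epochs=2000):
    """Fit p(y=1|x) = 1 / (1 + exp(-(w*x + c))) by stochastic gradient descent."""
    w, c = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + c)))
            err = y - p            # prediction error drives the update
            w += lr * err * x      # w_new = w_old + (learning rate) * error * x
            c += lr * err
    return w, c

# Hypothetical toy labels (1 = buy, 0 = no buy), assumed for illustration
ages = [10, 15, 18, 19, 24, 29, 30, 31, 40, 44, 55, 64]
buys = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
scaled = [a / 10.0 for a in ages]        # rescale the feature for stable updates
w, c = train_logistic_sgd(scaled, buys)
boundary_age = -c / w * 10.0             # Age at which p(y=1|x) crosses 0.5
```

Each pass over the data nudges (w, c) toward reducing the error on one instance at a time, which is why the method scales to data streams and large files.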
Mahout Logistic Regression
- Example: diabetes.csv
- Classes: positive (+1) or negative (-1)
- You can store the input data in a local directory (instead of HDFS)
- The first line of the file contains the names of the attributes
Mahout Logistic Regression
- To train a classifier: mahout trainlogistic <options>
Mahout Logistic Regression Training:
Mahout Logistic Regression
- To apply the classifier to a test set: mahout runlogistic <options>
Mahout Logistic Regression
- Example: apply the model to predict the training set

  Confusion matrix:
                 Actual +   Actual −
    Predicted +     115        151
    Predicted −     153        345

- Accuracy = (115 + 345) / 764 ≈ 0.60
- Can we improve the results?
Mahout Logistic Regression
- Change lambda (the parameter that controls the complexity of the model, to avoid overfitting)
Mahout Logistic Regression
- Standardize the predictor attributes (diabetes_s.csv):
  x → (x − μ) / σ
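Standardization is computed per column; a minimal sketch in plain Python (the example values are made up, not taken from diabetes.csv):

```python
def standardize(values):
    """Z-score a column: subtract the mean, divide by the standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    return [(v - mu) / sigma for v in values]

# Hypothetical predictor column
glucose = [148, 85, 183, 89, 137, 116, 78, 115]
z = standardize(glucose)   # standardized values have mean 0 and std 1
```

Putting all predictors on a common scale keeps one large-valued attribute from dominating the gradient updates, which is why it helps the SGD-based trainer.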
Mahout Logistic Regression
- Example: apply the model to predict the standardized training set

  Confusion matrix:
                 Actual +   Actual −
    Predicted +     184        118
    Predicted −      84        382

- Accuracy = (184 + 382) / 768 ≈ 0.74
Mahout Collaborative Filtering
- Collaborative filtering methods are used to predict the preferences of users for various items
- Mahout provides several ways to implement collaborative filtering approaches (see lecture 15)
  - Nearest-neighbor similarity
  - Matrix factorization
- Example: a sparse user × movie ratings matrix over Mission Impossible, Over the Hedge, Back to the Future, and Harry Potter; John's ratings are 5, 3, 4, and ? (unknown), while Mary, Lee, and Joe have rated few or none of the movies (e.g., Lee gave a 2 and Joe a 1)
Technique: Matrix Factorization
- Given: a ratings matrix R (users × items)
- Goal: decompose R into a product of two matrices, R ≈ U Mᵀ (the superscript T denotes the matrix transpose operation), that best approximates R
  - U: user feature matrix (users × features)
  - M: item feature matrix (items × features)
- Multiplying U Mᵀ yields a fully filled-in predicted ratings matrix, including values for the missing entries
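A toy version of this factorization can be sketched with alternating least squares (ALS) in NumPy. This is illustrative only: Mahout's parallelALS is a distributed implementation of the same idea, and the small ratings matrix below is made up (lam plays the role of Mahout's --lambda, k of --numFeatures):

```python
import numpy as np

def als(R, mask, k=2, lam=0.05, iters=20):
    """Factorize R ~ U @ M.T using only the observed entries (mask == 1)."""
    n_users, n_items = R.shape
    rng = np.random.default_rng(0)
    U = rng.standard_normal((n_users, k)) * 0.1
    M = rng.standard_normal((n_items, k)) * 0.1
    I = np.eye(k)
    for _ in range(iters):
        for u in range(n_users):          # fix M, solve a ridge system per user
            obs = mask[u] == 1
            A = M[obs].T @ M[obs] + lam * I
            U[u] = np.linalg.solve(A, M[obs].T @ R[u, obs])
        for i in range(n_items):          # fix U, solve a ridge system per item
            obs = mask[:, i] == 1
            A = U[obs].T @ U[obs] + lam * I
            M[i] = np.linalg.solve(A, U[obs].T @ R[obs, i])
    return U, M

# Toy ratings matrix; 0 entries are unobserved
R = np.array([[5., 3., 4., 0.],
              [4., 0., 4., 2.],
              [1., 1., 0., 5.],
              [0., 1., 2., 4.]])
mask = (R > 0).astype(int)
U, M = als(R, mask)
pred = U @ M.T          # predicted (filled-in) ratings matrix
```

Because each least-squares subproblem touches only one user's (or one item's) observed ratings, the user and item updates parallelize naturally, which is what makes ALS a good fit for Hadoop.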
Example: Last.fm Playlist
- Raw data can be downloaded from http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/lastfm-360K.html
- The original data contains (user, artist, #plays), i.e., the number of times a user plays songs by the artist
- The data contains 359K users, 300K artists, and 17M ratings
- The task is to predict whether a user likes to play songs by a particular artist
Raw Data
- The raw data file has 4 tab-separated columns:
  userID artistID artist-name plays
  where userID and artistID are identifier strings assigned by MusicBrainz (an open-source music encyclopedia)
- Problems:
  - Not all artists have an artistID, so we use the artist name instead
  - Some users do not have a profile because they are not registered, so we ignore them
- Example for userID 0029ef14ff1a743eb44f62b8d87f90f7f44098f0
Data Preprocessing
- Convert the number of plays into ordinal “ratings” from 1 (seldom plays) to 5 (often plays)
- Transform each user's play counts into Z-scores by subtracting the user's mean and dividing by the user's standard deviation, then assign ratings as follows:

    Z-score          Rating   #(user, artist)
    Z ≤ −2             1             326
    −2 < Z ≤ −1        2         218,524
    −1 < Z < 1         3      15,351,861
    1 ≤ Z < 2          4       1,102,438
    Z ≥ 2              5         884,949

- Some users have played songs by each of their artists only once, so their standard deviation is 0 (such users will be ignored)
- Output is a tab-separated file (music.ratings): userID artistID rating
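The binning above can be sketched directly from the table (plain Python; the example play counts in the usage lines are made up):

```python
def plays_to_ratings(play_counts):
    """Map one user's play counts to 1-5 ratings via per-user Z-scores.

    Returns None when the standard deviation is 0 (e.g., the user played
    every artist the same number of times), mirroring the rule that such
    users are ignored.
    """
    n = len(play_counts)
    mu = sum(play_counts) / n
    sigma = (sum((c - mu) ** 2 for c in play_counts) / n) ** 0.5
    if sigma == 0:
        return None

    def rating(z):
        if z <= -2:
            return 1
        if z <= -1:
            return 2
        if z < 1:
            return 3
        if z < 2:
            return 4
        return 5

    return [rating((c - mu) / sigma) for c in play_counts]
```

Normalizing per user matters because a heavy listener's 50 plays may mean less than a light listener's 5; the Z-score puts both on the same scale before binning.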
Workflow for Mahout Collaborative Filtering
  Local directory → upload data to HDFS → mahout splitDataset → training set + test set → mahout parallelALS → mahout evaluateFactorization / mahout recommendfactorized
Mahout’s Recommendation Algorithm
- Use Mahout’s latent factor approach
- Step 1: load the data to lastFM/input on HDFS
- Step 2: split the data set into a training set and a probing (testing) set using mahout splitDataset
Mahout’s Recommendation Algorithm
- Use Mahout’s latent factor approach
- Step 3: factorize the matrix using mahout parallelALS
  --input: input directory for the training data on HDFS
  --output: path where the output should be saved (the output contains the user feature matrix U and the item feature matrix M)
  --lambda (double): regularization parameter to avoid overfitting
  --numFeatures: number of latent factors
  --numIterations: number of iterations used to factorize the matrix
Evaluation (RMSE)
- RMSE = sqrt( mean over the test set of (actual rating − predicted rating)² )
- If the error is around 2, it means an actual rating of 4 could be predicted anywhere between 2 and 5 on average (the rating scale tops out at 5)
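RMSE itself is a one-line computation; a quick sketch in plain Python:

```python
import math

def rmse(actual, predicted):
    """Root mean square error between actual and predicted ratings."""
    diffs = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(diffs) / len(diffs))

# e.g., rmse([4, 2, 5], [3, 2, 4]) measures the typical prediction error
```

Squaring before averaging penalizes large misses more heavily than small ones, so a model with a few wildly wrong predictions scores worse than one with many small errors.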
Mahout’s Recommendation Algorithm
- Use Mahout’s latent factor approach
- Step 4: compute the root mean square error (RMSE) of the predictions on the test (probe) set using mahout evaluateFactorization
- View the resulting root mean square error in the output
Mahout’s Recommendation Algorithm
- Use Mahout’s latent factor approach
- Step 5: get the predicted ratings of items in the test (probe) set using mahout recommendfactorized
- The output in lastFM/output/pred can be compared against the true values in lastFM/data/probeSet (the RMSE is around 0.528 – see step 4)
Mahout Recommendation Algorithm Predicted ratings for each user …