CSE 491/891 Lecture 25 (Mahout)

Outline
So far, we have looked at:
  Hadoop (for writing native programs to analyze big data on distributed machines)
  Pig and Hive (for querying and summarizing big data)
In the next two lectures, we will use Mahout for more complex analysis (clustering, classification, etc.)

What is Mahout?
An open-source data analysis package with implementations of:
  Clustering
  Classification
  Collaborative filtering
Written in Java; built to take advantage of Hadoop for large-scale analysis problems.
Available on AWS EMR, but you'll need to wait about 10 minutes for it to be launched.

List of Mahout Programs

Mahout Programs

Mahout Classification
There are several classification implementations available in Mahout:
Logistic regression: a linear classifier
  mahout trainlogistic - to build a model from training data
  mahout runlogistic - to apply the model to test data
Naïve Bayes classifier: a probabilistic classifier implemented for text classification
  mahout trainclassifier - to build a model from training data
  mahout testclassifier - to apply the model to test data

Logistic Regression
Example training data:

  Age                              Buy
  10, 15, 18, 19                   -1
  24, 29, 30, 31, 40, 44, 55, 64   +1

Logistic Regression
Fit a linear regression model y = wx + c to the Age/Buy data above:

  Buy = 0.0366 Age - 1.1572

Problem: the predicted value ranges from -∞ to +∞.
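As an illustration (Age = 100 is an arbitrary value outside the observed range), plugging it into the fitted line gives

  Buy = 0.0366 × 100 - 1.1572 = 2.50

which is far outside the -1/+1 labels, so the linear output cannot be read directly as a class label or a probability.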

Logistic Regression
Classification rule (for the same Age/Buy data):

  Buy = +1 if 0.0366 Age - 1.1572 > 0, and -1 otherwise

The decision boundary is at Age = 1.1572/0.0366 = 31.6.

Logistic Regression
Instead of modeling y as a function of x, logistic regression learns a model for p(y|x):

  log [ p(y=1|x) / p(y=0|x) ] = wx + b

or equivalently

  p(y=1|x) = 1 / (1 + e^-(wx+b))

a logistic function whose values range between 0 and 1.

Logistic Regression
Fitting a logistic regression model to the same Age/Buy data:

  P(Buy|Age) = 1 / (1 + e^(-0.1085 Age + 3.2751))

The decision boundary is at P(Buy|Age) = 0.5, i.e., when Age = 3.2751/0.1085 = 30.18.
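As an illustration (Age = 40 is an arbitrary value), the fitted model gives

  P(Buy|Age=40) = 1 / (1 + e^(-(0.1085 × 40 - 3.2751))) = 1 / (1 + e^(-1.065)) ≈ 0.74

which is above 0.5, so this person would be predicted to buy.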

Logistic Regression (Summary)
A linear classifier that predicts the probability that a data instance x belongs to some class y:

  p(y|x) = 1 / (1 + e^-(w^T x + c))

The parameters (w, c) are estimated from the training data using an online algorithm known as stochastic gradient descent, whose update is

  w_new = w_old + η × error(y, x)

where η is the learning rate.

Mahout Logistic Regression
Example data set: diabetes.csv
  Classes: positive (+1) or negative (-1)
  You can store the input data in a local directory (instead of HDFS)
  The first line of the file contains the names of the attributes

Mahout Logistic Regression
To train a classifier: mahout trainlogistic <options> (a sketch of a full command is shown below)
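A typical invocation might look like the sketch below. The attribute names (class, plasma, bmi, age) are placeholders; use the target and predictor names that actually appear in the first line of diabetes.csv. The numeric settings are illustrative only, and the flags are those of Mahout 0.x's trainlogistic driver.

  # Train a logistic regression model from a local CSV file
  mahout trainlogistic \
      --input diabetes.csv \
      --output model \
      --target class \
      --categories 2 \
      --predictors plasma bmi age \
      --types numeric \
      --features 20 --passes 100 --rate 50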

Mahout Logistic Regression

Mahout Logistic Regression Training:

Mahout Logistic Regression
To apply a trained classifier to a test set: mahout runlogistic <options> (see the sketch below)
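For example (paths are illustrative; --auc and --confusion ask Mahout 0.x's runlogistic to print the area under the ROC curve and the confusion matrix):

  # Apply the trained model and report AUC and the confusion matrix
  mahout runlogistic \
      --input diabetes.csv \
      --model model \
      --auc \
      --confusion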

Mahout Logistic Regression
Example: applying the model back to the training set gives the confusion matrix

           +     -
     +    115   151
     -    153   345

Accuracy = 0.60. Can we improve the results?

Mahout Logistic Regression
One way to improve the results is to change lambda, the parameter that controls the complexity of the model to avoid overfitting (see the sketch below).
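For instance, the training command can be rerun with an explicit regularization value (the --lambda flag of trainlogistic in Mahout 0.x; the value below is only illustrative, and the attribute names remain placeholders):

  # Retrain with explicit regularization; larger lambda means a simpler model
  mahout trainlogistic \
      --input diabetes.csv --output model \
      --target class --categories 2 \
      --predictors plasma bmi age --types numeric \
      --features 20 --passes 100 --rate 50 \
      --lambda 0.0001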

Mahout Logistic Regression
Another option is to standardize the predictor attributes (diabetes_s.csv):

  x → (x - μ) / σ

Mahout Logistic Regression
Example: applying the model to the standardized training set gives the confusion matrix

           +     -
     +    184   118
     -     84   382

Accuracy = 0.74

Mahout Collaborative Filtering
Collaborative filtering methods are used to rank users' preferences for various items.
Mahout provides several ways to implement collaborative filtering approaches (see Lecture 15):
  Nearest-neighbor similarity
  Matrix factorization
Example ratings matrix (users × movies) over the movies Mission Impossible, Over the Hedge, Back to the Future, and Harry Potter and the users John, Mary, Lee, and Joe: John's row is 5, 3, 4, ? (the ? is to be predicted), while the other users have rated only a few of the movies (e.g., a 2 from Lee and a 1 from Joe).

Technique: Matrix Factorization
Given: a ratings matrix R (users × items)
Goal: decompose R into a product of matrices U and M^T (the superscript T denotes the matrix transpose) that best approximates R:

  R ≈ U × M^T

where U is the user feature matrix (users × features) and M is the item feature matrix (items × features). The product U × M^T serves as the predicted (filled-in) rating matrix.
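In symbols, a common (unweighted) form of the objective that ALS minimizes is sketched below; Mahout's parallelALS implements a weighted-lambda variant of this regularized squared-error criterion, so treat this as an approximation rather than the exact formula it optimizes:

  \hat{r}_{ui} = \mathbf{u}_u^{\top}\mathbf{m}_i,
  \qquad
  \min_{U,M} \sum_{(u,i)\,\text{observed}} \left( r_{ui} - \mathbf{u}_u^{\top}\mathbf{m}_i \right)^2
  + \lambda \left( \sum_u \lVert \mathbf{u}_u \rVert^2 + \sum_i \lVert \mathbf{m}_i \rVert^2 \right)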

Example: last.FM Playlist
Raw data can be downloaded from http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/lastfm-360K.html
The original data contains (user, artist, #plays) triples, i.e., the number of times a user plays songs by the artist.
The data contains 359K users, 300K artists, and 17M ratings.
The task is to predict whether a user likes to play songs by a particular artist.

Raw Data
The raw data file has 4 tab-separated columns:

  userID  artistID  artist-name  plays

where userID and artistID are identifier strings assigned by MusicBrainz (an open-source music encyclopedia).
Problems:
  Not all artists have an artistID, so we use the artist name instead
  Some users do not have a profile because they are not registered, so we ignore them
(Example rows for userID 0029ef14ff1a743eb44f62b8d87f90f7f44098f0.)

Data Preprocessing
Convert the number of plays into ordinal "ratings" from 1 (seldom plays) to 5 (often plays):
Transform the play counts for each user into a Z-score by subtracting the user's mean and dividing by the user's standard deviation, then assign ratings as follows:

  Z-score          Z ≤ -2   -2 < Z ≤ -1   -1 < Z < 1   1 ≤ Z < 2    Z ≥ 2
  Rating              1          2             3           4          5
  #(user,artist)     326      218524      15351861     1102438     884949

Some users have played songs by different artists only once, so their standard deviation is 0 (such users are ignored).
The result is a tab-separated file (music.ratings) with columns: userID  artistID  rating

Workflow for Mahout Collaborative Filtering
Local directory → upload data to HDFS → mahout splitDataset (producing a training set and a test set) → mahout parallelALS → mahout evaluateFactorization / recommendfactorized

Mahout's Recommendation Algorithm
Use Mahout's latent factor approach.
Step 1: Load the data to lastFM/input on HDFS
Step 2: Split the data set into a training set and a probe (testing) set using mahout splitDataset
(A sketch of both steps is shown below.)
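A sketch of these two steps (paths and split percentages are illustrative; the splitDataset flags are those of Mahout 0.x):

  # Step 1: copy the preprocessed ratings file from the local directory to HDFS
  hadoop fs -mkdir -p lastFM/input
  hadoop fs -put music.ratings lastFM/input

  # Step 2: randomly split the ratings into a training set and a probe (test) set
  mahout splitDataset \
      --input lastFM/input \
      --output lastFM/data \
      --trainingPercentage 0.9 \
      --probePercentage 0.1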

Mahout's Recommendation Algorithm
Use Mahout's latent factor approach.
Step 3: Factorize the matrix using mahout parallelALS
  --input: input directory for the training data on HDFS
  --output: path where the output should be saved (the output contains the user feature matrix U and the item feature matrix M)
  --lambda (double): regularization parameter to avoid overfitting
  --numFeatures: number of latent factors
  --numIterations: number of iterations used to factorize the matrix
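For example (directory names and parameter values are illustrative; the training split is assumed to be in lastFM/data/trainingSet, alongside the probeSet directory used in Step 4):

  # Step 3: factorize the ratings matrix into user features U and item features M
  mahout parallelALS \
      --input lastFM/data/trainingSet \
      --output lastFM/als \
      --lambda 0.065 \
      --numFeatures 20 \
      --numIterations 10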

Evaluation (RMSE)
If the error is around 2, it means that an item whose actual rating is 4 could, on average, be predicted anywhere between 2 and 5 (on the 1-5 scale).
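For reference, the RMSE over the N held-out (user, artist) ratings in the probe set is:

  \text{RMSE} = \sqrt{ \frac{1}{N} \sum_{(u,i)} \left( r_{ui} - \hat{r}_{ui} \right)^2 }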

Mahout's Recommendation Algorithm
Use Mahout's latent factor approach.
Step 4: Compute the root mean square error (RMSE) of the predictions on the test (probe) set using mahout evaluateFactorization, then view the RMSE value from the file it writes (see the sketch below).
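A sketch of this step (paths are illustrative; evaluateFactorization writes its result to a file named rmse.txt inside the output directory, as in Mahout's example scripts):

  # Step 4: score the factorization on the probe set
  mahout evaluateFactorization \
      --input lastFM/data/probeSet \
      --userFeatures lastFM/als/U/ \
      --itemFeatures lastFM/als/M/ \
      --output lastFM/rmse

  # View the resulting RMSE value
  hadoop fs -cat lastFM/rmse/rmse.txt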

Mahout's Recommendation Algorithm
Use Mahout's latent factor approach.
Step 5: Get the predicted ratings of items in the test (probe) set using mahout recommendfactorized (see the sketch below).
The output in lastFM/output/pred can be compared against the true values in lastFM/data/probeSet (the RMSE is around 0.528; see Step 4).
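A sketch of this step (the userRatings directory is produced by parallelALS under its output path; the number of recommendations is illustrative, and maxRating matches the 1-5 rating scale):

  # Step 5: produce per-user predicted ratings / top-N recommendations
  mahout recommendfactorized \
      --input lastFM/als/userRatings/ \
      --userFeatures lastFM/als/U/ \
      --itemFeatures lastFM/als/M/ \
      --numRecommendations 6 \
      --maxRating 5 \
      --output lastFM/output/pred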

Mahout Recommendation Algorithm Predicted ratings for each user …