Presented by: Javier Pastorino Fall 2016

Apache Mahout

Agenda
  About Mahout
  Installation
  Algorithms
  Examples
    Classification
    Clustering

About Mahout
  An environment for quickly creating scalable, performant machine learning applications
  Exploits Hadoop for parallel processing
  Implements 3 ML techniques:
    Recommendation: combines personal and community information to make a recommendation
      Video streaming, like Netflix and Hulu
      Radio, like Pandora and Spotify
      Others, like eHarmony and Amazon
    Classification: uses known data to classify new data
      Antispam systems
    Clustering: groups data into new categories
  YouTube: BTI-360

Installation
  Quite easy: Download, Unzip, Ready
  Pre-Requisites
    Java Installed
    Apache Hadoop
      Download:
      Quick Setup: common/SingleCluster.html#Installing_Software
  Setup Environment Variables
    MAHOUT_HOME=/path/to/mahout
    MAHOUT_LOCAL=true #for running standalone on your dev machine, unset for running on a cluster
    JAVA_HOME=/usr/lib/jvm/java openjdk-amd64/jre
    HADOOP_HOME=/home/bdlab/hadoop-2.7.3
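
A quick check that the variables listed above are set can catch a misconfigured shell before the first Mahout run. The following is only a sketch: the function name check_mahout_env is hypothetical and not part of Mahout, and it tests only that each variable is non-empty, not that the paths are valid.

```shell
# Hypothetical helper (not shipped with Mahout): report which of the
# environment variables named on this slide are unset or empty.
check_mahout_env() {
  missing=0
  for name in MAHOUT_HOME JAVA_HOME HADOOP_HOME; do
    # Indirect lookup: copy the value of the variable named by $name.
    eval "value=\${${name}:-}"
    if [ -z "${value}" ]; then
      echo "missing: ${name}"
      missing=1
    fi
  done
  # MAHOUT_LOCAL is optional: set for standalone runs, unset on a cluster.
  if [ -n "${MAHOUT_LOCAL:-}" ]; then
    echo "mode: standalone (MAHOUT_LOCAL is set)"
  else
    echo "mode: cluster (MAHOUT_LOCAL is unset)"
  fi
  return "${missing}"
}
```

Calling check_mahout_env after sourcing the shell profile prints one line per missing variable and returns a non-zero status if any is missing, so it can guard a longer script.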

Algorithms
  Mahout Math-Scala Core Library and Scala DSL
    Mahout Distributed BLAS. Distributed Row Matrix API with R and Matlab like operators. Distributed ALS, SPCA, SSVD, thin-QR. Similarity Analysis.
  Mahout Interactive Shell
    Interactive REPL shell for Spark optimized Mahout DSL
  Collaborative Filtering with CLI drivers
    User-Based Collaborative Filtering
    Item-Based Collaborative Filtering
    Matrix Factorization with ALS
    Matrix Factorization with ALS on Implicit Feedback
    Weighted Matrix Factorization, SVD++
  Classification with CLI drivers
    Logistic Regression - trained via SGD
    Naive Bayes / Complementary Naive Bayes
    Hidden Markov Models
  Clustering with CLI drivers
    Canopy Clustering
    k-Means Clustering
    Fuzzy k-Means
    Streaming k-Means
    Spectral Clustering
  Dimensionality Reduction
    Singular Value Decomposition
    Lanczos Algorithm
    Stochastic SVD
    PCA (via Stochastic SVD)
    QR Decomposition


Examples - Classify Newsgroups
  Create a working directory for the dataset and all input/output.
  Convert the full 20 newsgroups dataset into a < Text, Text > SequenceFile.
    $ mahout seqdirectory -i ${WD}/20news-all -o ${WD}/20news-seq -ow
  Convert and preprocess the dataset into a < Text, VectorWritable > SequenceFile containing TF-IDF-weighted term frequencies for each document.
    $ mahout seq2sparse -i ${WD}/20news-seq -o ${WD}/20news-vectors -lnorm -nv -wt tfidf
  Split the preprocessed dataset into training and testing sets.
    $ mahout split -i ${WD}/20news-vectors/tfidf-vectors --trainingOutput ${WD}/20news-train-vectors --testOutput ${WD}/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
  Train the classifier.
    $ mahout trainnb -i ${WD}/20news-train-vectors -el -o ${WD}/model -li ${WD}/labelindex -ow -c
  Test the classifier.
    $ mahout testnb -i ${WD}/20news-test-vectors -m ${WD}/model -l ${WD}/labelindex -ow -o ${WD}/20news-testing -c
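
The first step (creating the working directory) has no command on the slide. A minimal sketch of it follows. The helper name prepare_workdir and the default location under /tmp are assumptions for illustration; the slides only refer to the directory as ${WD}.

```shell
# Hypothetical helper: create the working directory referred to as ${WD}
# in the commands above, plus the input directory read by seqdirectory.
prepare_workdir() {
  wd="$1"
  mkdir -p "${wd}/20news-all" || return 1
  # Warn (but do not fail) when the dataset has not been copied in yet.
  if [ -z "$(ls -A "${wd}/20news-all")" ]; then
    echo "warning: ${wd}/20news-all is empty; copy the 20 newsgroups files there" >&2
  fi
  # Print the directory so the caller can capture it.
  echo "${wd}"
}

# Assumed default location; any writable path works.
WD=$(prepare_workdir "${WD:-/tmp/mahout-work-${USER:-user}}")
```

With WD set this way, the mahout commands can be pasted as written. When running against a cluster instead of standalone, the input would also have to be copied into HDFS, which this sketch does not cover.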

Example – Clustering Reuters
  Selects clustering type: kmeans, fuzzykmeans, lda, or streamingkmeans
  Parse Data: Runs org.apache.lucene.benchmark.utils.ExtractReuters to generate reuters-out from reuters-sgm (the downloaded archive)
    $MAHOUT org.apache.lucene.benchmark.utils.ExtractReuters ${WD}/reuters-sgm ${WD}/reuters-out
  Runs seqdirectory to convert reuters-out to SequenceFile format
    $MAHOUT seqdirectory -i ${WD}/reuters-out -o ${WD}/reuters-out-seqdir -c UTF-8 -chunk 64 -xm sequential
  Runs seq2sparse to convert SequenceFiles to sparse vector format
    $MAHOUT seq2sparse -i ${WD}/reuters-out-seqdir/ -o ${WD}/reuters-out-seqdir-sparse-kmeans --maxDFPercent 85 --namedVector
  Runs k-means with 20 clusters
    $MAHOUT kmeans -i ${WD}/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ -c ${WD}/reuters-kmeans-clusters -o ${WD}/reuters-kmeans -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -x 10 -k 20 -ow --clustering
  Runs clusterdump to show results
    $MAHOUT clusterdump -i `$DFS -ls -d ${WD}/reuters-kmeans/clusters-*-final | awk '{print $8}'` -o ${WD}/reuters-kmeans/clusterdump -d ${WD}/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -sp 0 --pointsDir ${WD}/reuters-kmeans/clusteredPoints
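
The backquoted part of the clusterdump command finds the directory holding the final k-means iteration: it lists the entries matching clusters-*-final and keeps only the path column with awk. The sketch below isolates that step so it can be tried without Hadoop. The helper name final_cluster_path is hypothetical, and the sample listing line is illustrative, not real output. The slides do not define $MAHOUT or $DFS; they are presumably set to the mahout launcher and an HDFS file-listing command.

```shell
# Hypothetical helper: pick the path column (field 8) out of
# `-ls -d`-style listing output read from standard input.
final_cluster_path() {
  awk '{print $8}'
}

# Illustrative listing line, not real output. Fields in order:
# permissions, replication, owner, group, size, date, time, path.
sample='drwxr-xr-x   - bdlab supergroup          0 2016-10-01 12:00 /tmp/wd/reuters-kmeans/clusters-3-final'

# Prints: /tmp/wd/reuters-kmeans/clusters-3-final
printf '%s\n' "${sample}" | final_cluster_path
```

The iteration number in the directory name depends on when k-means converges (at most 10 here, given -x 10), which is why the command uses a wildcard rather than a fixed name.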

Example – Clustering Reuters: Results
  [Sample clusterdump output; most numeric values did not survive on this slide]
  Cluster: {"identifier":"VL-5965","r":[],"c":[{"10":2.643},{"11":2.714},{"16.2":7.612},{"16.9":7.545}, ...
  Top Terms: ionics, ion, nonrecurring, vs, backlog, net
  Evaluation metrics reported: Inter-Cluster Density, Intra-Cluster Density, CDbw Inter-Cluster Density (0.0), CDbw Intra-Cluster Density, CDbw Separation
