Apache Mahout
Presented by: Javier Pastorino, Fall 2016
Agenda
  About Mahout
  Installation
  Algorithms
  Examples
    Classification
    Clustering
About Mahout
  Environment for quickly creating scalable, performant machine learning applications.
  Exploits Hadoop for parallel processing.
  Implements three ML techniques:
    Recommendation: uses personal and community information to make a recommendation. Examples: video streaming (Netflix, Hulu), radio (Pandora, Spotify), and others such as eHarmony and Amazon.
    Classification: uses known data to classify new data. Example: antispam systems.
    Clustering: groups data into new categories.
  YouTube: BTI-360
Installation
  Quite easy: download, unzip, ready.
  Prerequisites:
    Java installed
    Apache Hadoop
      Download:
      Quick Setup: common/SingleCluster.html#Installing_Software
  Set up environment variables:
    MAHOUT_HOME=/path/to/mahout
    MAHOUT_LOCAL=true   # for running standalone on your dev machine; unset for running on a cluster
    JAVA_HOME=/usr/lib/jvm/java openjdk-amd64/jre
    HADOOP_HOME=/home/bdlab/hadoop-2.7.3
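A minimal sketch of checking the variables listed above before running Mahout. The function name check_mahout_env is hypothetical and not part of Mahout; it only reports which of the named variables are unset and which mode MAHOUT_LOCAL implies.

```shell
# Hypothetical helper (not part of Mahout): report which of the
# environment variables named on the slide are unset or empty.
check_mahout_env() {
  missing=0
  for name in MAHOUT_HOME JAVA_HOME HADOOP_HOME; do
    # Read the value of the variable whose name is stored in $name
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name"
      missing=1
    fi
  done
  # Per the slide: set MAHOUT_LOCAL for standalone, unset it for a cluster
  if [ -n "${MAHOUT_LOCAL:-}" ]; then
    echo "mode: local"
  else
    echo "mode: cluster"
  fi
  return "$missing"
}
```

Usage: run `check_mahout_env || echo "fix the variables above first"` in the shell that will launch Mahout.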
Algorithms
  Mahout Math-Scala Core Library and Scala DSL
    Mahout Distributed BLAS: Distributed Row Matrix API with R- and Matlab-like operators
    Distributed ALS, SPCA, SSVD, thin-QR
    Similarity Analysis
  Mahout Interactive Shell
    Interactive REPL shell for Spark-optimized Mahout DSL
  Collaborative Filtering with CLI drivers
    User-Based Collaborative Filtering
    Item-Based Collaborative Filtering
    Matrix Factorization with ALS
    Matrix Factorization with ALS on Implicit Feedback
    Weighted Matrix Factorization, SVD++
  Classification with CLI drivers
    Logistic Regression, trained via SGD
    Naive Bayes / Complementary Naive Bayes
    Hidden Markov Models
  Clustering with CLI drivers
    Canopy Clustering
    k-Means Clustering
    Fuzzy k-Means
    Streaming k-Means
    Spectral Clustering
  Dimensionality Reduction
    Singular Value Decomposition
    Lanczos Algorithm
    Stochastic SVD
    PCA (via Stochastic SVD)
    QR Decomposition
Examples - Classify Newsgroups
  Create a working directory (${WD}) for the dataset and all input/output.
  Convert the full 20 newsgroups dataset into a <Text, Text> SequenceFile.
    $ mahout seqdirectory -i ${WD}/20news-all -o ${WD}/20news-seq -ow
  Convert and preprocess the dataset into a <Text, VectorWritable> SequenceFile containing TF-IDF-weighted term frequencies for each document.
    $ mahout seq2sparse -i ${WD}/20news-seq -o ${WD}/20news-vectors -lnorm -nv -wt tfidf
  Split the preprocessed dataset into training and testing sets.
    $ mahout split -i ${WD}/20news-vectors/tfidf-vectors --trainingOutput ${WD}/20news-train-vectors --testOutput ${WD}/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential
  Train the classifier.
    $ mahout trainnb -i ${WD}/20news-train-vectors -el -o ${WD}/model -li ${WD}/labelindex -ow -c
  Test the classifier.
    $ mahout testnb -i ${WD}/20news-test-vectors -m ${WD}/model -l ${WD}/labelindex -ow -o ${WD}/20news-testing -c
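The five steps above can be chained in one script. This is a sketch under stated assumptions: the names run_step and classify_20news, the default working directory, and the DRY_RUN switch are all hypothetical and not part of Mahout. DRY_RUN defaults to 1, so the commands are only printed unless it is set to 0.

```shell
# Sketch only: chains the five steps from the slide.
# WD default and DRY_RUN are assumptions made for illustration.
WD="${WD:-/tmp/mahout-work}"
DRY_RUN="${DRY_RUN:-1}"

run_step() {
  # Print the command; execute it only when DRY_RUN=0.
  printf '%s\n' "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

classify_20news() {
  # Each step runs only if the previous one succeeded.
  run_step mahout seqdirectory -i "$WD/20news-all" -o "$WD/20news-seq" -ow &&
  run_step mahout seq2sparse -i "$WD/20news-seq" -o "$WD/20news-vectors" \
    -lnorm -nv -wt tfidf &&
  run_step mahout split -i "$WD/20news-vectors/tfidf-vectors" \
    --trainingOutput "$WD/20news-train-vectors" \
    --testOutput "$WD/20news-test-vectors" \
    --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential &&
  run_step mahout trainnb -i "$WD/20news-train-vectors" -el -o "$WD/model" \
    -li "$WD/labelindex" -ow -c &&
  run_step mahout testnb -i "$WD/20news-test-vectors" -m "$WD/model" \
    -l "$WD/labelindex" -ow -o "$WD/20news-testing" -c
}
```

Calling `classify_20news` with the defaults prints the five commands for review; `DRY_RUN=0` would run them, provided mahout is on the PATH and the dataset is already in ${WD}/20news-all.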
Examples - Classify Newsgroups, Results:
  Confusion Matrix
  The per-cell counts (columns a to t, "Classified as") did not survive; only the figure after "|" on each row remains, which in Mahout's output is the row total for the class.

    Label  Class                       Total
    a      rec.motorcycles             398
    b      comp.windows.x              395
    c      talk.politics.mideast       376
    d      talk.politics.guns          364
    e      talk.religion.misc          251
    f      rec.autos                   396
    g      rec.sport.baseball          397
    h      rec.sport.hockey            399
    i      comp.sys.mac.hardware       385
    j      sci.space                   394
    k      comp.sys.ibm.pc.hardware    392
    l      talk.politics.misc          310
    m      comp.graphics               389
    n      sci.electronics             393
    o      soc.religion.christian      398
    p      sci.med                     396
    q      sci.crypt                   396
    r      alt.atheism                 319
    s      misc.forsale                390
    t      comp.os.ms-windows.misc     394

  Statistics (values not preserved): Kappa, Accuracy (%), Reliability (%), Reliability (standard deviation)
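As a sketch of how the Accuracy and Kappa entries in the Statistics block relate to a confusion matrix: accuracy is the diagonal sum divided by the total, and kappa corrects that for chance agreement. This assumes the Kappa reported is Cohen's kappa; the function name confusion_stats is hypothetical, and the 2x2 matrix is made up for illustration, not taken from the slide.

```shell
# Sketch: accuracy and Cohen's kappa from a square confusion matrix
# (rows = actual class, columns = predicted class) read from stdin.
confusion_stats() {
  awk '
    {
      for (j = 1; j <= NF; j++) {
        row[NR] += $j; col[j] += $j; total += $j
        if (j == NR) diag += $j      # correctly classified documents
      }
    }
    END {
      po = diag / total              # observed agreement (accuracy)
      for (k = 1; k <= NR; k++)      # agreement expected by chance
        pe += row[k] * col[k] / (total * total)
      printf "accuracy=%.4f kappa=%.4f\n", po, (po - pe) / (1 - pe)
    }'
}

# Made-up 2x2 matrix, not taken from the slide
printf '40 10\n5 45\n' | confusion_stats
```

For the made-up matrix this prints `accuracy=0.8500 kappa=0.7000`: 85 of 100 documents lie on the diagonal, and chance agreement is 0.5, so kappa = (0.85 - 0.5) / (1 - 0.5).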
Example – Clustering Reuters
  Select the clustering type: kmeans, fuzzykmeans, lda, or streamingkmeans.
  Parse data: run org.apache.lucene.benchmark.utils.ExtractReuters to generate reuters-out from reuters-sgm (the downloaded archive).
    $MAHOUT org.apache.lucene.benchmark.utils.ExtractReuters ${WD}/reuters-sgm ${WD}/reuters-out
  Run seqdirectory to convert reuters-out to SequenceFile format.
    $MAHOUT seqdirectory -i ${WD}/reuters-out -o ${WD}/reuters-out-seqdir -c UTF-8 -chunk 64 -xm sequential
  Run seq2sparse to convert the SequenceFiles to sparse vector format.
    $MAHOUT seq2sparse -i ${WD}/reuters-out-seqdir/ -o ${WD}/reuters-out-seqdir-sparse-kmeans --maxDFPercent 85 --namedVector
  Run k-means with 20 clusters.
    $MAHOUT kmeans -i ${WD}/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ -c ${WD}/reuters-kmeans-clusters -o ${WD}/reuters-kmeans -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -x 10 -k 20 -ow --clustering
  Run clusterdump to show the results.
    $MAHOUT clusterdump -i `$DFS -ls -d ${WD}/reuters-kmeans/clusters-*-final | awk '{print $8}'` -o ${WD}/reuters-kmeans/clusterdump -d ${WD}/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -sp 0 --pointsDir ${WD}/reuters-kmeans/clusteredPoints
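The backtick expression in the clusterdump step lists the final clusters directory and keeps only its path. A small sketch of that extraction, assuming the usual eight-field layout of a Hadoop file listing; the function name final_clusters_dir and the sample listing line (user, group, date, path) are made up for illustration.

```shell
# Sketch: how the backtick expression in the clusterdump step picks the
# final clusters directory out of a directory listing.
final_clusters_dir() {
  # Field 8 of each listing line is the path
  awk '{print $8}'
}

# Made-up sample line: permissions, replication, owner, group, size,
# date, time, path (8 whitespace-separated fields)
sample='drwxr-xr-x   - bdlab supergroup          0 2016-10-01 12:00 /work/reuters-kmeans/clusters-3-final'
printf '%s\n' "$sample" | final_clusters_dir
```

With the sample line this prints `/work/reuters-kmeans/clusters-3-final`, which is the value that would be passed to clusterdump as its -i argument.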
Example – Clustering Reuters, Results:
  Cluster (output truncated):
    :{"identifier":"VL-5965","r":[],"c":[{"10":2.643},{"11":2.714},{"16.2":7.612},{"16.9":7.545},{"17":3.
  Top Terms (weights not preserved): ionics, ion, nonrecurring, vs, backlog, net
  Weight : [props - optional]: Point: (entries not preserved)
  Evaluation metrics (values not preserved unless shown):
    Inter-Cluster Density
    Intra-Cluster Density
    CDbw Inter-Cluster Density: 0.0
    CDbw Intra-Cluster Density
    CDbw Separation