Page 1 Cloud Study: Algorithm Team Mahout Introduction 박성찬 IDS Lab.
Page 2 Apache Mahout
Hadoop-based machine learning library
Provides collaborative filtering (CF), clustering, naïve Bayes classification, frequent pattern mining, SVD, …
Latest version is 0.5
Page 3 Install Mahout
Prerequisites
–Java JDK 1.6 or higher
–Maven
Install
–Download the Mahout source files
–Change to the checked-out directory (the top-level directory that contains pom.xml)
–Run “mvn install”
–Takes about 30 minutes
Page 4 Run Mahout Script
Mahout example script
–Provides the capabilities below
arff.vector: Generate Vectors from an ARFF file or directory
canopy: Canopy clustering
cat: Print a file or resource as the logistic regression models would see it
cleansvd: Cleanup and verification of SVD output
clusterdump: Dump cluster output to text
dirichlet: Dirichlet Clustering
eigencuts: Eigencuts spectral clustering
evaluateFactorization: compute RMSE of a rating matrix factorization against probes in memory
evaluateFactorizationParallel: compute RMSE of a rating matrix factorization against probes
fkmeans: Fuzzy K-means clustering
fpg: Frequent Pattern Growth
itemsimilarity: Compute the item-item-similarities for item-based collaborative filtering
kmeans: K-means clustering
lda: Latent Dirichlet Allocation
ldatopics: LDA Print Topics
lucene.vector: Generate Vectors from a Lucene index
matrixmult: Take the product of two matrices
meanshift: Mean Shift clustering
parallelALS: ALS-WR factorization of a rating matrix
predictFromFactorization: predict preferences from a factorization of a rating matrix
prepare20newsgroups: Reformat 20 newsgroups data
recommenditembased: Compute recommendations using item-based collaborative filtering
rowid: Map SequenceFile to {SequenceFile, SequenceFile}
rowsimilarity: Compute the pairwise similarities of the rows of a matrix
runlogistic: Run a logistic regression model against CSV data
seq2sparse: Sparse Vector generation from Text sequence files
seqdirectory: Generate sequence files (of Text) from a directory
seqdumper: Generic Sequence File dumper
seqwiki: Wikipedia xml dump to sequence file
spectralkmeans: Spectral k-means clustering
splitDataset: split a rating dataset into training and probe parts
ssvd: Stochastic SVD
svd: Lanczos Singular Value Decomposition
testclassifier: Test Bayes Classifier
trainclassifier: Train Bayes Classifier
trainlogistic: Train a logistic regression using stochastic gradient descent
transpose: Take the transpose of a matrix
vectordump: Dump vectors from a sequence file to text
wikipediaDataSetCreator: Splits data set of wikipedia wrt feature like country
wikipediaXMLSplitter: Reads wikipedia data and creates ch
Page 5 Run Mahout Script
Pre-setting
–Hadoop should be running on the same machine
–export JAVA_HOME=/usr/lib/jvm/java-6-sun
–export HADOOP_HOME=$HOME/hadoop
–export HADOOP_CONF_DIR=$HADOOP_HOME/conf
Run
–Change directory to [mahout-home-directory]/bin
–Run “mahout”
Page 6 Run Mahout Script (Lanczos SVD)
Make the input file
–SVD requires a SequenceFile as its input file
Stores key-value pairs
Key: org.apache.hadoop.io.IntWritable
Value: org.apache.mahout.math.VectorWritable
–You have to write your own converter program. Mahout does not provide one!
[Code figure: creating the SequenceFile writer, annotated with key type, value type, and compression type]
Page 7 Run Mahout Script (Lanczos SVD), cont.
[Code figure: a loop, marked LOOP { } LOOPEND, that adds each vector to the SequenceFile]
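The converter from the two slides above can be sketched roughly as follows. This is a minimal sketch, not the original slide code: it assumes Hadoop and Mahout are on the classpath, and the class name MatrixToSequenceFile, the small in-memory matrix, and the choice of CompressionType.NONE are illustrative assumptions. It writes one (IntWritable row index, VectorWritable row) pair per matrix row.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.SequentialAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Hypothetical converter: the class name and the sample data are illustrative only.
public class MatrixToSequenceFile {

    /** Writes each matrix row as one (row index, sparse vector) pair; returns the row count. */
    public static int write(double[][] matrix, Path output, Configuration conf)
            throws IOException {
        FileSystem fs = output.getFileSystem(conf);
        // Arguments: key type, value type, compression type (NONE is an assumption here).
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, output,
                IntWritable.class, VectorWritable.class,
                SequenceFile.CompressionType.NONE);
        int rows = 0;
        try {
            IntWritable key = new IntWritable();
            VectorWritable value = new VectorWritable();
            for (int r = 0; r < matrix.length; r++) {
                // Store only the non-zero entries of the row.
                Vector row = new SequentialAccessSparseVector(matrix[r].length);
                for (int c = 0; c < matrix[r].length; c++) {
                    if (matrix[r][c] != 0.0) {
                        row.set(c, matrix[r][c]);
                    }
                }
                key.set(r);
                value.set(row);
                writer.append(key, value); // adds one vector to the SequenceFile
                rows++;
            }
        } finally {
            writer.close();
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        double[][] matrix = {
            {1.0, 0.0, 2.0},
            {0.0, 3.0, 0.0},
            {4.0, 0.0, 5.0},
        };
        int written = write(matrix, new Path(args[0]), new Configuration());
        System.out.println("rows written: " + written);
    }
}
```

With the Hadoop configuration directory on the classpath, the path given as the first argument should resolve against HDFS; with a plain default Configuration it resolves against the local file system.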
Page 8 Run Mahout Script (Lanczos SVD), cont.
The output file is now on HDFS
–You can check it with the “hadoop fs -ls [path]” command
Page 9 Run Mahout Script (Lanczos SVD), cont.
Run
–Run “mahout svd -i [inputfile_path] -o [outputdir_path] -nr [rowcount] -nc [columncount] -r [rank]”
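To make the placeholders concrete, the small sketch below assembles the same command line from typed parameters. The class name SvdCommand and the sample paths and sizes are hypothetical. The only check added beyond the slide is that the requested rank cannot exceed min(rows, columns), since no matrix has a larger rank. The sketch only prints the command; it does not launch Mahout.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds the argument list for the "mahout svd" command on the slide.
public class SvdCommand {

    public static List<String> build(String input, String output,
                                     int rows, int cols, int rank) {
        if (rank < 1 || rank > Math.min(rows, cols)) {
            throw new IllegalArgumentException(
                    "rank must be between 1 and min(rows, cols)");
        }
        List<String> cmd = new ArrayList<String>();
        cmd.add("mahout");
        cmd.add("svd");
        cmd.add("-i");
        cmd.add(input);                  // input SequenceFile path
        cmd.add("-o");
        cmd.add(output);                 // output directory path
        cmd.add("-nr");
        cmd.add(Integer.toString(rows)); // number of rows
        cmd.add("-nc");
        cmd.add(Integer.toString(cols)); // number of columns
        cmd.add("-r");
        cmd.add(Integer.toString(rank)); // desired rank
        return cmd;
    }

    public static void main(String[] args) {
        // Sample values are placeholders, not taken from the slides.
        List<String> cmd = build("matrix.seq", "svd-out", 1000, 200, 20);
        System.out.println(String.join(" ", cmd));
    }
}
```

The resulting list can be passed to java.lang.ProcessBuilder if the script should be launched from Java, provided the mahout script is reachable from the working directory or PATH.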
Page 10 Run Mahout Script (Lanczos SVD), cont.
Checking the SVD result
To see the contents of the result, you have to write another converter program (the reverse direction)
–Use the SequenceFile.Reader class and the SequentialAccessSparseVector class
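A reverse converter along these lines might look like the sketch below. It assumes Hadoop and Mahout are on the classpath and that the values in the file are VectorWritable; the class name DumpVectors is hypothetical. The key object is created from whatever key class the file declares, and each vector is read through the generic Vector interface, so the sketch does not depend on which concrete vector class (SequentialAccessSparseVector or another) was stored.

```java
import java.io.IOException;
import java.io.PrintStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Hypothetical dumper: prints every (key, vector) pair of a SequenceFile as text.
public class DumpVectors {

    /** Prints one line per pair, "key: v0 v1 ...", and returns the number of pairs read. */
    public static int dump(Path input, Configuration conf, PrintStream out)
            throws IOException {
        FileSystem fs = input.getFileSystem(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, input, conf);
        int count = 0;
        try {
            // Instantiate the key type recorded in the file header.
            Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
            VectorWritable value = new VectorWritable();
            while (reader.next(key, value)) {
                Vector v = value.get();
                StringBuilder line = new StringBuilder(key.toString()).append(':');
                for (int i = 0; i < v.size(); i++) {
                    line.append(' ').append(v.get(i));
                }
                out.println(line);
                count++;
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        int n = dump(new Path(args[0]), new Configuration(), System.out);
        System.err.println("vectors read: " + n);
    }
}
```

Printing every component is only practical for short vectors; for wide sparse rows it would be better to print just the non-zero entries.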
Page 11 Using Mahout API
Documentation is very poor!
JavaDoc is provided, but many functions have no description
If you want to know what the parameters mean, you have to read the source code T_T