Lecture 26 (Mahout Clustering)


1 Lecture 26 (Mahout Clustering)
CSE 491/891 Lecture 26 (Mahout Clustering)

2 Outline of Lecture
Previous lecture:
- Introduction to Mahout
- Classification: logistic regression
- Collaborative filtering: matrix factorization with ALS
This lecture:
- Clustering using Mahout
- Writing and compiling a Java program with the Mahout API

3 Clustering Algorithms in Mahout
Several clustering algorithms are available:
- K-means
- Other algorithms:
  - Fuzzy clustering
  - Spectral clustering
  - Latent Dirichlet allocation (a probabilistic clustering)

4 Clustering Algorithms in Mahout
To use the clustering algorithms, you must first prepare your input data:
- Data must be stored in HDFS
- Data must be stored as vectors in sequence file format
- Mahout defines a Vector interface (org.apache.mahout.math.Vector) for this purpose
- For applications such as document clustering, each document should be stored as a separate file in HDFS (the file name will be used to identify the cluster assignment after the clustering step has ended)

5 Document Clustering
Suppose we want to cluster 16 scientific articles based on the words that appear in their titles:
Bio1: The sequence of the human genome
Bio2: Gene expression profiling predicts clinical outcome of breast cancer
Bio3: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling
Bio4: Exhaustive matching of the entire protein sequence database
Bio5: Integration of biological networks and gene expression data using Cytoscape
Bio6: Combining biological networks to predict genetic interactions
Bio7: Deciphering the biology of Mycobacterium tuberculosis from the complete genome sequence
Bio8: Quantitative monitoring of gene expression patterns with a complementary DNA microarray
Graph1: Network structure and minimum degree
Graph2: Graph minors: Algorithmic aspects of tree width
Graph3: Adaptation algorithms for binary tree networks
Graph4: Fast robust BSP tree traversal algorithm for ray tracing
Graph5: Approximating maximum clique with a Hopfield network
Graph6: Clique partitions, graph compression and speeding-up algorithms
Graph7: A graph theoretic generalization of the clique concept
Graph8: An introduction to chordal graphs and clique trees

6 Preprocessing
Create a feature vector for each document:
- Each feature corresponds to a word (term) in the document
- Need to preprocess the terms (e.g., convert all characters to lower case, remove punctuation marks, etc.)

Doc  | sequence | gene | profiling | graph | ...
Bio1 | ?        | ?    | ?         | ?     | ...
Bio2 | ?        | ?    | ?         | ?     | ...

7 Preprocessing
Need to assign a weight to each term in a document:
- Binary (0/1): presence/absence of a term in the document
  Limitation: cannot distinguish important words from non-important ones
- Counts: based on term frequency (TF) in the document
  Limitation: unable to handle stopwords (words such as "the", "a", "of" that appear frequently in documents)

8 Preprocessing
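TF-IDF weighting addresses both limitations above: it scales a term's frequency by how rare the term is across the collection. As a sketch of the standard formulation (Mahout's exact variant may differ slightly), for a term t in document d, with N documents in total and df(t) the number of documents containing t:

tfidf(t, d) = tf(t, d) × log(N / df(t))

Stopwords appear in nearly every document, so df(t) ≈ N, log(N / df(t)) ≈ 0, and their weight is suppressed. This is the default weighting scheme used by Mahout's seq2sparse (see the -wt option later).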

9 K-Means Clustering in Mahout
K-means clustering requires the following:
- A SequenceFile containing the input data to be clustered
- A distance measure (default is Euclidean distance)
- The number of clusters
- The maximum number of iterations
Mahout iteratively applies the following steps:
- Map: assigns each point to its nearest centroid
- Reduce: recomputes the locations of the centroids

10 Workflow for Document Clustering
Local directory → upload data to HDFS → HDFS
→ mahout seqdirectory and mahout seq2sparse (document preprocessing)
→ mahout kmeans
→ mahout clusterdump

11 Example: Document Clustering
Step 0: Unpack the data files from the class webpage
hadoop> gzip -d documents.tar.gz
hadoop> tar xf documents.tar

12 Example: Document Clustering
Step 1: Upload the data to HDFS
The command sketched below uploads the documents from your local directory to the HDFS directory /user/yourMSU_ID/documents/input. The next step is then to create feature vectors from the document data and store them in sequence file format; we'll use Mahout's seqdirectory and seq2sparse programs to do this.
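A minimal sketch of the upload, assuming the unpacked files sit in a local folder named documents (the folder name is an assumption; adjust to match your setup):
hadoop> hdfs dfs -mkdir documents
hadoop> hdfs dfs -put documents documents/input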

13 Example: Document Clustering
Step 2a: Preprocess the data
Invoke mahout seqdirectory to transform the data into SequenceFile format. Options:
-i: input directory that contains the document files
-o: output directory to store the sequence files
-ow: overwrite the output directory (if it already exists)
-c: character encoding (UTF-8 for Unicode)
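A sketch of the invocation (the input and output directory names are assumptions for this example):
hadoop> mahout seqdirectory -i documents/input -o documents/seqfiles -ow -c UTF-8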

14 Example: Document Clustering
Step 2a: Preprocess the data
The documents are now stored in SequenceFile format (key: filename, value: content of the file).

15 Example: Document Clustering
You can also view the content of a SequenceFile using the mahout seqdumper command:
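For example (the part file name is an assumption; list the output directory to find the actual name):
hadoop> mahout seqdumper -i documents/seqfiles/part-m-00000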

16 Example: Document Clustering
Step 2b: Preprocess the data
Invoke mahout seq2sparse to create sparse vectors. Options:
-i: input directory
-o: output directory (that will contain the feature vectors)
-ow: overwrite an existing output directory
-nv: named vectors (create identifiers for each data instance)

17 Example: Document Clustering
Step 2b: Preprocess the data (continued)
Other useful options for mahout seq2sparse:
-s: minimum support (frequency) of a term to be considered part of the dictionary (default = 2)
-md: minimum document frequency of a term (default = 1)
-x: maximum document frequency of a term
-ng: maximum size of n-grams (default = 1)
-wt: weighting scheme (e.g., tfidf or tf); default is tfidf
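Putting the options together, a sketch of the full command (directory names are assumptions):
hadoop> mahout seq2sparse -i documents/seqfiles -o documents/vectors -ow -nv -wt tfidf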

18 Example: Document Clustering
Step 2b: Preprocess the data
The seq2sparse program creates the following outputs:
- dictionary.file: mapping of each term to its integer ID
- tf-vectors: term frequency feature vector for each document
- tfidf-vectors: normalized TF-IDF vector for each document
- df: document frequency counts

19 Example: Document Clustering
Let's view the content of the dictionary file.
There are only 12 words! What happened to the other words?
- Terms whose frequency falls below the default minimum support of 2 are dropped from the dictionary (the -s option from the previous slide)
- All words have been converted to lower-case characters
- But there is no stemming (the trailing "s" or "ing" of a word is not removed)

20 Example: Document Clustering
Let's view the tfidf vectors.
Mahout uses VectorWritable to store the feature vectors.
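For instance, to inspect the TF-IDF vectors with seqdumper (the part file name is an assumption; list the directory to find the actual name):
hadoop> mahout seqdumper -i documents/vectors/tfidf-vectors/part-r-00000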

21 Example: Document Clustering
What have we done so far?
- Loaded the document data from a local directory into HDFS
- Preprocessed the documents:
  - Converted them to sequence file format
  - Created sparse TF or TFIDF vectors to represent the documents
Now we're ready to do the clustering, using k-means clustering as an example.

22 Example: Document Clustering
Step 3: Apply k-means clustering to the tfidf vectors. Options:
-i: input directory (can use tf-vectors or tfidf-vectors)
-o: output directory
-k: number of clusters
-x: maximum number of iterations to execute k-means
-c: initial centroids (if k is specified, a random set of points will be selected and written to this directory)
-dm: distance measure
-cl: assigns the input docs to clusters at the end of the process and puts the results in the outputdir/clusteredPoints directory
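A sketch of the invocation for this example, with k = 2 and at most 10 iterations (directory names are assumptions):
hadoop> mahout kmeans -i documents/vectors/tfidf-vectors -c documents/initial-centroids -o documents/kmeans-output -k 2 -x 10 -cl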

23 Example: Document Clustering
Note that the output of each iteration of the k-means algorithm is stored in the output directory. For this document data set, the algorithm converges after 2 iterations (even though we specified a maximum of 10 iterations), so the final clustering is in the directory clusters-2-final.

24 Example: Document Clustering
Step 4: Display the cluster centroids and the top terms for each cluster. Options:
-i: directory that contains the results of the last k-means iteration
-d: dictionary file that maps each integer ID to its corresponding term
-dt: dictionary type
-b: maximum number of characters to display on each line
-n: number of top terms to display for each cluster
-p: directory that contains the cluster ID of each document
-o: output file for storing the clustering results
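A sketch of the clusterdump invocation (paths and the dictionary file name are assumptions; Mahout typically writes the dictionary as dictionary.file-0, stored as a sequence file):
hadoop> mahout clusterdump -i documents/kmeans-output/clusters-2-final -d documents/vectors/dictionary.file-0 -dt sequencefile -n 10 -p documents/kmeans-output/clusteredPoints -o results.txt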

25 ClusterDump Output
Each cluster is listed with its cluster ID, n (the number of points in the cluster), and its centroid vector. This cluster (VL-6) has 11 documents (all 8 bio documents and 3 graph documents). Keywords associated with the cluster are gene, expression, sequence, etc.

26 ClusterDump Output
This cluster (VL-12) contains 5 documents (all belonging to the graph documents). Keywords associated with this cluster include clique, network, graph, and algorithms.

27 Summary
To cluster a collection of documents:
- Store each document as a separate file
- Upload the documents to HDFS
- Apply mahout seqdirectory to convert the documents into sequence file format
- Apply mahout seq2sparse to generate feature vectors (tfidf or tf) and perform other preprocessing
- Apply mahout kmeans to cluster the vectors
- Apply mahout clusterdump to display the clustering results

28 Mahout Clustering
The previous example showed how to apply Mahout's k-means clustering to document data, including some preprocessing steps that are specific to document data. What if we want to cluster other types of data (time series, census data, gene expressions, etc.)? Can we still use Mahout k-means?

29 Mahout Clustering
To cluster other types of data, we need to make sure the input data is stored on HDFS in sequence file format:
- Key: identifier of the data instance
- Value: a VectorWritable object
Example: suppose you have a CSV file; how do we cluster it? You'll need to write a program that converts the file into a sequence file with key = record identifier and value = VectorWritable object.

30 Example: 2-D CSV Data
Suppose you need to cluster the following 2-D data (stored in CSV format). We'll write a Java program to convert the CSV file into a sequence file of VectorWritables.

31 Using Mahout API
You can write a program that converts CSV to sequence file format using the Mahout API. The program takes 2 input parameters:
- Name of the input file to be converted (in the local directory)
- Name of the output file after conversion (to be stored in HDFS)

32 csvLoader.java
import <packages>

public class csvLoader {

  public static List<Vector> loadData(String input) {
    …
  }

  public static void genSequenceFile(List<Vector> points, String output,
      FileSystem fs, Configuration conf) {
    …
  }

  public static void main(String[] args) throws Exception {
    // 1. Check input parameters
    // 2. Load input data
    // 3. Write the data records into a sequence file on HDFS
  }
}

33 csvLoader.java
// Java libraries
import java.io.File;
import java.io.IOException;
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.StringTokenizer;

// Hadoop libraries
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

// Mahout libraries
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

34 Main Program
public static void main(String[] args) throws Exception {
  if (args.length != 2) {
    System.err.println("Usage: java csvLoader <input> <output>");
    System.exit(1);
  }
  // Read the input data from the local directory and store it as a list of Vectors
  List<Vector> vectors = loadData(args[0]);
  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);
  // Write the list of Vectors to HDFS in SequenceFile format
  // (key: record ID, value: VectorWritable)
  genSequenceFile(vectors, args[1], fs, conf);
}

35 Loading Data from CSV File
public static List<Vector> loadData(String input) {
  // 1. Create a List object to store the feature vectors
  // 2. Read each line (record) of the CSV file:
  //    - break the line into tokens using comma as the delimiter
  //    - create a Vector object to store the feature values
  //    - add the Vector object to the list
  // 3. Return the List of vectors
}

36 Function to Read from CSV File
public static List<Vector> loadData(String input) throws IOException {
  List<Vector> records = new ArrayList<Vector>();
  BufferedReader br = new BufferedReader(new FileReader(input));
  String line = "";
  StringTokenizer st = null;
  int i;
  while ((line = br.readLine()) != null) {
    // 1. Parse each line to create a point object
    // 2. Add the point to the data records
  }
  return records;
}

37 Function to Read from CSV File
public static List<Vector> loadData(String input) throws IOException {
  …
  while ((line = br.readLine()) != null) {
    // Split the line on commas and collect the feature values
    st = new StringTokenizer(line, ",");
    ArrayList<Double> weights = new ArrayList<Double>();
    while (st.hasMoreTokens()) {
      weights.add(Double.parseDouble(st.nextToken()));
    }
    // Copy the values into a double array
    double[] point = new double[weights.size()];
    Iterator<Double> iterator = weights.iterator();
    i = 0;
    while (iterator.hasNext()) {
      point[i++] = iterator.next().doubleValue();
    }
    // Wrap the array in a DenseVector and add it to the records
    Vector vec = new DenseVector(point.length);
    vec.assign(point);
    records.add(vec);
  }
  return records;
}

38 Function to Create SequenceFile
public static void genSequenceFile(List<Vector> points, String output,
    FileSystem fs, Configuration conf) throws IOException {
  // 1. Create a SequenceFile writer object
  // 2. For each vector stored in the List:
  //    - send the writer a (key, value) pair, where the key is the record
  //      number and the value is a vector of feature values
  // 3. Close the SequenceFile writer
}

39 Function to Create SequenceFile
public static void genSequenceFile(List<Vector> points, String output,
    FileSystem fs, Configuration conf) throws IOException {
  // Writer with LongWritable keys (record IDs) and VectorWritable values
  SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf,
      new Path(output), LongWritable.class, VectorWritable.class);
  long recNum = 0;
  VectorWritable vec = new VectorWritable();
  for (Vector point : points) {
    vec.set(point);
    writer.append(new LongWritable(recNum++), vec);
  }
  writer.close();
}

40 Compilation
You need to add the following paths (on AWS) to your CLASSPATH variable:
/usr/lib/hadoop/hadoop-common-<version>-amzn-1.jar
/usr/lib/mahout/mahout-hdfs-<version>.jar
/usr/lib/mahout/mahout-math-<version>.jar
(the exact version numbers depend on your installation)
To set the classpath:
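A sketch of setting the classpath (substitute the actual jar file names from your system):
hadoop> export CLASSPATH=$CLASSPATH:/usr/lib/hadoop/hadoop-common-<version>-amzn-1.jar:/usr/lib/mahout/mahout-hdfs-<version>.jar:/usr/lib/mahout/mahout-math-<version>.jar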

41 Execution
We need to merge csvLoader.class with the Mahout job jar file in order to execute the program:
- Compile the code (make sure you've set the classpath; see the previous slide)
- Copy the Mahout job jar file
- Add the class file to the Mahout job jar file
A sketch of these steps is shown below.
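Assuming the job jar lives in /usr/lib/mahout (the path and jar name are assumptions; substitute the actual file name):
hadoop> javac csvLoader.java
hadoop> cp /usr/lib/mahout/mahout-core-<version>-job.jar .
hadoop> jar uf mahout-core-<version>-job.jar csvLoader.class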

42 Execution
Usage: hadoop jar mahout-core*-job.jar csvLoader <input> <output>
Now we can apply k-means to the resulting sequence file.

43 Clustering
- Apply k-means (with k = 2)
- Examine the output
- Dump the output to an ASCII text file (2dresults.txt)
A sketch of the commands follows.
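Assuming the sequence file from csvLoader was written to 2ddata/points (the directory names here are assumptions):
hadoop> mahout kmeans -i 2ddata/points -c 2ddata/initial-centroids -o 2ddata/kmeans-output -k 2 -x 10 -cl
hadoop> mahout clusterdump -i 2ddata/kmeans-output/clusters-*-final -p 2ddata/kmeans-output/clusteredPoints -o 2dresults.txt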

44 Clustering Results

