RazorFish Data Exploration-KModes
Data Exploration utilizing the K-Modes Clustering algorithm
Performed By: Hilbert G Locklear
K-Modes
The k-modes algorithm (Huang 1999) is an extension of the k-means algorithm by MacQueen (1967).
k-modes aims to partition the objects into k groups such that the distance from the objects to their assigned cluster modes is minimized.
By default, simple-matching distance is used to determine the dissimilarity of two objects.
◦ The simple-matching distance is computed by counting the number of mismatches across all variables.
◦ Alternatively, the distance can be weighted by the frequencies of the categories in the data.
◦ An initial matrix of modes can be supplied.
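The simple-matching distance and the notion of a mode described above can be sketched in base R. The helper names `simple_matching` and `cat_mode` and the toy records are illustrative assumptions, not functions from klaR.

```r
# Simple-matching distance: the number of variables on which two objects differ.
# simple_matching() is a hypothetical helper, not part of klaR.
simple_matching <- function(x, y) {
  stopifnot(length(x) == length(y))
  sum(x != y)
}

# Mode of a categorical vector: its most frequent category
# (ties resolve to the first category in table() order).
cat_mode <- function(v) {
  counts <- table(v)
  names(counts)[which.max(counts)]
}

a <- c(colour = "red",  size = "small", shape = "round")
b <- c(colour = "blue", size = "small", shape = "square")

d_ab <- simple_matching(a, b)                      # a and b differ on colour and shape
m    <- cat_mode(c("red", "blue", "red", "green")) # most frequent category
```

k-modes uses the per-variable mode in place of the mean used by k-means, which is why it applies to categorical data.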
K-Modes Function
Part of the klaR package.
Performs k-modes clustering on categorical data.
k-modes function usage
◦ kmodes(data, modes, iter.max = 10, weighted = FALSE)
data is a matrix or data frame of categorical data. Objects have to be in rows and variables in columns.
modes is either the number of modes or a set of distinct initial cluster modes. If a number is given, the initial modes are a random set of distinct rows of data.
iter.max is the maximum number of iterations allowed.
weighted is TRUE or FALSE: FALSE uses the usual simple-matching distance between objects, TRUE uses a weighted version of this distance.
kmodes returns the following values:
◦ cluster...a vector of integers indicating the cluster to which each object is allocated.
◦ size...the number of objects in each cluster.
◦ modes...a matrix of cluster modes.
◦ withindiff...the within-cluster distance for each cluster.
◦ iterations...the number of iterations the algorithm has run.
◦ weighted...whether weighted distances were used.
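A minimal usage sketch of the call described above, on a made-up data frame. The toy data and the seed are assumptions for illustration; the call is wrapped in a check so the snippet still runs where klaR is not installed.

```r
# Toy categorical data (illustrative only): objects in rows, variables in columns.
toy <- data.frame(
  colour = c("red", "red", "red", "blue", "blue", "blue"),
  size   = c("small", "small", "small", "large", "large", "large"),
  shape  = c("round", "round", "square", "square", "square", "round"),
  stringsAsFactors = TRUE
)

if (requireNamespace("klaR", quietly = TRUE)) {
  set.seed(42)  # the initial modes are a random set of distinct rows
  fit <- klaR::kmodes(toy, modes = 2, iter.max = 10, weighted = FALSE)

  print(fit$cluster)     # cluster index for each object
  print(fit$size)        # number of objects in each cluster
  print(fit$modes)       # one row of modal categories per cluster
  print(fit$withindiff)  # within-cluster distance for each cluster
  print(fit$iterations)  # iterations actually run
}
```

Because the starting modes are drawn at random, fixing the seed is what makes a run repeatable.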
Data Cleaning
Training and Testing data sets contain 12,500 records each.
◦ Clustering was performed only on the training set.
Training and Testing data sets are organized into three fields.
◦ Reviewer ID Number...string of 4 or 5 numeric characters.
◦ Sentiment Value...0 or 1.
◦ Review Text...free text.
Over 2.91 million words of free text in the training set.
Data contains some HTML markup and whitespace padding.
◦ Used a simple Java regular-expression library to remove markup.
No data extrapolation measures needed.
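The cleaning itself was done with a Java regular-expression library. As a rough equivalent in R, the sketch below strips tags and whitespace padding. The helper name `clean_review`, the patterns, and the sample string are assumptions, not the patterns actually used.

```r
# Hypothetical helper: remove HTML markup and whitespace padding from review text.
clean_review <- function(text) {
  text <- gsub("<[^>]+>", " ", text)        # replace tags such as <br /> with a space
  text <- gsub("[[:space:]]+", " ", text)   # collapse runs of whitespace
  trimws(text)                              # drop leading and trailing padding
}

raw     <- "  A great film.<br /><br />Would   watch again.  "
cleaned <- clean_review(raw)
print(cleaned)
```

Replacing tags with a space instead of deleting them keeps the words on either side of a tag from being fused together, which would distort the word counts.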
Data
K-modes was performed on the training set.
◦ BOWTrainVectorized.txt
12,500 objects.
Each feature vector consists of 2 categorical variables and 7 numeric variables.
Reviewer ID...Identifies the reviewer; may not be unique.
Sentiment Value...Binary value: 1 = positive, 0 = negative.
Total Word Count...Number of all words in the review text.
Stopword Count...Number of words in the review text that are stopwords.
Useful Word Count...Total Word Count - Stopword Count.
Good Adjective Count...Number of words in the review text that are positive adjectives.
Bad Adjective Count...Number of words in the review text that are negative adjectives.
Good Phrase Count...Number of words in the review text that are sequential, multiple-word strings which represent positive sentiment.
Bad Phrase Count...Number of words in the review text that are sequential, multiple-word strings which represent negative sentiment.
Example Vector
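A sketch of how the word-count features defined above could be derived from one review, in base R. The word lists, the tokenisation rule, and the helper name `review_features` are assumptions, since the lists used in the study are not given. The phrase counts are left out.

```r
# Hypothetical word lists (illustrative only).
stopwords <- c("the", "a", "is", "and", "was", "it")
good_adj  <- c("great", "wonderful")
bad_adj   <- c("boring", "awful")

# Build the numeric part of a feature vector from one review text.
review_features <- function(text) {
  words   <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words   <- words[nzchar(words)]          # drop empty tokens
  n_total <- length(words)
  n_stop  <- sum(words %in% stopwords)
  c(Twrd_count = n_total,
    Swrd_count = n_stop,
    Uwrd_count = n_total - n_stop,         # Total Word Count - Stopword Count
    Good_Adj   = sum(words %in% good_adj),
    Bad_Adj    = sum(words %in% bad_adj))
}

fv <- review_features("The plot was boring and the acting is awful")
print(fv)
```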
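Data Summary
Feature | Minimum | Median | Mean | Maximum | Sum
S_value | | | | | ,312
Twrd_count | | | | ,460 | ~2.91mil
Swrd_count | | | | ,097 | ~1.44mil
Uwrd_count | | | | ,363 | ~1.47mil
Good_Adj | 0 | 0 | < 1 | 30 | 11,043
Bad_Adj | 0 | 0 | < 1 | 15 | 9,499
Good_Phr | 0 | 0 | < | |
Bad_Phr | 0 | 0 | < 1 | 1 | 201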
Procedure-R script
# install and load the required packages
install.packages("plyr")
install.packages("klaR")
library(plyr)
library(klaR)

# read the data into a data frame
Train_Data <- read.delim("~BOWTrainVectorized.txt", header = TRUE, sep = "\t")

# perform kmodes clustering on columns 2 to 9 (Reviewer ID excluded)
cluster_Train <- kmodes(Train_Data[2:9], 3, iter.max = 3, weighted = FALSE)

# create a frequency table to identify each cluster
freqTable_Train <- table(cluster_Train$cluster)

# create a pie chart of the cluster distribution
pie(freqTable_Train, main = "Cluster Distribution for Training Set")

# append the cluster information to the data frame
Train_Data_Mod <- cbind(Train_Data, cluster_Train$cluster)

# create a subset of the data frame for each cluster
train_cluster1 <- subset(Train_Data_Mod, cluster_Train$cluster == 1)
train_cluster2 <- subset(Train_Data_Mod, cluster_Train$cluster == 2)
train_cluster3 <- subset(Train_Data_Mod, cluster_Train$cluster == 3)

# create cluster sum information for each cluster
colSums(train_cluster1[, 2:9])
colSums(train_cluster2[, 2:9])
colSums(train_cluster3[, 2:9])

# create summary statistics for the training set
colSums(Train_Data[2:9])
summary(Train_Data[2:9])
Results
Characteristics
Cluster | Size | Within Cluster Distance
1 | 6,488 | 24,803
2 | 3,639 | 14,087
3 | 2,373 | 8,
Distance metric
Aggregates
Cluster | Good_Adj | Bad_Adj | Good_Phr | Bad_Phr
16, ,9767, ,
Aggregates
Cluster | S_value | Twrd_count | Swrd_count | Uwrd_count
1 | 4,464 | ~1.4m | ~720k | ~733k
2 | 14 | ~955k | ~475k | ~479k
3 | 1,834 | ~508k | ~251k | ~257k
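One way to read a within-cluster distance is as the sum of simple-matching distances from each member of a cluster to that cluster's mode. The base R sketch below computes it on made-up data under that assumption. The helper names and the toy assignment are hypothetical, and it is not claimed to reproduce klaR's internals.

```r
# Toy data and an assumed cluster assignment (illustrative only).
toy <- data.frame(
  colour = c("red", "red", "blue", "blue"),
  size   = c("small", "large", "large", "large"),
  stringsAsFactors = FALSE
)
cluster <- c(1, 1, 2, 2)

# Mode of each variable within one cluster
# (ties resolve to the first category in table() order).
cluster_mode <- function(df) {
  vapply(df, function(v) names(which.max(table(v))), character(1))
}

# Sum of simple-matching distances from every member to the cluster mode.
within_diff <- function(df) {
  m <- cluster_mode(df)
  mismatches <- sapply(names(df), function(col) df[[col]] != m[[col]])
  sum(mismatches)
}

wd <- sapply(split(toy, cluster), within_diff)
print(wd)
```

Larger clusters accumulate more mismatches, so raw values of this kind are easier to compare after dividing by cluster size.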
Results
Cluster 1
◦ Sentiment: Positive
◦ Mean Twrd_count: 224
◦ Mean Swrd_count: 110
◦ Mean Uwrd_count: 113
Cluster 3
◦ Sentiment: Positive
◦ Mean Twrd_count: 262
◦ Mean Swrd_count: 130
◦ Mean Uwrd_count: 131
Cluster 2
◦ Sentiment: Negative
◦ Mean Twrd_count: 214
◦ Mean Swrd_count: 105
◦ Mean Uwrd_count: 108
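Per-cluster means like those above can be computed with base R's `aggregate()`. The data frame below is a made-up miniature of a table with word counts and a cluster label. Its values and the column name `cluster` are assumptions.

```r
# Hypothetical miniature of the clustered training data.
df <- data.frame(
  Twrd_count = c(200, 240, 100, 120),
  Swrd_count = c(100, 120,  40,  60),
  Uwrd_count = c(100, 120,  60,  60),
  cluster    = c(1, 1, 2, 2)
)

# Mean of each word-count column within each cluster.
cluster_means <- aggregate(
  cbind(Twrd_count, Swrd_count, Uwrd_count) ~ cluster,
  data = df, FUN = mean
)
print(cluster_means)
```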
Analysis
Distinct clusters.
Clusters have good cohesion.
Sentiment homogeneity in cluster 2 is very high.
Sentiment homogeneity in cluster 3 is very high.
Cluster 2 contains an extraordinarily high level of negative sentiment.
Good-Bad Adjective and Phrase result is poor among all records.
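Sentiment homogeneity can be put in numbers from the cluster sizes and S_value sums reported in the Results tables. The sketch below assumes those are read as sizes 6,488 / 3,639 / 2,373 and S_value sums 4,464 / 14 / 1,834. It also assumes homogeneity means the share of the majority sentiment, a definition chosen here for illustration.

```r
# Cluster sizes and S_value sums (number of positive reviews) per cluster.
size     <- c(6488, 3639, 2373)
positive <- c(4464, 14, 1834)

share_positive <- positive / size

# Homogeneity taken as the share of the majority sentiment in each cluster
# (an assumed definition, not one stated in the slides).
homogeneity <- pmax(share_positive, 1 - share_positive)
print(round(homogeneity, 3))
```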