RazorFish Data Exploration-KMeans
Data exploration utilizing the K-Means clustering algorithm
Performed by: Hilbert G Locklear
K-Means
The k-means algorithm of Hartigan and Wong (1979) is used by default. It is an improvement on the algorithm given by MacQueen (1967).
k-means aims to partition the points into k groups such that the sum of squares from the points to their assigned cluster centers is minimized.
At the minimum, each cluster center is the mean of its Voronoi set, the set of data points that are nearest to that center.
Multiple random restarts are used to help ensure that a stable clustering is produced, if one exists.
k = 1 is allowed; it returns the center of the data set and the within-cluster sum of squares (wss).
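The two properties stated above (each center is the mean of its assigned points, and k = 1 returns the center of the data set) can be checked with a minimal sketch. The data here are synthetic and purely illustrative; the names x, fit, manual_centers, and fit1 are assumptions and not part of the original analysis.

```r
# Synthetic two-dimensional data: three well-separated groups (illustrative only).
set.seed(42)
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 3, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 6, sd = 0.5), ncol = 2)
)

# Ten random restarts; the best solution (lowest tot.withinss) is kept.
fit <- kmeans(x, centers = 3, nstart = 10)

# Recompute each center as the mean of the points assigned to it.
manual_centers <- t(sapply(1:3, function(k) {
  colMeans(x[fit$cluster == k, , drop = FALSE])
}))

# Largest absolute difference between recomputed and fitted centers.
max_diff <- max(abs(manual_centers - fit$centers))

# With k = 1 the single center is the column mean of the whole data set,
# and the within-cluster sum of squares equals the total sum of squares.
fit1 <- kmeans(x, centers = 1)
```

On a converged fit, max_diff should be zero up to rounding error.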
K-Means Function
Part of the stats package. Performs k-means clustering on a data matrix.
Usage:
kmeans(x, centers, iter.max = 10, nstart = 1, algorithm = "Hartigan-Wong", trace = FALSE)
Arguments:
x...a numeric data matrix.
centers...either the number of clusters or a set of distinct initial cluster centers. If a number, a random set of rows in x is chosen as the initial centers.
iter.max...the maximum number of iterations allowed.
nstart...if centers is a number, the number of random sets to be used.
algorithm...the implementation to be used: "Hartigan-Wong", "Lloyd", "Forgy", or "MacQueen".
trace...if TRUE, tracing information on the progress of the algorithm is produced.
kmeans returns an object of class "kmeans", which has a print method and a fitted method:
fitted(object, method = c("centers", "classes"), ...)
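As an illustration of the arguments and the fitted method listed above, the following sketch calls kmeans on synthetic data. The data and the variable names (x, fit, fitted_centers, fitted_classes, resid_ss) are assumptions made for the example only.

```r
# Synthetic data: two groups of 30 points with two features each.
set.seed(1)
x <- rbind(
  matrix(rnorm(60, mean = 0), ncol = 2),
  matrix(rnorm(60, mean = 5), ncol = 2)
)
colnames(x) <- c("f1", "f2")

# centers is a number, so nstart random sets of rows are tried as initial centers.
fit <- kmeans(x, centers = 2, iter.max = 10, nstart = 5,
              algorithm = "Hartigan-Wong")

# method = "centers": one row per observation, holding the center
# of the cluster that the observation was assigned to.
fitted_centers <- fitted(fit, method = "centers")

# method = "classes": the vector of cluster assignments.
fitted_classes <- fitted(fit, method = "classes")

# Squared residuals from the fitted centers add up to tot.withinss.
resid_ss <- sum((x - fitted_centers)^2)
```

Printing fit invokes the print method and shows the cluster sizes, the centers, the assignments, and the within-cluster sums of squares.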
K-Means Function
kmeans returns an object with the following components:
cluster...a vector of integers (1:k) indicating the cluster to which each point is assigned.
centers...a matrix of cluster centers.
totss...the total sum of squares.
withinss...a vector of within-cluster sums of squares; each element is the wss for one cluster.
tot.withinss...the total within-cluster sum of squares, i.e. sum(withinss).
betweenss...the between-cluster sum of squares, i.e. totss - tot.withinss.
size...the number of points in each cluster.
iter...the number of (outer) iterations.
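The relationships among these components can be verified directly. The sketch below uses synthetic data; x, fit, manual_totss, and checks are hypothetical names introduced for the example.

```r
# Synthetic data: three groups of 50 points with three features each.
set.seed(7)
x <- rbind(
  matrix(rnorm(150, mean = 0), ncol = 3),
  matrix(rnorm(150, mean = 4), ncol = 3),
  matrix(rnorm(150, mean = 8), ncol = 3)
)
fit <- kmeans(x, centers = 3, nstart = 5)

# Total sum of squares computed by hand: squared deviations from the column means.
manual_totss <- sum(scale(x, scale = FALSE)^2)

# Each entry is TRUE when the stated relationship holds.
checks <- c(
  one_label_per_point = length(fit$cluster) == nrow(x),
  sizes_sum_to_n      = sum(fit$size) == nrow(x),
  withinss_sums       = isTRUE(all.equal(sum(fit$withinss), fit$tot.withinss)),
  decomposition       = isTRUE(all.equal(fit$tot.withinss + fit$betweenss, fit$totss)),
  totss_by_hand       = isTRUE(all.equal(manual_totss, fit$totss))
)
```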
Data
Kmeans was performed on both the training and testing data sets.
Files: BOWTrainVectorized.txt and BOWTestVectorized.txt, 12,500 objects each.
Each feature vector consists of 2 categorical variables and 7 numeric variables:
Reviewer ID...Identifies the reviewer; may not be unique.
Sentiment Value...Binary value: (1) = positive and (0) = negative.
Total Word Count...Number of all words in the review text.
Stopword Count...Number of words in the review text that are stopwords.
Useful Word Count...Total Word Count - Stopword Count.
Good Adjective Count...Number of words in the review text that are positive adjectives.
Bad Adjective Count...Number of words in the review text that are negative adjectives.
Good Phrase Count...Number of sequential, multiple-word strings in the review text that represent positive sentiment.
Bad Phrase Count...Number of sequential, multiple-word strings in the review text that represent negative sentiment.
Example Feature Vector
R_ID S_value Twrd_count Swrd_count Uwrd_count Good_Adj Bad_Adj Good_Phr Bad_Phr
0001_1 256 20 236 10 2 1
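Because Useful Word Count is defined as Total Word Count minus Stopword Count, the three columns can be checked for consistency before clustering. The sketch below is an assumption about how such a check might look; check_word_counts is a hypothetical helper, and only the first row uses the word counts from the example vector (the other rows are made up for illustration).

```r
# Hypothetical helper: return the row numbers where
# Uwrd_count differs from Twrd_count - Swrd_count.
check_word_counts <- function(df) {
  which(df$Uwrd_count != df$Twrd_count - df$Swrd_count)
}

# Row 1 uses the counts from the example vector (256, 20, 236).
# Rows 2 and 3 are invented; row 3 is deliberately inconsistent.
features <- data.frame(
  Twrd_count = c(256, 120, 80),
  Swrd_count = c(20, 45, 30),
  Uwrd_count = c(236, 75, 55)
)

bad_rows <- check_word_counts(features)
```

An empty result means every row satisfies the definition.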
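Procedure-R script
# kmeans() is in the stats package, which ships with R and is attached by default,
# so install.packages() is not needed
library(stats)
# read the data into a data frame (file assumed to be in the home directory; adjust the path as needed)
Train_Data <- read.delim("~/BOWTrainVectorized.txt", header = TRUE, sep = "\t")
# perform k-means clustering on columns 2 to 9 with k = 3
TrainDataCluster <- kmeans(Train_Data[2:9], 3, iter.max = 3, nstart = 1, algorithm = "Hartigan-Wong", trace = FALSE)
# print the clustering result
TrainDataCluster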
Results
Kmeans Cluster Distribution k = 3
Cluster  Reviews  Sentiment  Mean Word Count  Stop Word Count  Useable Word Count
1        8,762    Mixed      145              72               73
2        2,903    Negative   357              176              181
3        835      Positive   726              359              367
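A summary of this kind can be assembled from the components of a kmeans object. The following is a minimal sketch on synthetic stand-in data, not the review data; the data frame feat, the generated values, and the names summary_table and positive_share are all assumptions, with column names borrowed from the feature vector.

```r
# Synthetic stand-in for the feature table (not the actual review data).
set.seed(3)
n <- 300
feat <- data.frame(
  S_value    = rbinom(n, 1, 0.5),
  Twrd_count = c(rpois(200, 150), rpois(70, 350), rpois(30, 700))
)
feat$Swrd_count <- round(feat$Twrd_count / 2)
feat$Uwrd_count <- feat$Twrd_count - feat$Swrd_count

fit <- kmeans(feat, centers = 3, nstart = 10)

# One row per cluster: number of members and the mean of each feature.
summary_table <- data.frame(
  cluster = seq_along(fit$size),
  members = fit$size,
  round(fit$centers, 1)
)

# Share of positive reviews (S_value = 1) in each cluster,
# which could be used to label a cluster as positive, negative, or mixed.
positive_share <- tapply(feat$S_value, fit$cluster, mean)
```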
Analysis
The clusters are distinct.
The clusters have good cohesion.
High positive sentiment is associated with a high mean word count: the positive-sentiment cluster has the highest mean word count (726).
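One common way to put numbers on separation and cohesion is the ratio betweenss / totss together with the average within-cluster sum of squares per point. The sketch below shows the calculation on synthetic data; it is an assumption about how these claims could be quantified and does not reproduce the analysis above. The names x, fit, explained, and mean_within are hypothetical.

```r
# Synthetic data: two well-separated groups of 100 points each.
set.seed(11)
x <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 8), ncol = 2)
)
fit <- kmeans(x, centers = 2, nstart = 10)

# Fraction of the total sum of squares explained by the clustering;
# values near 1 indicate distinct, well-separated clusters.
explained <- fit$betweenss / fit$totss

# Average within-cluster sum of squares per point, one value per cluster;
# smaller values indicate tighter (more cohesive) clusters.
mean_within <- fit$withinss / fit$size
```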