RazorFish Data Exploration: K-Modes
Data Exploration utilizing the K-Modes Clustering Algorithm
Performed by: Hilbert G. Locklear

K-Modes

The k-modes algorithm (Huang 1999) is an extension of the k-means algorithm (MacQueen 1967).
k-modes aims to partition the objects into k groups such that the distance from each object to its assigned cluster mode is minimized.
◦ By default, simple-matching distance is used to determine the dissimilarity of two objects: it counts the number of mismatches across all variables.
◦ Alternatively, the distance can be weighted by the frequencies of the categories in the data.
◦ An initial matrix of modes can be supplied.
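To make the simple-matching distance concrete, here is a minimal sketch in R (not from the deck; the function name and example values are made up for illustration):

# simple-matching distance: the number of variables on which two objects disagree
simple_match <- function(a, b) sum(a != b)

x <- c(color = "red", size = "small", shape = "round")
y <- c(color = "red", size = "large", shape = "square")
simple_match(x, y)  # returns 2: the objects differ on size and shape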

K-Modes Function

Part of the klaR package. Performs k-modes clustering on categorical data.
k-modes function usage:
◦ kmodes(data, modes, iter.max = 10, weighted = FALSE)
 data is a matrix or data frame of categorical data. Objects have to be in rows and variables in columns.
 modes is either the number of modes or a set of initial (distinct) cluster modes. If a number is chosen, the initial modes are a random set of distinct rows.
 iter.max is the maximum number of iterations allowed.
 weighted is TRUE or FALSE depending on whether the usual simple-matching distance between objects is used or a frequency-weighted version of this distance.
kmodes returns an object with the following components:
◦ cluster...a vector of integers indicating the cluster to which each object is allocated.
◦ size...the number of objects in each cluster.
◦ modes...a matrix of cluster modes.
◦ withindiff...the within-cluster distance for each cluster.
◦ iterations...the number of iterations the algorithm has run.
◦ weighted...whether a weighted distance was used.
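A minimal, self-contained usage sketch on made-up categorical data (the toy data frame is hypothetical, not from the project):

library(klaR)

set.seed(42)
toy <- data.frame(color = sample(c("red", "blue"), 20, replace = TRUE),
                  size  = sample(c("S", "M", "L"), 20, replace = TRUE))

fit <- kmodes(toy, 2, iter.max = 10, weighted = FALSE)
fit$size   # number of objects in each of the 2 clusters
fit$modes  # one mode (most frequent category per variable) per cluster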

Data Cleaning

Training and testing data sets contain 12,500 records each.
◦ Clustering was performed only on the training set.
Both data sets are organized into three fields:
◦ Reviewer ID Number...a 4- or 5-character numeric string.
◦ Sentiment Value...0 or 1.
◦ Review Text...free text.
Over 2.91 million words of free text in the training set.
The data contain some HTML markup and whitespace padding.
◦ Used a simple Java regular-expression library to remove the markup.
No data extrapolation measures were needed.
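The deck says a Java regular-expression library was used; for illustration, a roughly equivalent sketch in R (the review_text value is a made-up example):

review_text <- "<br /><b>Great</b>   movie  "
clean <- gsub("<[^>]+>", " ", review_text)  # strip HTML tags
clean <- gsub("\\s+", " ", trimws(clean))   # collapse whitespace padding
clean                                       # "Great movie"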

Data

k-modes clustering was performed on the training set.
◦ BOWTrainVectorized.txt
 12,500 objects.
 Each feature vector consists of 2 categorical variables and 7 numeric variables:
 Reviewer ID...identifies the reviewer; may not be unique.
 Sentiment Value...binary value: (1) = positive, (0) = negative.
 Total Word Count...number of all words in the review text.
 Stopword Count...number of words in the review text that are stopwords.
 Useful Word Count...Total Word Count minus Stopword Count.
 Good Adjective Count...number of words in the review text that are positive adjectives.
 Bad Adjective Count...number of words in the review text that are negative adjectives.
 Good Phrase Count...number of sequential multi-word strings in the review text that represent positive sentiment.
 Bad Phrase Count...number of sequential multi-word strings in the review text that represent negative sentiment.
[Example vector shown in the original slides]
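As an illustration of how the count features could be derived from a cleaned review (a sketch: the stopword and adjective lists here are tiny hypothetical stand-ins, not the ones used in the project):

stopwords <- c("the", "a", "and", "it", "was")     # hypothetical stand-in list
good_adj  <- c("great", "wonderful", "excellent")  # hypothetical stand-in list

clean_review <- "it was a great and wonderful movie"
words <- tolower(unlist(strsplit(clean_review, "[^[:alnum:]']+")))

Twrd_count <- length(words)               # 7
Swrd_count <- sum(words %in% stopwords)   # 4 ("it", "was", "a", "and")
Uwrd_count <- Twrd_count - Swrd_count     # 3
Good_Adj   <- sum(words %in% good_adj)    # 2 ("great", "wonderful")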

Data Summary

Feature      Minimum  Median  Mean  Maximum  Sum
S_value      0                      1        6,312
Twrd_count                                   ~2.91 mil
Swrd_count                                   ~1.44 mil
Uwrd_count                                   ~1.47 mil
Good_Adj     0        0       < 1   30       11,043
Bad_Adj      0        0       < 1   15       9,499
Good_Phr     0        0       < 1
Bad_Phr      0        0       < 1   1        201

[Remaining cells garbled in the source transcript]

Procedure: R Script

# install and load required packages
install.packages("plyr")
install.packages("klaR")
library(plyr)
library(klaR)

# read the data into a data frame
Train_Data <- read.delim("~BOWTrainVectorized.txt", header = TRUE, sep = "\t")

# perform k-modes clustering (k = 3) on the eight feature columns
cluster_Train <- kmodes(Train_Data[2:9], 3, iter.max = 3, weighted = FALSE)

# create a frequency table to identify each cluster
freqTable_Train <- table(cluster_Train$cluster)

# create a pie chart of the cluster distribution
pie(freqTable_Train, main = "Cluster Distribution for Training Set")

# append the cluster assignments to the data frame
Train_Data_Mod <- cbind(Train_Data, cluster_Train$cluster)

# create a subset of the data frame for each cluster
train_cluster1 <- subset(Train_Data_Mod, cluster_Train$cluster == 1)
train_cluster2 <- subset(Train_Data_Mod, cluster_Train$cluster == 2)
train_cluster3 <- subset(Train_Data_Mod, cluster_Train$cluster == 3)

# column sums for each cluster
colSums(train_cluster1[, 2:9])
colSums(train_cluster2[, 2:9])
colSums(train_cluster3[, 2:9])

# summary statistics for the training set
colSums(Train_Data[2:9])
summary(Train_Data[2:9])
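A short follow-up sketch (not part of the original script) showing how the components returned by kmodes map onto the results reported on the next slides:

cluster_Train$size        # cluster sizes (Results: Characteristics table)
cluster_Train$withindiff  # within-cluster distance per cluster
cluster_Train$modes       # the mode vector of each cluster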

Results

Characteristics
Cluster  Size   Within-Cluster Distance
1        6,488  24,803
2        3,639  14,087
3        2,373  8,...
Distance metric: simple matching

Aggregates
[Per-cluster aggregates of Good_Adj, Bad_Adj, Good_Phr, and Bad_Phr: values garbled in the source transcript]

Aggregates
Cluster  S_value  Twrd_count  Swrd_count  Uwrd_count
1        4,464    ~1.4 mil    ~720k       ~733k
2        14       ~955k       ~475k       ~479k
3        1,834    ~508k       ~251k       ~257k

Results

Cluster 1
Sentiment: Positive
Mean Twrd_count: 224
Mean Swrd_count: 110
Mean Uwrd_count: 113

Cluster 2
Sentiment: Negative
Mean Twrd_count: 214
Mean Swrd_count: 105
Mean Uwrd_count: 108

Cluster 3
Sentiment: Positive
Mean Twrd_count: 262
Mean Swrd_count: 130
Mean Uwrd_count: 131
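The per-cluster mean word counts above can be reproduced with base R (a sketch assuming the column names shown on the Data Summary slide):

means <- aggregate(Train_Data[, c("Twrd_count", "Swrd_count", "Uwrd_count")],
                   by = list(cluster = cluster_Train$cluster), FUN = mean)
round(means)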

Analysis

Clusters are distinct.
Clusters have good cohesion.
Sentiment homogeneity in cluster 2 is very high.
Sentiment homogeneity in cluster 3 is very high.
Cluster 2 contains an extraordinarily high level of negative sentiment.
Good/bad adjective and phrase counts are poor discriminators across all records.