RazorFish Data Exploration-KMeans

Presentation transcript:

RazorFish Data Exploration-KMeans
Data Exploration utilizing the K-Means Clustering algorithm
Performed by: Hilbert G Locklear

K-Means
The k-means algorithm of Hartigan and Wong (1979) is used by default. It is an improvement of the algorithm given by MacQueen (1967). k-means aims to partition the points into k groups such that the sum of squares from points to the assigned cluster centers is minimized. At the minimum, every cluster center is at the mean of its Voronoi set (the set of data points nearest to that cluster center). Multiple random restarts are used to ensure a stable clustering is produced, if one exists. k = 1 is allowed; it returns the center of the data set and the total within-cluster sum of squares (wss).
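The minimization described above can be sketched as an alternation between assigning points to their nearest center and moving each center to the mean of its assigned points. The following is a minimal Lloyd-style illustration in Python, not the Hartigan-Wong procedure R uses by default:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=100, seed=0):
    """Lloyd-style k-means: alternate between assigning each point to
    its nearest center and moving each center to the mean of its set,
    until the assignments stop changing."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(points, k)]
    assign = None
    for _ in range(iters):
        new_assign = [min(range(k), key=lambda j: dist2(p, centers[j]))
                      for p in points]
        if new_assign == assign:        # converged
            break
        assign = new_assign
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:                 # keep the old center if a cluster empties
                centers[j] = [sum(c) / len(members) for c in zip(*members)]
    return centers, assign

def wss(points, centers, assign):
    """Total within-cluster sum of squares -- the quantity k-means minimizes."""
    return sum(dist2(p, centers[a]) for p, a in zip(points, assign))
```

Running this several times from different random seeds and keeping the solution with the smallest wss mirrors how multiple random restarts guard against poor local minima.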

K-Means Function
Part of the stats package. Performs k-means clustering on a data matrix.
k-means function usage:
kmeans(x, centers, iter.max = 10, nstart = 1, algorithm = "Hartigan-Wong", trace = FALSE)
x...a numeric data matrix.
centers...either the number of clusters or a set of distinct cluster centers. If a number, a random set of distinct rows in x is chosen as the initial centers.
iter.max...the maximum number of iterations allowed.
nstart...if centers is a number, the number of random sets to be used.
algorithm...the implementation to be used: "Hartigan-Wong", "Lloyd", "Forgy", or "MacQueen".
trace...if true, tracing information on the progress of the algorithm is produced.
kmeans returns an object of class kmeans, which has print and fitted methods:
fitted(object, method = c("centers", "classes"), ...)

K-Means Function
kmeans returns the following values:
cluster...a vector of integers (1:k) indicating the cluster to which each point is assigned.
centers...a matrix of cluster centers.
totss...the total sum of squares.
withinss...a vector of within-cluster sums of squares, one element per cluster.
tot.withinss...the total within-cluster sum of squares.
betweenss...the between-cluster sum of squares.
size...the number of points in each cluster.
iter...the number of (outer) iterations.
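These components satisfy the decomposition totss = tot.withinss + betweenss. The sketch below recomputes all three from a set of points and an assignment vector; it is an illustrative Python check of the identity, not R's implementation:

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(c) / len(points) for c in zip(*points)]

def sums_of_squares(points, assign, k):
    """Recompute totss, withinss (one entry per cluster), and betweenss
    for a given clustering of the points."""
    grand = mean(points)                               # overall centroid
    totss = sum(sq_dist(p, grand) for p in points)
    withinss, betweenss = [], 0.0
    for j in range(k):
        members = [p for p, a in zip(points, assign) if a == j]
        c = mean(members)
        withinss.append(sum(sq_dist(p, c) for p in members))
        betweenss += len(members) * sq_dist(c, grand)  # weighted by cluster size
    return totss, withinss, betweenss
```

For any clustering, totss equals sum(withinss) + betweenss, which is why a good clustering is one where betweenss is large relative to totss.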

Data
Kmeans was performed on both the training and testing data sets: BOWTrainVectorized.txt and BOWTestVectorized.txt, 12,500 objects each.
Each feature vector consists of 2 categorical variables and 7 numeric variables:
Reviewer ID...identifies the reviewer; may not be unique.
Sentiment Value...binary value: (1) = positive, (0) = negative.
Total Word Count...number of all words in the review text.
Stopword Count...number of words in the review text that are stopwords.
Useful Word Count...Total Word Count – Stopword Count.
Good Adjective Count...number of words in the review text that are positive adjectives.
Bad Adjective Count...number of words in the review text that are negative adjectives.
Good Phrase Count...number of sequential, multiple-word strings in the review text that represent positive sentiment.
Bad Phrase Count...number of sequential, multiple-word strings in the review text that represent negative sentiment.
Example Feature Vector:
R_ID S_value Twrd_count Swrd_count Uwrd_count Good_Adj Bad_Adj Good_Phr Bad_Phr
0001_1 256 20 236 10 2 1
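The counting behind these features can be sketched as below. The stopword and adjective lists here are placeholder assumptions for illustration, not the lists actually used to build the RazorFish data:

```python
# Placeholder word lists -- stand-ins, not the real RazorFish lexicons.
STOPWORDS = {"the", "a", "an", "is", "it", "and", "of", "to", "was"}
GOOD_ADJ = {"great", "wonderful", "excellent", "good"}
BAD_ADJ = {"terrible", "boring", "bad", "awful"}

def review_features(text):
    """Derive the word-count features of a feature vector from review text."""
    words = text.lower().split()
    total = len(words)
    stop = sum(w in STOPWORDS for w in words)
    return {
        "Twrd_count": total,            # all words in the review
        "Swrd_count": stop,             # words that are stopwords
        "Uwrd_count": total - stop,     # useful words = total - stopwords
        "Good_Adj": sum(w in GOOD_ADJ for w in words),
        "Bad_Adj": sum(w in BAD_ADJ for w in words),
    }
```

Phrase counts would need a matcher over multi-word sequences rather than the single-token membership tests used here.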

Procedure-R script
#the stats package ships with base R, so no install.packages() call is needed
library(stats)
#read the data into a data frame
Train_Data <- read.delim("~BOWTrainVectorized.txt", header = TRUE, sep = "\t")
#perform k-means clustering on the numeric columns (2:9)
TrainDataCluster <- kmeans(Train_Data[2:9], 3, iter.max = 3, nstart = 1, algorithm = "Hartigan-Wong", trace = FALSE)
#print the clustering result
TrainDataCluster

Results
Kmeans Cluster Distribution, k = 3
Cluster 1: 8,762 reviews; mixed sentiment; mean word count: 145; stop word count: 72; useable word count: 73.
Cluster 2: 2,903 reviews; negative sentiment; mean word count: 357; stop word count: 176; useable word count: 181.
Cluster 3: 835 reviews; positive sentiment; mean word count: 726; stop word count: 359; useable word count: 367.
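Per-cluster summaries like the mean word counts above can be recomputed from the assignment vector that kmeans returns. A generic sketch in Python (the values in the usage comment are toy numbers, not the actual review data):

```python
from collections import defaultdict

def cluster_means(values, assign):
    """Mean of one feature within each cluster, keyed by cluster id."""
    sums, counts = defaultdict(float), defaultdict(int)
    for v, a in zip(values, assign):
        sums[a] += v
        counts[a] += 1
    return {a: sums[a] / counts[a] for a in sums}

# e.g. cluster_means(word_counts, cluster_ids) -> {1: ..., 2: ..., 3: ...}
```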

Analysis
The clusters are distinct and show good cohesion. Reviews with positive sentiment tend to have a higher mean word count.