CLUSTER ANALYSIS

Cluster Analysis
- Cluster analysis is a major technique for classifying a ‘mountain’ of information into manageable, meaningful piles.
- It is a data reduction tool that creates subgroups that are more manageable than individual data points.
- It does not require any prior knowledge about which elements belong to which clusters.
Rahul Chandra

Purpose
- Cluster analysis (CA) is an exploratory data analysis tool for organizing observed data (e.g. people, things, events, brands, companies) into meaningful groups, or clusters, based on combinations of independent variables (IVs).
- The groups are initially unknown. The grouping maximizes the similarity of cases within each cluster while maximizing the dissimilarity between clusters.

Example

CLUSTERED PREFERENCE

Commercial applications
- A chain of radio stores uses cluster analysis to identify three different customer types with varying needs.
- An insurance company uses cluster analysis to classify customers into segments such as “the self-confident customer”, “the price-conscious customer”, etc.
- A producer of copying machines succeeds in classifying industrial customers into “satisfied” and “non-satisfied or quarrelling” customers.

Overview of clustering methods
(Numbers in parentheses refer to the SPSS names listed below the outline.)
Non-overlapping (Exclusive) Methods
  Hierarchical
    Agglomerative
      Linkage Methods
        - Average: Between (1), Within (2), Weighted
        - Single: Ordinary (3), Density, Two stage Density
        - Complete (4)
      Centroid Methods
        - Centroid (5)
        - Median (6)
      Variance Methods
        - Ward (7)
    Divisive
  Non-hierarchical / Partitioning / k-means
    - Sequential threshold
    - Parallel threshold
    - Neural Networks
    - Optimized partitioning (8)
Overlapping Methods
  Non-hierarchical
    - Overlapping k-centroids
    - Overlapping k-means
    - Latent class techniques
    - Fuzzy clustering
    - Q-type Factor analysis (9)
Name in SPSS
  (1) Between-groups linkage
  (2) Within-groups linkage
  (3) Nearest neighbour
  (4) Furthest neighbour
  (5) Centroid clustering
  (6) Median clustering
  (7) Ward’s method
  (8) K-means cluster
  (9) (Factor)
Note: Methods shown in italics on the original slide are available in SPSS. Neural networks require SPSS’ data mining tool Clementine.

CONDUCTING CLUSTER ANALYSIS
- Start with a pool of observations or things that need to be grouped.
- Define the variables on which the clustering will be based.
- Collect data on the selected variables.
- Select a suitable clustering method.
- Measure the distance between respondents using a distance formula.
- Select a suitable linkage rule.

Major Clustering Approaches
- Partitioning approach: Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors.
- Hierarchical approach: Create a hierarchical decomposition of the set of data (or objects) using some criterion.

Partitioning Algorithms
- Partitioning method: Construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized.
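
The criterion on this slide can be illustrated with a small sketch. It assumes that "distance" means the Euclidean distance from each object to the centroid of its own cluster, which the slide does not spell out; the function names and the four sample objects are hypothetical.

```python
def centroid(points):
    """Mean point of a cluster: the average of each coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sum_of_squared_distances(clusters):
    """Total squared Euclidean distance of every object to its cluster centroid.

    Assumption: this is the quantity a partitioning method tries to minimize.
    """
    total = 0.0
    for points in clusters:
        c = centroid(points)
        for p in points:
            total += sum((x - m) ** 2 for x, m in zip(p, c))
    return total


# Two candidate partitions of the same four hypothetical objects into k = 2 clusters.
compact = [[(1.0, 1.0), (1.0, 2.0)], [(8.0, 8.0), (9.0, 8.0)]]
mixed = [[(1.0, 1.0), (8.0, 8.0)], [(1.0, 2.0), (9.0, 8.0)]]

print(sum_of_squared_distances(compact))  # small: objects sit close to their centroids
print(sum_of_squared_distances(mixed))    # large: each cluster mixes distant objects
```

Under this criterion the first partition would be preferred, because its total is far smaller.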

The K-Means Clustering Method
- Given k, the k-means algorithm is implemented as:
  1. Partition the objects into k nonempty subsets.
  2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster).
  3. Assign each object to the cluster with the nearest seed point.
  4. Go back to Step 2; stop when no assignment changes.
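
The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not the SPSS procedure: the round-robin initial partition, the `max_iter` cap, and the rule that an emptied cluster keeps its previous seed point are all assumptions added to make the sketch runnable, and the sample objects are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Mean point of a cluster."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(objects, k, max_iter=100):
    if not 1 <= k <= len(objects):
        raise ValueError("k must be between 1 and the number of objects")
    # Step 1: partition the objects into k nonempty subsets
    # (round-robin assignment is an arbitrary choice for this sketch).
    labels = [i % k for i in range(len(objects))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 2: compute the seed points as centroids of the current clusters.
        for j in range(k):
            members = [p for p, lab in zip(objects, labels) if lab == j]
            if members:  # assumption: an emptied cluster keeps its old seed point
                centroids[j] = centroid(members)
        # Step 3: assign each object to the cluster with the nearest seed point.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in objects
        ]
        # Step 4: stop when no assignment changes, otherwise repeat from Step 2.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids


objects = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.5, 9.5)]
labels, centroids = k_means(objects, k=2)
print(labels)
print(centroids)
```

With these sample objects the first three end up in one cluster and the last three in the other, even though the initial partition mixes them.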

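The K-Means Clustering Method
- Example with K = 2 (figure):
  1. Arbitrarily choose K objects as the initial cluster centers.
  2. Assign each object to the most similar center.
  3. Update the cluster means.
  4. Reassign the objects, then repeat the update and reassignment until the assignments stop changing.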

Hierarchical Clustering
- Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but it needs a termination condition.
- Figure: five objects a, b, c, d, e. Agglomerative clustering runs from Step 0 to Step 4, forming {a, b}, {d, e}, {c, d, e} and finally {a, b, c, d, e}; divisive clustering runs the same steps in reverse, from Step 4 to Step 0.

Agglomerative Nesting
- Merge the nodes that have the least dissimilarity.
- Go on in a non-descending fashion.
- Eventually all nodes belong to the same cluster.
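
A rough sketch of this merge loop is given below. It assumes single linkage (the smallest pairwise distance) as the dissimilarity between clusters and places five hypothetical objects a to e on a line; neither choice comes from the slides.

```python
from itertools import combinations


def single_linkage(c1, c2, pos):
    """Dissimilarity between two clusters: the smallest pairwise distance (assumed rule)."""
    return min(abs(pos[p] - pos[q]) for p in c1 for q in c2)


def agglomerative_nesting(pos):
    """Repeatedly merge the two least dissimilar clusters until one remains.

    Returns the list of merges as (merged cluster, dissimilarity at merge time).
    """
    clusters = [(name,) for name in sorted(pos)]  # every object starts on its own
    merges = []
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(pair[0], pair[1], pos),
        )
        d = single_linkage(a, b, pos)
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        merges.append((a + b, d))
    return merges


# Hypothetical one-dimensional positions for five objects.
positions = {"a": 0.0, "b": 0.5, "c": 5.0, "d": 7.0, "e": 8.0}
merges = agglomerative_nesting(positions)
for cluster, d in merges:
    print(cluster, d)
```

The merge distances printed by the loop never decrease, and the last merge contains all five objects.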

Divisive Analysis
- Inverse order of agglomerative nesting.
- Eventually each node forms a cluster on its own.

Distance Between Objects
- Distances are normally used to measure the similarity or dissimilarity between two data objects.
- The most commonly used measure is the Euclidean distance. The distance between two objects i and j on p dimensions is given as
  $$d(i, j) = \sqrt{\sum_{k=1}^{p} (x_{ik} - x_{jk})^2}$$
  where $x_{ik}$ is the value of object i on dimension k.

Euclidean distance
- Example of the Euclidean distance between two points A = (x_1, y_1) and B = (x_2, y_2) in two-dimensional space (figure: A and B plotted on X and Y axes, with horizontal leg x_2 - x_1 and vertical leg y_2 - y_1):
  $$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
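
As a quick numerical check of the formula, here is a small sketch. The function name and the coordinates of A and B are made up for illustration; the same function works for any number of dimensions.

```python
from math import sqrt


def euclidean_distance(i, j):
    """Euclidean distance between two objects given as equal-length coordinate tuples."""
    if len(i) != len(j):
        raise ValueError("both objects must have the same number of dimensions")
    return sqrt(sum((a - b) ** 2 for a, b in zip(i, j)))


A = (1.0, 2.0)  # hypothetical (x_1, y_1)
B = (4.0, 6.0)  # hypothetical (x_2, y_2)
print(euclidean_distance(A, B))  # sqrt(3^2 + 4^2) = 5.0
```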

Alternatives to Calculate the Distance between Clusters
- Single linkage
- Complete linkage
- Average linkage
- Ward’s method

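Single linkage
- Smallest distance between an element in one cluster and an element in the other, i.e., $\mathrm{dis}(K_i, K_j) = \min_{p,q} \mathrm{dis}(t_{ip}, t_{jq})$, where $t_{ip}$ is an element of cluster $K_i$ and $t_{jq}$ is an element of cluster $K_j$.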

Single linkage
[Figure: points A, B, C, D, E, G, H, with the distances 7.0 and 8.5 marked.]

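Complete linkage
- Largest distance between an element in one cluster and an element in the other, i.e., $\mathrm{dis}(K_i, K_j) = \max_{p,q} \mathrm{dis}(t_{ip}, t_{jq})$, where $t_{ip}$ is an element of cluster $K_i$ and $t_{jq}$ is an element of cluster $K_j$.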

Complete linkage
[Figure: points A, B, C, D, E, G, H, with the distances 10.5 and 9.5 marked.]

Average Linkage
- The distance between two clusters is the average of the distances between all possible pairs of elements, one taken from each of the clusters being combined.

Average linkage
[Figure: points A, B, C, D, E, G, H, with the distances 9.0 and 8.5 marked.]
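
The single, complete and average linkage rules differ only in how they summarize the pairwise distances between two clusters. The sketch below puts them side by side; it assumes Euclidean distance between elements, and the two small clusters are hypothetical (they are not the points from the figures).

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def pairwise_distances(c1, c2):
    """All distances between an element of one cluster and an element of the other."""
    return [dist(p, q) for p in c1 for q in c2]


def single_linkage(c1, c2):
    return min(pairwise_distances(c1, c2))  # smallest pairwise distance


def complete_linkage(c1, c2):
    return max(pairwise_distances(c1, c2))  # largest pairwise distance


def average_linkage(c1, c2):
    d = pairwise_distances(c1, c2)
    return sum(d) / len(d)  # mean over all cross-cluster pairs


# Two hypothetical clusters forming the corners of a 4-by-3 rectangle.
K_i = [(0.0, 0.0), (0.0, 3.0)]
K_j = [(4.0, 0.0), (4.0, 3.0)]

print(single_linkage(K_i, K_j))    # sides of length 4
print(complete_linkage(K_i, K_j))  # diagonals of length 5
print(average_linkage(K_i, K_j))   # mean of 4, 5, 5, 4
```

For any pair of clusters the average linkage value lies between the single and the complete linkage values.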

Ward’s Method
- This method is distinct from the other methods because it uses an analysis of variance approach to evaluate the distances between clusters. In general, this method is very efficient.
- Cluster membership is assessed by calculating the total sum of squared deviations from the mean of a cluster.
- The criterion for fusion is that it should produce the smallest possible increase in the error sum of squares.
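
The fusion criterion can be made concrete with a short sketch. It computes, for each candidate pair of clusters, how much the error sum of squares would grow if the pair were merged, and picks the pair with the smallest increase. The cluster names and coordinates are hypothetical, and this is only an illustration of the criterion, not the SPSS implementation.

```python
from itertools import combinations


def error_sum_of_squares(points):
    """Sum of squared deviations of the points from their cluster mean."""
    n = len(points)
    mean = [sum(coords) / n for coords in zip(*points)]
    return sum((x - m) ** 2 for p in points for x, m in zip(p, mean))


def ward_increase(a, b):
    """Increase in the total error sum of squares caused by fusing clusters a and b."""
    return error_sum_of_squares(a + b) - error_sum_of_squares(a) - error_sum_of_squares(b)


# Three hypothetical clusters lying along a line.
clusters = {
    "A": [(1.0, 1.0), (2.0, 1.0)],
    "B": [(3.0, 1.0)],
    "C": [(10.0, 1.0), (11.0, 1.0)],
}

increases = {
    (m, n): ward_increase(clusters[m], clusters[n])
    for m, n in combinations(sorted(clusters), 2)
}
best_pair = min(increases, key=increases.get)
print(increases)
print(best_pair)  # the pair whose fusion adds the least error
```

Here fusing A and B adds far less error than any fusion involving C, so A and B would be merged first.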

Step 0: Each observation is treated as a separate cluster
[Figure: dendrogram of OBS 1 to OBS 6 against a distance measure axis running from 0.2 to 1.0.]

k-means clustering
- This method of clustering is very different from hierarchical clustering and Ward’s method, which are applied when there is no prior knowledge of how many clusters there may be or what characterizes them.
- K-means clustering is used when you already have hypotheses concerning the number of clusters in your cases or variables.

k-means clustering
- Very frequently, the hierarchical and the k-means techniques are used in succession.
- The former (Ward’s method) is used to get some sense of the possible number of clusters and of the way they merge, as seen from the dendrogram.
- The clustering is then rerun with only the chosen optimum number of clusters in which to place all the cases (k-means clustering).

ANOVA Test in clustering
- The cluster centroids produced by SPSS are essentially the means of the scores of the elements in each cluster.
- We then usually examine the means of each cluster on each dimension using ANOVA to assess how distinct the clusters are.
- Ideally, we would obtain significantly different means for most, if not all, of the dimensions used in the analysis.
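
To show what this check measures, the sketch below computes the one-way ANOVA F statistic for a single dimension, with the clusters as groups. It is only an illustration under assumptions: the scores are invented, the function name is hypothetical, and no p-value is computed (SPSS reports F and its significance in the ANOVA table).

```python
def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA: between-cluster over within-cluster variance."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the cluster means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the cases around their own cluster mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Hypothetical scores on one dimension for the cases of three clusters.
cluster_scores = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
f_value = one_way_anova_f(cluster_scores)
print(f_value)  # (54 / 2) / (6 / 6) = 27.0
```

A large F value on a dimension indicates that the cluster means differ a lot relative to the spread inside the clusters, which is the sense in which the clusters are distinct on that dimension.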

Example
- A keep-fit gym group wants to determine the best grouping of its customers with regard to the type of fitness work programs.
- A hierarchical analysis is run, and three major clusters stand out between the starting point, where everyone is in a separate cluster, and the final single cluster.
- This is then quantified using a k-means cluster analysis with three clusters, which reveals that the means of the different physical fitness measures do indeed distinguish the three clusters (i.e. customers in cluster 1 are high on measure 1, low on measure 2, etc.).

SPSS Output
[Four slides of SPSS output screenshots; the images are not included in the transcript.]