Applied Multivariate Statistics Cluster Analysis Fall 2015 Week 9.

Cluster Analysis
Classification according to certain characteristics
Widely used technique
–Target marketing of groups
–Biological classification
–Classifying a number of observations into a smaller number of more manageable groups with minimal loss of information

Cluster Analysis
Used to identify groups or clusters of homogeneous individuals
Observations in each cluster are similar to each other: homogeneous within clusters
Observations from one cluster are different from those in other clusters: heterogeneous between clusters

Cluster Analysis - An Example.. 1
Income and Education are the clustering variables
[Figure: scatter plot of observations A to F on the axes Income and Education]

Cluster Analysis - An Example.. 2
Use squared Euclidean distances and the centroid to measure distance from a cluster
Similarity matrix based on squared distances, with entries such as:
$(5-6)^2 + (5-6)^2$
$(15-5)^2 + (14-5)^2$
$(25-30)^2 + (20-19)^2$
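
The three entries above can be checked with a few lines of Python. This is only a sketch: the coordinates are read off the slide's arithmetic, and which letter (A, B, C, E, F) owns which coordinate pair is an assumption, since the plot itself did not survive.

```python
# (Income, Education) pairs taken from the arithmetic on the slide.
# The letter-to-point mapping is an assumption made for illustration.
points = {"A": (5, 5), "B": (6, 6), "C": (15, 14), "E": (25, 20), "F": (30, 19)}


def sq_euclidean(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


d_ab = sq_euclidean(points["A"], points["B"])  # (5-6)^2 + (5-6)^2
d_ca = sq_euclidean(points["C"], points["A"])  # (15-5)^2 + (14-5)^2
d_ef = sq_euclidean(points["E"], points["F"])  # (25-30)^2 + (20-19)^2
print(d_ab, d_ca, d_ef)
```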

Cluster Analysis - An Example.. 3
The observations A-B and C-D are close together, and the 1st cluster could be formed by combining either pair. Choose A-B
The centroid for this cluster is (5.5, 5.5). Use this to recalculate the similarity matrix
Repeat the process, combining the next closest pair of observations (or clusters)
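
A minimal sketch of this step, assuming A = (5, 5), B = (6, 6) and C = (15, 14) as suggested by the arithmetic on the previous slide: merge A and B, take the centroid, and measure the merged cluster's squared distance to another observation.

```python
def sq_euclidean(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(members):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(members)
    return tuple(sum(p[i] for p in members) / n for i in range(len(members[0])))


# Assumed coordinates for A, B and C (not stated explicitly in the slides).
ab_centroid = centroid([(5, 5), (6, 6)])
dist_ab_to_c = sq_euclidean(ab_centroid, (15, 14))
print(ab_centroid, dist_ab_to_c)
```

The centroid agrees with the (5.5, 5.5) quoted on the slide; the distance to C is one entry of the recalculated similarity matrix.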

Cluster Analysis - An Example.. 4
Agglomeration schedule and cluster solution
Step | Min. Dist² | Obs. combined | Clusters | No. of clusters
0 | | | (A)(B)(C)(D)(E)(F) | 6
1 | … | A-B | (A-B)(C)(D)(E)(F) | 5
2 | … | C-D | (A-B)(C-D)(E)(F) | 4
3 | … | E-F | (A-B)(C-D)(E-F) | 3
4 | … | (C-D-E-F) | (A-B)(C-D-E-F) | 2
5 | … | ALL | (A-B-C-D-E-F) | 1
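
The schedule can be sketched as a small centroid-method loop. The coordinates of A, B, C, E and F are inferred from the earlier arithmetic; D = (16, 15) is purely hypothetical (the slides do not give it), so the distances this produces are illustrative and are not the values from the original table.

```python
from itertools import combinations


def sq_euclidean(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(members):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(members)
    return tuple(sum(p[i] for p in members) / n for i in range(len(members[0])))


def centroid_agglomeration(points):
    """Return rows of (step, min_dist2, merged_label, n_clusters)."""
    clusters = {(name,): [xy] for name, xy in points.items()}
    schedule = []
    step = 0
    while len(clusters) > 1:
        step += 1
        # Pick the pair of clusters whose centroids are closest.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: sq_euclidean(
                centroid(clusters[pair[0]]), centroid(clusters[pair[1]])
            ),
        )
        dist2 = sq_euclidean(centroid(clusters[a]), centroid(clusters[b]))
        clusters[a + b] = clusters.pop(a) + clusters.pop(b)
        schedule.append((step, dist2, "-".join(a + b), len(clusters)))
    return schedule


# D is a hypothetical placeholder; the other points are inferred, not quoted.
points = {"A": (5, 5), "B": (6, 6), "C": (15, 14),
          "D": (16, 15), "E": (25, 20), "F": (30, 19)}
schedule = centroid_agglomeration(points)
for row in schedule:
    print(row)
```

With these assumed points the merge order matches the slide: A-B, C-D, E-F, then C-D-E-F, then all six observations.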

Cluster Analysis - An Example.. 5
Graphical representation of the hierarchical clustering process: the dendrogram
[Figure: dendrogram of observations A to F plotted against Distance]

Cluster Analysis - An Example.. 6
Determining the ‘best’ number of clusters
Fairly subjective decision
Can use a rapid increase in the agglomeration index (Dist²) as a guide
For this example, there is a large increase between Step 3 (3 clusters) and Step 4 (2 clusters)
–Suggests 3 clusters are suitable for these observations
The dendrogram also indicates 3 as a suitable number of clusters

Stage 1.. The Problem
Objectives of Cluster Analysis
Taxonomical description
–Forming a taxonomy: an empirical classification
Data simplification
–Grouping similar observations to simplify subsequent analyses
Relationship identification
–Identifying relationships between observations

Stage 2.. Design the Analysis
Selection of the clustering variables
–The derived clusters reflect the inherent structure only as defined by the clustering variate
–Use theoretical, conceptual and practical considerations to select the clustering variate
Outliers
–Are they errors, or are some groups under-represented?
–Can use profile diagrams to find them, but this is tedious

Stage 2.. Design the Analysis.. 2
Observation Profile

Stage 2.. Design the Analysis.. 3
Measures of similarity
–Correlation
–Distance (most common)
–Association (applicable with non-metric data)
[Figure: two panels titled Distance and Correlation]

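Stage 2.. Design the Analysis.. 3
Measures of Similarity - Distance
[Figure: points A, O and B plotted against the axes $X_1$ and $X_2$, with coordinates $X_{1A}$, $X_{1B}$, $X_{2A}$, $X_{2B}$ marked]
Euclidean distance $= \sqrt{(A-O)^2 + (B-O)^2} = \sqrt{(X_{1A}-X_{1B})^2 + (X_{2A}-X_{2B})^2}$
Block distance $= |A-O| + |B-O| = |X_{1A}-X_{1B}| + |X_{2A}-X_{2B}|$
General form: $\left[ \sum_{i=1}^{P} |X_{iA}-X_{iB}|^{n} \right]^{1/k}$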

P is the number of variables
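
The distance measures above can be sketched in a few lines. The function names are illustrative, and the general form is implemented here as the usual Minkowski distance with a single exponent n (that is, k = n), which is an assumption about how the slide's n and k are meant to be used.

```python
import math


def euclidean(a, b):
    """Straight-line distance over all P variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def block(a, b):
    """City-block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def minkowski(a, b, n):
    """General form with exponent n and root 1/n (assumes k = n)."""
    return sum(abs(x - y) ** n for x, y in zip(a, b)) ** (1 / n)


# Hypothetical observations measured on P = 2 variables.
obs_a, obs_b = (0, 0), (3, 4)
print(euclidean(obs_a, obs_b), block(obs_a, obs_b), minkowski(obs_a, obs_b, 2))
```

Setting n = 1 gives the block distance and n = 2 gives the Euclidean distance.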

Stage 2.. Design the Analysis.. 4
Standardizing the data
–Scaling alters the Euclidean distances and the relative importance of each characteristic (time measured in hours is 60 times less influential than time measured in minutes)
–Whenever conceptually possible, variables should be standardized, i.e. expressed as the number of standard deviations from the mean
–Multicollinearity implicitly increases the weights of the multicollinear characteristics
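
A small sketch of why standardizing removes the unit problem, using made-up durations: the raw gaps differ by a factor of 60 between hours and minutes, while the standardized values are the same in either unit.

```python
from statistics import mean, stdev


def standardize(values):
    """Express each value as the number of sample s.d.'s from the mean."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical durations, recorded once in hours and once in minutes.
hours = [1.0, 2.5, 4.0]
minutes = [h * 60 for h in hours]

raw_gap_hours = hours[1] - hours[0]
raw_gap_minutes = minutes[1] - minutes[0]
z_hours = standardize(hours)
z_minutes = standardize(minutes)
print(raw_gap_hours, raw_gap_minutes, z_hours, z_minutes)
```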

Standardizing the Data
CASE-WISE STANDARDIZATION
VARIABLE-WISE STANDARDIZATION
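
The two options can be contrasted on a toy data matrix. This sketch assumes rows are cases (respondents) and columns are variables, and that both options use z-scores; the ratings are invented for illustration.

```python
from statistics import mean, stdev


def zscores(values):
    """Centre on the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def variable_wise(rows):
    """Standardize each column (variable) across all cases."""
    columns = [zscores(col) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


def case_wise(rows):
    """Standardize each row (case) across its own variables."""
    return [zscores(list(row)) for row in rows]


# Hypothetical ratings: 3 respondents by 3 variables.
data = [[1, 2, 3], [4, 4, 7], [2, 6, 5]]
by_variable = variable_wise(data)
by_case = case_wise(data)
print(by_variable)
print(by_case)
```

Variable-wise standardization puts the variables on a common scale; case-wise standardization removes each respondent's overall level, which can help when respondents differ in how they use a rating scale.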

Stage 3.. Assumptions of Cluster Analysis
No important assumptions
It is mostly mathematical analysis
Statistical foundations are weak

Stage 4.. Deriving the Clusters
2 main clustering algorithms
–Hierarchical
–Non-hierarchical
Hierarchical algorithms (illustrated by the earlier example)
–Agglomerative or divisive procedures
–Several measures of the distance between clusters

Stage 4.. Deriving the Clusters Measuring the Distance Between Clusters

Centroid: distance between the cluster centroids
Single linkage (nearest neighbor): minimum distance between members of the separate clusters
Complete linkage (farthest neighbor): maximum distance between members of the separate clusters
Ward’s method: the within-cluster sum of squares is minimized over all clusters
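
The first three measures can be sketched directly from their definitions. The two clusters below are hypothetical, and squared Euclidean distance is used to stay consistent with the earlier example.

```python
from itertools import product


def sq_euclidean(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(members):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(members)
    return tuple(sum(p[i] for p in members) / n for i in range(len(members[0])))


def centroid_distance(c1, c2):
    """Distance between the two cluster centroids."""
    return sq_euclidean(centroid(c1), centroid(c2))


def single_linkage(c1, c2):
    """Nearest neighbor: smallest distance over all cross-cluster pairs."""
    return min(sq_euclidean(p, q) for p, q in product(c1, c2))


def complete_linkage(c1, c2):
    """Farthest neighbor: largest distance over all cross-cluster pairs."""
    return max(sq_euclidean(p, q) for p, q in product(c1, c2))


# Two hypothetical clusters of two observations each.
left = [(0, 0), (0, 2)]
right = [(3, 0), (5, 2)]
print(centroid_distance(left, right),
      single_linkage(left, right),
      complete_linkage(left, right))
```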

Stage 4.. Measuring the Distance Between Clusters - Centroid

Stage 4.. Measuring the Distance Between Clusters - Single Linkage

Stage 4. Measuring the Distance Between Clusters - Complete Linkage

Stage 4. Measuring the Distance Between Clusters - Ward’s Method
$\min\{(SS_1 + SS_2),\ (SS_3 + SS_4)\}$
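
One way to read the expression above is: among the candidate merges, choose the one that leaves the smallest total within-cluster sum of squares. The sketch below does that for hypothetical one-point clusters; the helper names are illustrative.

```python
from itertools import combinations


def within_ss(members):
    """Sum of squared deviations of a cluster's points from its centroid."""
    n = len(members)
    center = [sum(p[i] for p in members) / n for i in range(len(members[0]))]
    return sum((x - c) ** 2 for p in members for x, c in zip(p, center))


def total_ss_after_merge(clusters, i, j):
    """Total within-cluster SS if clusters i and j were combined."""
    merged = clusters[i] + clusters[j]
    rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
    return within_ss(merged) + sum(within_ss(c) for c in rest)


def ward_best_merge(clusters):
    """Indices of the pair whose merge gives the smallest total SS."""
    return min(
        combinations(range(len(clusters)), 2),
        key=lambda ij: total_ss_after_merge(clusters, ij[0], ij[1]),
    )


# Four hypothetical single-observation clusters on a line.
clusters = [[(0, 0)], [(1, 0)], [(5, 0)], [(9, 0)]]
best = ward_best_merge(clusters)
print(best, total_ss_after_merge(clusters, *best))
```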

Stage 4.. Deriving the Clusters
Non-Hierarchical Clustering
Start by selecting ‘cluster seeds’ as cluster centres
–Sequential threshold: cluster all observations within a specified distance of the seed, then add extra seeds
–Parallel threshold: select several seeds and assign objects within the threshold distance to the closest seed
–Optimization: allows observations to be moved to a cluster that has become closer
Selection of cluster seeds alters the clusters obtained
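
A rough sketch of two of these procedures. The parallel-threshold function follows the description above; the optimization step is written as a k-means-style reassignment loop, which is one common way to do it and is an assumption rather than the procedure the slides necessarily had in mind. Points, seeds and the threshold are hypothetical.

```python
def sq_euclidean(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_parallel(points, seeds, threshold2):
    """Assign each point to its closest seed if it lies within the
    (squared) threshold distance; otherwise leave it unassigned (None)."""
    labels = []
    for p in points:
        dist2, k = min((sq_euclidean(p, s), k) for k, s in enumerate(seeds))
        labels.append(k if dist2 <= threshold2 else None)
    return labels


def optimize(points, seeds, max_iter=100):
    """Move each point to its nearest centre, then recompute the centres,
    repeating until no assignment changes (k-means style)."""
    centres = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centres)), key=lambda k: sq_euclidean(p, centres[k]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for k in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centre if a cluster is empty
                centres[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centres


# Hypothetical observations and seeds.
points = [(5, 5), (6, 6), (15, 14), (25, 20), (30, 19)]
seeds = [(5, 5), (30, 19)]
threshold_labels = assign_parallel(points, seeds, threshold2=50)
final_labels, final_centres = optimize(points, seeds)
print(threshold_labels, final_labels, final_centres)
```

Re-running with different seeds is a quick way to see how much the solution depends on them.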

Stage 4.. Deriving the Clusters Non-Hierarchical Clustering- Stage 1

Stage 4.. Deriving the Clusters Non-Hierarchical Clustering- Stage 2

Stage 4.. Deriving the Clusters
Choosing between algorithms
Problems with hierarchical methods
–Influenced by outliers
–Not amenable to analyzing very large samples (> 500)
Problems with non-hierarchical methods
–Solution depends on the choice of seeds
Perhaps a combination of methods gives the best result
–Use a hierarchical method to find suitable seeds and then a non-hierarchical method

Stage 5. Interpreting the Clusters
Examine each cluster to assign a label describing the nature of the cluster
Interpreting the clusters can confirm prior theories
Can check a preconceived typology

Stage 6. Validation
Ensure practical significance of clusters
Use profile analysis to examine the results
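
A simple form of profile analysis is to compare the mean of each clustering variable across clusters. The sketch below assumes that reading; the observations and cluster labels are hypothetical.

```python
from collections import defaultdict


def cluster_profiles(rows, labels):
    """Mean of every variable within each cluster, keyed by cluster label."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return {
        label: tuple(sum(col) / len(col) for col in zip(*members))
        for label, members in groups.items()
    }


# Hypothetical (Income, Education) observations and their cluster labels.
rows = [(5, 5), (6, 6), (25, 20), (30, 19)]
labels = [0, 0, 1, 1]
profiles = cluster_profiles(rows, labels)
print(profiles)
```

Clusters whose profiles barely differ on the variables that matter are unlikely to be practically significant.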

Summary
Cluster analysis is an art more than a science!
Different measures and different algorithms can affect the results
Final selection of the clusters is based on both objective and subjective considerations

LIFE STYLE SEGMENTATION AN APPLICATION

VARIABLES

DESCRIPTIVES

SEGMENT PROFILES NO STANDARDIZATION

EXAMPLE

1-Very suitable 2-Suitable 3-Not suitable 4-Not suitable at all