Cluster Analysis. Hal Whitehead, BIOL4062/5062.



What is cluster analysis?
Non-hierarchical cluster analysis
  – K-means
Hierarchical divisive cluster analysis
Hierarchical agglomerative cluster analysis
  – Linkage: single, complete, average, …
  – Cophenetic correlation coefficient
Additive trees
Problems with cluster analyses

Cluster Analysis
“Classification”
  – Maximize within-cluster homogeneity (similar individuals within a cluster)
“The Search for Discontinuities”
  – Discontinuities: places to put divisions between clusters

Discontinuities
Discontinuities are generally present in:
  – taxonomy
  – social organization
  – community ecology??

Types of cluster analysis
Uses: data, dissimilarity, or similarity matrix
Non-hierarchical
  – K-means
Hierarchical
  – Hierarchical divisive (repeated K-means, network methods)
  – Hierarchical agglomerative (single linkage, average linkage, ...)
Additive trees

Non-hierarchical Clustering Techniques: K-Means
Uses a data matrix with Euclidean distances
Maximizes between-cluster variance for a given number of clusters
  – i.e. chooses clusters to maximize the F-ratio in a one-way MANOVA

K-Means
Works iteratively:
1. Choose the number of clusters
2. Assign points to clusters
  – randomly, or using some other clustering technique
3. Move each point to each of the other clusters in turn, keeping the move if it increases between-cluster variance
4. Repeat step 3 until no improvement is possible
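The four steps above can be sketched in plain Python. This is a minimal illustration, not the software used for the slides: the function names, the round-robin starting assignment, and the six data points are all assumptions. Because the total sum of squares is fixed, a move that lowers the within-cluster sum of squares is the same as one that raises between-cluster variance, so the sketch tracks the within-cluster value.

```python
def within_ss(points, labels, k):
    """Total within-cluster sum of squared distances to the cluster means."""
    total = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue
        centre = [sum(col) / len(members) for col in zip(*members)]
        total += sum((a - m) ** 2 for p in members for a, m in zip(p, centre))
    return total


def kmeans_exchange(points, k):
    # Step 2: initial assignment. Round-robin is an assumption standing in
    # for "randomly or some other clustering technique".
    labels = [i % k for i in range(len(points))]
    best = within_ss(points, labels, k)
    improved = True
    while improved:                      # step 4: repeat until nothing improves
        improved = False
        for i in range(len(points)):     # step 3: try each point in each other cluster
            for c in range(k):
                if c == labels[i]:
                    continue
                previous = labels[i]
                labels[i] = c
                trial = within_ss(points, labels, k)
                if trial < best - 1e-12:
                    best = trial         # keep the move
                    improved = True
                else:
                    labels[i] = previous # undo the move
    return labels, best


# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
labels, best = kmeans_exchange(points, k=2)  # step 1: number of clusters chosen up front
print(labels, round(best, 3))
```

A different starting assignment can end at a different, poorer solution, which is why the procedure is often rerun from several starts.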

K-means with three clusters

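[Table: K-means summary statistics with columns Variable, Between SS, df, Within SS, df, F-ratio, and rows X, Y and ** TOTAL **; numeric values not preserved in this transcript.]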
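For a single variable, the between-cluster and within-cluster sums of squares, their degrees of freedom, and the F-ratio can be computed as in the sketch below. The function name and the data are hypothetical; the degrees of freedom follow the usual one-way ANOVA convention (k - 1 between, n - k within), which is assumed here to match the table.

```python
def anova_table(values, labels):
    """Between SS, df, within SS, df and F-ratio for one variable."""
    n = len(values)
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    k = len(groups)
    grand_mean = sum(values) / n
    between_ss = sum(
        len(m) * (sum(m) / len(m) - grand_mean) ** 2 for m in groups.values()
    )
    within_ss = sum(
        (v - sum(m) / len(m)) ** 2 for m in groups.values() for v in m
    )
    df_between, df_within = k - 1, n - k
    f_ratio = (between_ss / df_between) / (within_ss / df_within)
    return between_ss, df_between, within_ss, df_within, f_ratio


# Hypothetical data: six cases assigned to three clusters.
x = [1.0, 2.0, 5.0, 6.0, 9.0, 10.0]
clusters = [0, 0, 1, 1, 2, 2]
result = anova_table(x, clusters)
print(result)
```

A large F-ratio for a variable means the clusters are well separated along it.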

K-means with three clusters
[Output: cluster 1 of 3 contains 4 cases; cluster 2 of 3 contains 4 cases; cluster 3 of 3 contains 2 cases. For each cluster the listing gives the member cases with their distances and, for variables X and Y, the minimum, mean, maximum and standard deviation; numeric values not preserved in this transcript.]

Disadvantages of K-means
Reaches an optimum, but not necessarily the global optimum
Must choose the number of clusters before analysis
  – How many clusters?

Example: Sperm whale codas
Patterned series of clicks:
|     |     |     |     |
  ic1   ic2   ic3   ic4
For 5-click codas: 681 x 4 data set

5-click codas:
|     |     |     |     |
  ic1   ic2   ic3   ic4
93% of variance in the first 2 principal components (PCs)

5-click codas: K-means with 10 clusters

Hierarchical Cluster Analysis
Usually represented by:
  – Dendrogram or tree diagram

Hierarchical Cluster Analysis
  – Hierarchical Divisive Cluster Analysis
  – Hierarchical Agglomerative Cluster Analysis

Hierarchical Divisive Cluster Analysis
Starts with all units in one cluster and successively splits them
  – Successive use of K-means, or some other divisive technique, with n=2 clusters
  – Either: each time, split the cluster with the greatest sum of squared distances
  – Or: split every cluster each time
Hierarchical divisive techniques are good, but rarely used outside network analysis
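The "Either" variant above, repeatedly splitting the cluster with the greatest sum of squared distances using a two-cluster K-means, might look like the sketch below. The function names, the seeding rule (first point and the point farthest from it) and the data are assumptions. The inner splitter is a plain assign-then-recompute-means K-means rather than any particular package's routine.

```python
def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]


def sum_sq(points):
    """Sum of squared distances of the points to their own mean."""
    m = mean(points)
    return sum((a - b) ** 2 for p in points for a, b in zip(p, m))


def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def two_means(points, iterations=20):
    """Split one cluster in two with a simple 2-cluster K-means."""
    c0 = points[0]                                   # assumed seed 1
    c1 = max(points, key=lambda p: dist2(p, c0))     # assumed seed 2
    for _ in range(iterations):
        left = [p for p in points if dist2(p, c0) <= dist2(p, c1)]
        right = [p for p in points if dist2(p, c0) > dist2(p, c1)]
        if not right:        # all points identical: nothing to split
            break
        c0, c1 = mean(left), mean(right)
    return left, right


def divisive(points, n_clusters):
    clusters = [list(points)]
    while len(clusters) < n_clusters:
        target = max(clusters, key=sum_sq)   # cluster with the greatest SS
        clusters.remove(target)
        clusters.extend(two_means(target))
    return clusters


# Hypothetical data: three pairs of nearby points.
data = [(0, 0), (0, 1), (10, 0), (10, 1), (5, 20), (5, 21)]
clusters = divisive(data, 3)
print(clusters)
```

Recording the order of the splits, rather than only the final clusters, gives the hierarchy that a dendrogram would display.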

Hierarchical Agglomerative Cluster Analysis
Starts with each individual unit occupying its own cluster
The clusters are then gradually merged until just one is left
The most common type of cluster analysis

Hierarchical Agglomerative Cluster Analysis
Works on a dissimilarity matrix or a negative similarity matrix
  – may be Euclidean, Penrose, … distances
At each step:
1. There is a symmetric matrix of dissimilarities between clusters
2. The two clusters with the least dissimilarity are merged
3. The dissimilarity between the new (merged) cluster and all others is calculated
Different techniques do step 3 in different ways:
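The three-step loop above can be sketched with the step 3 rule passed in as a function. Everything here is illustrative: the function name, the `combine` parameter, the string concatenation used to name merged clusters, and the dissimilarity values are assumptions. The default `min` is just one possible rule for step 3.

```python
def agglomerate(names, dist, combine=min):
    """Merge clusters until one is left.

    dist maps frozenset({a, b}) to a dissimilarity.
    Returns a list of (cluster_a, cluster_b, height) merges.
    """
    clusters = list(names)
    d = dict(dist)                       # step 1: symmetric dissimilarities
    merges = []
    while len(clusters) > 1:
        # Step 2: find the pair of clusters with the least dissimilarity.
        a, b = min(
            ((x, y) for i, x in enumerate(clusters) for y in clusters[i + 1:]),
            key=lambda pair: d[frozenset(pair)],
        )
        height = d[frozenset((a, b))]
        merged = a + b                   # e.g. "A" + "D" becomes "AD"
        # Step 3: dissimilarity between the merged cluster and all others.
        for other in clusters:
            if other not in (a, b):
                d[frozenset((merged, other))] = combine(
                    d[frozenset((a, other))], d[frozenset((b, other))]
                )
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        merges.append((a, b, height))
    return merges


# Hypothetical dissimilarities among five units.
raw = {
    ("A", "B"): 5, ("A", "C"): 6, ("A", "D"): 1, ("A", "E"): 8,
    ("B", "C"): 2, ("B", "D"): 4, ("B", "E"): 7,
    ("C", "D"): 9, ("C", "E"): 3, ("D", "E"): 10,
}
dist = {frozenset(pair): value for pair, value in raw.items()}
merges = agglomerate("ABCDE", dist)
print(merges)
```

Passing `combine=max` changes only step 3, so the same loop serves for several techniques.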

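Hierarchical Agglomerative Cluster Analysis
[Matrix: lower-triangular dissimilarities among units A, B, C, D, E; values not preserved in this transcript.]
First link A and D
[Matrix: reduced lower-triangular dissimilarities among AD, B, C, E, with the entries d(AD,B), d(AD,C) and d(AD,E) marked "?".]
How to calculate the new dissimilarities?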

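Hierarchical Agglomerative Cluster Analysis: Single Linkage
[Matrices: dissimilarities among A, B, C, D, E, and the reduced matrix among AD, B, C, E after linking A and D; values not preserved in this transcript.]
d(AD,B) = Min{d(A,B), d(D,B)}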

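Hierarchical Agglomerative Cluster Analysis: Complete Linkage
[Matrices: dissimilarities among A, B, C, D, E, and the reduced matrix among AD, B, C, E after linking A and D; values not preserved in this transcript.]
d(AD,B) = Max{d(A,B), d(D,B)}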

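Hierarchical Agglomerative Cluster Analysis: Average Linkage
[Matrices: dissimilarities among A, B, C, D, E, and the reduced matrix among AD, B, C, E after linking A and D; values not preserved in this transcript.]
d(AD,B) = Mean{d(A,B), d(D,B)}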
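The single, complete and average linkage rules differ only in how they summarize the unit-to-unit dissimilarities between two clusters. A small sketch, with a hypothetical function name and made-up values for d(A,B) and d(D,B):

```python
from itertools import product
from statistics import mean


def linkage_dissimilarity(cluster_a, cluster_b, d, rule):
    """Dissimilarity between two clusters from unit-level dissimilarities."""
    pairs = [d[frozenset((x, y))] for x, y in product(cluster_a, cluster_b)]
    summarize = {"single": min, "complete": max, "average": mean}[rule]
    return summarize(pairs)


# Hypothetical values: d(A,B) = 5 and d(D,B) = 4.
d = {frozenset(("A", "B")): 5.0, frozenset(("D", "B")): 4.0}
single = linkage_dissimilarity(["A", "D"], ["B"], d, "single")
complete = linkage_dissimilarity(["A", "D"], ["B"], d, "complete")
average = linkage_dissimilarity(["A", "D"], ["B"], d, "average")
print(single, complete, average)
```

Taking the mean over all unit pairs, as here, means that larger clusters contribute more pairs. In this two-unit case it coincides with the Mean{d(A,B), d(D,B)} rule on the slide.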

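Hierarchical Agglomerative Cluster Analysis: Centroid Clustering
(uses data matrix, or true distance matrix)
[Table: units A to G by variables V1, V2, V3; values not preserved in this transcript.]
V1(AD) = Mean{V1(A), V1(D)}
[Table: units AD, B, C, E, F, G by variables V1, V2, V3 after merging A and D; values not preserved in this transcript.]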
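The centroid update V1(AD) = Mean{V1(A), V1(D)} is applied to every variable when two units are merged. A minimal sketch with made-up values for three of the units (the helper name `centroid` is hypothetical):

```python
def centroid(rows):
    """Variable-by-variable mean of a list of data rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


# Hypothetical data matrix: unit -> [V1, V2, V3].
data = {
    "A": [1.0, 2.0, 3.0],
    "B": [7.0, 1.0, 0.0],
    "D": [3.0, 4.0, 5.0],
}

# Merge A and D: remove both rows and add one row holding their centroid.
merged = dict(data)
row_a = merged.pop("A")
row_d = merged.pop("D")
merged["AD"] = centroid([row_a, row_d])
print(merged)
```

Once a cluster holds more than one unit, its centroid is the mean over all of its member units, so later merges need the original rows or the cluster sizes.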

Hierarchical Agglomerative Cluster Analysis: Ward’s Method
Minimizes within-cluster sum of squares: at each step, merges the pair of clusters that gives the smallest increase
Similar to centroid clustering
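The link to centroid clustering can be made concrete. The increase in within-cluster sum of squares caused by merging two clusters equals n_a * n_b / (n_a + n_b) times the squared distance between their centroids. The sketch below checks that identity on made-up data; the function names are hypothetical.

```python
def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]


def sum_sq(points):
    """Within-cluster sum of squared distances to the centroid."""
    c = centroid(points)
    return sum((x - m) ** 2 for p in points for x, m in zip(p, c))


def ward_increase(a, b):
    """Increase in within-cluster SS if clusters a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    gap = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * gap


# Hypothetical clusters.
a = [(0.0, 0.0), (2.0, 0.0)]
b = [(5.0, 0.0)]
direct = sum_sq(a + b) - sum_sq(a) - sum_sq(b)   # computed from the definition
shortcut = ward_increase(a, b)                   # computed from the centroids
print(direct, shortcut)
```

The size weighting is what separates this criterion from centroid clustering, which compares the centroid distance alone.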

Hierarchical Agglomerative Clustering Techniques
Single Linkage
  – Produces “straggly” clusters
  – Not recommended if much experimental error
  – Used in taxonomy
  – Invariant to monotonic transformations of the dissimilarity measure
Complete Linkage
  – Produces “tight” clusters
  – Not recommended if much experimental error
  – Invariant to monotonic transformations of the dissimilarity measure
Average Linkage, Centroid, Ward’s
  – Most likely to mimic input clusters
  – Not invariant to transformations in dissimilarity measure

Cophenetic Correlation Coefficient (CCC)
Correlation between the original dissimilarity matrix and the dissimilarities inferred from the cluster analysis
CCC >~ 0.8 indicates a good match
CCC <~ 0.8: dendrogram not a good representation
  – probably should not be displayed
Use CCC to choose the best linkage method (highest coefficient)
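As a rough illustration of the coefficient: the dissimilarity "inferred from the cluster analysis" for two units is the height at which they first join the same cluster, and the CCC is the Pearson correlation between those heights and the original values. The function names, the merge-list format and the three-unit data set below are assumptions made for this sketch.

```python
from itertools import combinations, product
from math import sqrt


def cophenetic(merges):
    """Height at which each pair of units first shares a cluster.

    merges is a list of (members_a, members_b, height) tuples.
    """
    coph = {}
    for left, right, height in merges:
        for x, y in product(left, right):
            coph[frozenset((x, y))] = height
    return coph


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ccc(units, original, merges):
    coph = cophenetic(merges)
    pairs = [frozenset(p) for p in combinations(units, 2)]
    return pearson([original[p] for p in pairs], [coph[p] for p in pairs])


# Hypothetical original dissimilarities among three units.
original = {
    frozenset(("A", "B")): 1.0,
    frozenset(("A", "C")): 4.0,
    frozenset(("B", "C")): 5.0,
}
# Single-linkage merges for these data: A+B at 1, then AB+C at min(4, 5) = 4.
merges = [(("A",), ("B",), 1.0), (("A", "B"), ("C",), 4.0)]
coefficient = ccc(["A", "B", "C"], original, merges)
print(round(coefficient, 3))
```

Running the same calculation on the merge lists from different linkage methods gives the comparison the slide recommends.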

[Figure: four panels labelled CCC=0.83, CCC=0.75, CCC=0.77 and CCC=0.80.]

Additive trees
Dendrogram in which path lengths represent dissimilarities
Computation quite complex (cross between agglomerative techniques and multidimensional scaling)
Good when data are measured as dissimilarities
Often used in taxonomy and genetics
[Matrix: dissimilarities among A, B, C, D, E; values not preserved in this transcript.]
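To show what "path lengths represent dissimilarities" means, the sketch below reads dissimilarities off a small tree by summing branch lengths along the path between two units. It does not fit a tree to data, which is the complex part. The tree, its internal nodes `n1` and `n2`, and the branch lengths are all invented for illustration.

```python
# Hypothetical tree stored as undirected branches with lengths.
edges = {
    ("A", "n1"): 1.0,
    ("B", "n1"): 2.0,
    ("n1", "n2"): 3.0,
    ("C", "n2"): 1.5,
    ("D", "n2"): 0.5,
}


def path_length(edges, start, goal):
    """Sum of branch lengths on the unique path between two nodes."""
    graph = {}
    for (u, v), w in edges.items():
        graph.setdefault(u, []).append((v, w))
        graph.setdefault(v, []).append((u, w))
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nxt, w in graph[node]:
            if nxt != parent:            # never walk back along the same branch
                stack.append((nxt, node, dist + w))
    raise ValueError("no path between the two nodes")


d_ab = path_length(edges, "A", "B")   # 1.0 + 2.0
d_ac = path_length(edges, "A", "C")   # 1.0 + 3.0 + 1.5
d_cd = path_length(edges, "C", "D")   # 1.5 + 0.5
print(d_ab, d_ac, d_cd)
```

Unlike the heights in an ordinary dendrogram, these path lengths need not give every pair that joins at the same node the same dissimilarity.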

Problems with Cluster Analysis
Are there really biologically meaningful clusters in the data?
Does the dendrogram represent biological reality (web-of-life versus tree-of-life)?
How many clusters to use?
  – stopping rules are arbitrary
Which method to use?
  – best technique is data-dependent
Dendrograms become messy with many units

Social Structure of 160 northern bottlenose whales

Clustering Techniques
Type | Technique | Use
Non-hierarchical | K-Means | Dividing data sets
Hierarchical divisive | Repeated K-means | Good technique on small data sets
Hierarchical divisive | Network methods | ...
Hierarchical agglomerative | Single linkage | Taxonomy
Hierarchical agglomerative | Complete linkage | Tighter clusters
Hierarchical agglomerative | Average linkage, Centroid, Ward’s | Usually preferred
Hierarchical | Additive trees | Excellent for displaying dissimilarity; taxonomy, genetics