Lecture 6 Statistical Lecture ─ Cluster Analysis

Cluster Analysis
–Grouping similar objects to produce a classification
–Useful when, a priori, the structure of the data is unknown
–Involves assessing the relative distances between points

Clustering Algorithms
Partitioning : divide the data set into k clusters, where k must be specified beforehand (e.g., k-means).

Clustering Algorithms
Hierarchical :
–Agglomerative methods : start with the situation where each object forms its own little cluster, and then successively merge clusters until only one large cluster remains
–Divisive methods : start by considering the whole data set as one cluster, and then split up clusters until each object is separate

Caution
Most users are interested in the main structure of their data, consisting of a few large clusters. When forming larger clusters, agglomerative methods may make wrong decisions in the early steps, and once one step is wrong, everything built on it is wrong. For divisive methods, the larger clusters are determined first, so they are less likely to suffer from errors in earlier steps.

Agglomerative Hierarchical Clustering Procedure
(1) Each observation begins in a cluster by itself
(2) The two closest clusters are merged to form a new cluster that replaces the two old clusters
(3) Repeat (2) until only one cluster is left
The various clustering methods differ in how the distance between two clusters is computed, as the sketch below illustrates for one choice.
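A minimal sketch of this procedure in Python, assuming squared Euclidean distance and a pluggable cluster-to-cluster distance; the names `agglomerate` and `single_linkage` are illustrative, not from the lecture:

```python
# Minimal sketch of the agglomerative procedure; the cluster distance is
# pluggable, and the pair search is O(n^2) per merge (fine for small n).
import numpy as np

def single_linkage(X, ck, cl):
    # minimum pairwise squared Euclidean distance between the two clusters
    return min(np.sum((X[i] - X[j]) ** 2) for i in ck for j in cl)

def agglomerate(X, cluster_dist=single_linkage):
    clusters = [[i] for i in range(len(X))]   # (1) each observation is its own cluster
    history = []
    while len(clusters) > 1:
        # (2) find and merge the two closest clusters
        a, b = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: cluster_dist(X, clusters[p[0]], clusters[p[1]]))
        history.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)] + [merged]
    return history                            # (3) repeated until one cluster is left

# e.g. agglomerate(np.array([[0., 0.], [0., 1.], [5., 5.], [5., 6.]]))
```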

Remarks
–For coordinate data, variables with large variances tend to have more effect on the resulting clusters than variables with small variances
–Scaling or transforming the variables may be needed
–Standardization (standardizing the variables to mean 0 and standard deviation 1) or principal components is useful but not always appropriate
–Outliers should be removed before analysis

Remarks (cont.)
–Nonlinear transformations of the variables may change the number of population clusters and should therefore be approached with caution
–For most applications, the variables should be transformed so that equal differences are of equal practical importance
–An interval scale of measurement is required if raw data are used as input; ordinal or ranked coordinate data are generally not appropriate

Notation
–$n$ : number of observations
–$v$ : number of variables, if data are coordinates
–$G$ : number of clusters at any given level of the hierarchy
–$x_i$ : the $i$-th observation
–$C_k$ : the $k$-th cluster, a subset of $\{1, 2, \dots, n\}$
–$N_k$ : number of observations in $C_k$

Notation (cont.)
–$\bar{x}$ : sample mean vector
–$\bar{x}_k$ : mean vector for cluster $C_k$
–$\|x\|$ : Euclidean length of the vector $x$, that is, the square root of the sum of the squares of the elements of $x$
–$T = \sum_{i=1}^{n} \|x_i - \bar{x}\|^2$
–$W_k = \sum_{i \in C_k} \|x_i - \bar{x}_k\|^2$

Notation (cont.)
–$P_G = \sum_j W_j$, where the summation is over the $G$ clusters at the $G$-th level of the hierarchy
–$B_{kl} = W_m - W_k - W_l$ if $C_m = C_k \cup C_l$
–$d(x, y)$ : any distance or dissimilarity measure between observations or vectors $x$ and $y$
–$D_{kl}$ : any distance or dissimilarity measure between clusters $C_k$ and $C_l$

Clustering Method ─ Average Linkage
The distance between two clusters is defined by
$$D_{kl} = \frac{1}{N_k N_l} \sum_{i \in C_k} \sum_{j \in C_l} d(x_i, x_j)$$
If $d(x, y) = \|x - y\|^2$, then
$$D_{kl} = \|\bar{x}_k - \bar{x}_l\|^2 + \frac{W_k}{N_k} + \frac{W_l}{N_l}$$
The combinatorial formula is
$$D_{jm} = \frac{N_k D_{jk} + N_l D_{jl}}{N_m} \quad \text{if } C_m = C_k \cup C_l$$

Average Linkage The distance between clusters is the average distance between pairs of observations, one in each cluster It tends to join clusters with small variance and is slightly biased toward producing clusters with the same variance
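As a quick numerical check of the combinatorial formula above, assuming $d(x, y) = \|x - y\|^2$, the following sketch (illustrative array and function names) verifies that the update reproduces the directly computed distance:

```python
# Verifies that the combinatorial update for average linkage reproduces the
# directly computed cluster distance when C_m = C_k ∪ C_l.
import numpy as np

def avg_link(A, B):
    # average of d(x, y) = ||x - y||^2 over all pairs, one point from each cluster
    return np.mean([np.sum((a - b) ** 2) for a in A for b in B])

rng = np.random.default_rng(0)
J = rng.normal(size=(3, 2))
K = rng.normal(size=(4, 2))
L = rng.normal(size=(5, 2))

M = np.vstack([K, L])                                  # C_m = C_k ∪ C_l
direct = avg_link(J, M)
update = (len(K) * avg_link(J, K) + len(L) * avg_link(J, L)) / len(M)
assert np.isclose(direct, update)
```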

Centroid Method
The distance between two clusters is defined by
$$D_{kl} = \|\bar{x}_k - \bar{x}_l\|^2$$
If $d(x, y) = \|x - y\|^2$, then the combinatorial formula is
$$D_{jm} = \frac{N_k D_{jk} + N_l D_{jl}}{N_m} - \frac{N_k N_l D_{kl}}{N_m^2}$$

Centroid Method The distance between two clusters is defined as the squared Euclidean distance between their centroids or means It is more robust to outliers than most other hierarchical methods but in other respects may not perform as well as Ward’s method or average linkage

Complete Linkage
The distance between two clusters is defined by
$$D_{kl} = \max_{i \in C_k} \max_{j \in C_l} d(x_i, x_j)$$
The combinatorial formula is
$$D_{jm} = \max(D_{jk}, D_{jl})$$

Complete Linkage
The distance between two clusters is the maximum distance between an observation in one cluster and an observation in the other cluster. It is strongly biased toward producing clusters with roughly equal diameters and can be severely distorted by moderate outliers.

Single Linkage
The distance between two clusters is defined by
$$D_{kl} = \min_{i \in C_k} \min_{j \in C_l} d(x_i, x_j)$$
The combinatorial formula is
$$D_{jm} = \min(D_{jk}, D_{jl})$$

Single Linkage The distance between two clusters is the minimum distance between an observation in one cluster and an observation in the other cluster It sacrifices performance in the recovery of compact clusters in return for the ability to detect elongated and irregular clusters
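The complete- and single-linkage definitions above reduce to a max and a min over the same pairwise distances; a small sketch, assuming Euclidean $d$ and illustrative names:

```python
# Complete linkage takes the maximum pairwise distance, single linkage the minimum.
import numpy as np

def pairwise(A, B):
    # all distances d(x_i, x_j) with x_i in cluster A and x_j in cluster B
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def complete_linkage(A, B):
    return pairwise(A, B).max()

def single_linkage(A, B):
    return pairwise(A, B).min()
```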

Ward’s Minimum-Variance Method
The distance between two clusters is defined by
$$D_{kl} = B_{kl} = \frac{\|\bar{x}_k - \bar{x}_l\|^2}{\frac{1}{N_k} + \frac{1}{N_l}}$$
If $d(x, y) = \|x - y\|^2$, then the combinatorial formula is
$$D_{jm} = \frac{(N_j + N_k) D_{jk} + (N_j + N_l) D_{jl} - N_j D_{kl}}{N_j + N_m}$$

Ward’s Minimum-Variance Method
–The distance between two clusters is the ANOVA sum of squares between the two clusters, added up over all the variables
–It tends to join clusters with a small number of observations
–It is strongly biased toward producing clusters with roughly the same number of observations
–It is also very sensitive to outliers
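A small sketch of Ward's distance under squared Euclidean coordinates; the assertion checks the equivalent reading as the growth $W_m - W_k - W_l$ in within-cluster sum of squares (illustrative names and data):

```python
# Ward's distance D_kl = ||xbar_k - xbar_l||^2 / (1/N_k + 1/N_l), which equals
# the increase in within-cluster sum of squares caused by merging the clusters.
import numpy as np

def ward_distance(A, B):
    diff = A.mean(axis=0) - B.mean(axis=0)
    return np.sum(diff ** 2) / (1.0 / len(A) + 1.0 / len(B))

def within_ss(A):
    return np.sum((A - A.mean(axis=0)) ** 2)   # W_k for one cluster

A = np.array([[0., 0.], [1., 0.]])
B = np.array([[4., 3.], [5., 3.], [6., 4.]])
assert np.isclose(ward_distance(A, B),
                  within_ss(np.vstack([A, B])) - within_ss(A) - within_ss(B))
```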

Assumptions for Ward’s Minimum-Variance Method
–Multivariate normal mixture
–Equal spherical covariance matrices
–Equal sampling probabilities

Remarks
–Single linkage tends to lead to the formation of long, straggly clusters
–Average linkage, complete linkage, and Ward’s method often find spherical clusters even when the data appear to contain clusters of other shapes

McQuitty’s Similarity Analysis
The combinatorial formula is
$$D_{jm} = \frac{D_{jk} + D_{jl}}{2}$$
Median Method
If $d(x, y) = \|x - y\|^2$, then the combinatorial formula is
$$D_{jm} = \frac{D_{jk} + D_{jl}}{2} - \frac{D_{kl}}{4}$$

K-th Nearest Neighbor Method
–Prespecify $k$
–Let $r_k(x)$ be the distance from point $x$ to the $k$-th nearest observation
–Consider a closed sphere centered at $x$ with radius $r_k(x)$, say $S_k(x)$

K-th Nearest Neighbor Method
The estimated density at $x$ is defined by
$$\hat{f}(x) = \frac{k/n}{\text{volume of } S_k(x)}$$
For any two observations $x_i$ and $x_j$,
$$d^*(x_i, x_j) = \begin{cases} \dfrac{1}{2}\left(\dfrac{1}{\hat{f}(x_i)} + \dfrac{1}{\hat{f}(x_j)}\right) & \text{if } d(x_i, x_j) \le \max\!\big(r_k(x_i), r_k(x_j)\big) \\ \infty & \text{otherwise} \end{cases}$$
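A sketch of the density estimate, assuming Euclidean distance so that the volume of $S_k(x)$ is that of a $v$-dimensional ball; `knn_density` is an illustrative name:

```python
# Estimated density f_hat(x_i) = (k/n) / volume(S_k(x_i)) at every observation.
import numpy as np
from math import gamma, pi

def knn_density(X, k):
    n, v = X.shape
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    r_k = np.sort(dists, axis=1)[:, k]            # column 0 is the self-distance
    unit_ball = pi ** (v / 2) / gamma(v / 2 + 1)  # volume of the unit v-ball
    return (k / n) / (unit_ball * r_k ** v)
```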

K-Means Algorithm
–It is intended for use with large data sets, of approximately 100 or more observations
–With small data sets, the results may be highly sensitive to the order of the observations in the data set
–It combines an effective method for finding initial clusters with a standard iterative algorithm for minimizing the sum of squared distances from the cluster means

K-Means Algorithm
(1) Specify the number of clusters, say $k$
(2) A set of $k$ points called cluster seeds is selected as a first guess of the means of the $k$ clusters
(3) Each observation is assigned to the nearest seed to form temporary clusters
(4) The seeds are then replaced by the means of the temporary clusters
(5) Steps (3) and (4) are repeated until no further changes occur in the clusters
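A minimal sketch of this loop, taking the first $k$ observations as seeds rather than the seed-selection rules on the next slides (illustrative names; assumes no cluster ever becomes empty):

```python
# K-means: assign to the nearest seed, replace seeds by cluster means, repeat.
import numpy as np

def k_means(X, k, max_iter=100):
    seeds = X[:k].copy()                           # first guess of the k means
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)              # temporary clusters
        new_seeds = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_seeds, seeds):          # no further change: stop
            break
        seeds = new_seeds
    return labels, seeds
```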

Cluster Seeds
–Select the first complete (no missing values) observation as the first seed
–The next complete observation that is separated from the first seed by at least the prespecified distance (the radius) becomes the second seed
–Later observations are selected as new seeds if they are separated from all previous seeds by at least the radius, as long as the maximum number of seeds is not exceeded

Cluster Seeds If an observation is complete but fails to qualify as a new seed, two tests can be made to see if the observation can replace one of the old seeds

Cluster Seeds (cont.)
An old seed is replaced if the distance between the observation and the closest seed is greater than the minimum distance between seeds. The seed chosen for replacement is one of the two seeds that are closest to each other, namely whichever of the two has the shorter distance to the closest of the remaining seeds when the other seed is replaced by the current observation.

Cluster Seeds (cont.)
If the observation fails the first test for seed replacement, a second test is made: the observation replaces the nearest seed if the smallest distance from the observation to all seeds other than the nearest one is greater than the shortest distance from the nearest seed to all other seeds. If this test also fails, go on to the next observation. A simplified sketch of the seed-selection pass follows.
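A simplified sketch of the basic seed-selection pass, assuming complete data and a chosen radius; the two replacement tests above are omitted for brevity, and `select_seeds` is an illustrative name:

```python
# Accept an observation as a new seed only if it lies at least `radius`
# away from every seed chosen so far, up to a maximum of k seeds.
import numpy as np

def select_seeds(X, k, radius):
    seeds = [X[0]]                                 # first observation = first seed
    for x in X[1:]:
        if len(seeds) == k:
            break
        if all(np.linalg.norm(x - s) >= radius for s in seeds):
            seeds.append(x)
    return np.array(seeds)
```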

Dissimilarity Matrices
An $n \times n$ dissimilarity matrix $\big(d(i, j)\big)$, where $d(i, j) = d(j, i)$ measures the “difference” or dissimilarity between the objects $i$ and $j$.

Dissimilarity Matrices
$d$ usually satisfies
–$d(i, i) = 0$
–$d(i, j) \ge 0$
–$d(i, j) = d(j, i)$
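A sketch of assembling such a matrix from any measure $d$ with these properties (`dissimilarity_matrix` is an illustrative name):

```python
# Builds the n x n dissimilarity matrix; the zero diagonal and symmetry hold
# by construction, matching the three properties above.
import numpy as np

def dissimilarity_matrix(X, d):
    n = len(X)
    D = np.zeros((n, n))                           # d(i, i) = 0
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = d(X[i], X[j])      # d(i, j) = d(j, i)
    return D
```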

Dissimilarity
Interval-scaled variables : continuous measurements on a (roughly) linear scale (temperature, height, weight, etc.)

Dissimilarity (cont.)
–The choice of measurement units strongly affects the resulting clustering
–The variable with the largest dispersion will have the largest impact on the clustering
–If all variables are considered equally important, the data need to be standardized first

Standardization
–Mean absolute deviation (robust): $s_f = \frac{1}{n} \sum_{i=1}^{n} |x_{if} - m_f|$, where $m_f$ is the mean of variable $f$
–Median absolute deviation (robust): $s_f = \mathrm{median}_i\, |x_{if} - \mathrm{median}_j\, x_{jf}|$
–Usual standard deviation: $s_f = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_{if} - m_f)^2}$
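A sketch of standardizing one variable with each of the three scale estimates above (illustrative names; the robust options damp the influence of outliers):

```python
# Standardize a 1-D variable x by a chosen scale estimate.
import numpy as np

def standardize(x, scale="std"):
    if scale == "mean_abs":                 # mean absolute deviation (robust)
        s = np.mean(np.abs(x - x.mean()))
    elif scale == "median_abs":             # median absolute deviation (robust)
        s = np.median(np.abs(x - np.median(x)))
    else:                                   # usual standard deviation
        s = x.std(ddof=1)
    return (x - x.mean()) / s
```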

Continuous Ordinal Variables
These are continuous measurements on an unknown scale, or where only the ordering is known but not the actual magnitude.
–Replace the $x_{if}$ by their ranks $r_{if} \in \{1, \dots, M_f\}$
–Transform the scale to $[0, 1]$ as follows: $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
–Compute the dissimilarities as for interval-scaled variables
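A sketch of the rank transform (illustrative name; assumes distinct values, so $M_f = n$, and does not handle ties):

```python
# Replace values by ranks r in {1, ..., M} and rescale to [0, 1].
import numpy as np

def ordinal_to_unit_interval(x):
    ranks = np.argsort(np.argsort(x)) + 1   # r_if in {1, ..., M_f}
    return (ranks - 1) / (len(x) - 1)
```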

Ratio-Scaled Variables
These are positive continuous measurements on a nonlinear scale, such as an exponential scale. One example would be the growth of a bacterial population (say, with a growth function $Ae^{Bt}$). They can be treated:
–simply as interval-scaled variables, though this is not recommended as it can distort the measurement scale
–as continuous ordinal data
–by first transforming the data (perhaps by taking logarithms), and then treating the results as interval-scaled variables

Discrete Ordinal Variables A variable of this type has M possible values (scores) which are ordered. The dissimilarities are computed in the same way as for continuous ordinal variables.

Nominal Variables
Such a variable has $M$ possible values, which are not ordered. The dissimilarity between objects $i$ and $j$ is usually defined by simple matching:
$$d(i, j) = \frac{p - m}{p}$$
where $p$ is the total number of variables and $m$ is the number of variables on which the two objects match.
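A sketch of this simple-matching dissimilarity across $p$ nominal variables (`nominal_dissimilarity` is an illustrative name):

```python
# d(i, j) = (p - m) / p, with m the number of variables on which i and j match.
import numpy as np

def nominal_dissimilarity(xi, xj):
    xi, xj = np.asarray(xi), np.asarray(xj)
    p = len(xi)
    m = int(np.sum(xi == xj))
    return (p - m) / p
```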

Symmetric Binary Variables
Two possible values, coded 0 and 1, which are equally important (e.g., male and female). Consider the contingency table of the objects $i$ and $j$:

              object j: 1   object j: 0
object i: 1        a             b
object i: 0        c             d

The dissimilarity is the simple matching coefficient $d(i, j) = \frac{b + c}{a + b + c + d}$.

Asymmetric Binary Variables
Two possible values, one of which carries more importance than the other. The most meaningful outcome is coded as 1, and the less meaningful outcome as 0. Typically, 1 stands for the presence of a certain attribute (e.g., a particular disease), and 0 for its absence.

Asymmetric Binary Variables
With the same contingency-table counts $a$, $b$, $c$, $d$ as before, the dissimilarity is usually the Jaccard coefficient, which disregards the 0–0 matches:
$$d(i, j) = \frac{b + c}{a + b + c}$$
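A sketch covering both binary cases via the $a$, $b$, $c$, $d$ counts above (`binary_dissimilarity` is an illustrative name): simple matching for symmetric attributes, Jaccard for asymmetric ones.

```python
# Binary dissimilarities from the 2x2 contingency counts of two 0/1 vectors.
import numpy as np

def binary_dissimilarity(xi, xj, symmetric=True):
    xi, xj = np.asarray(xi), np.asarray(xj)
    a = np.sum((xi == 1) & (xj == 1))      # both 1
    b = np.sum((xi == 1) & (xj == 0))      # 1 in i, 0 in j
    c = np.sum((xi == 0) & (xj == 1))      # 0 in i, 1 in j
    d = np.sum((xi == 0) & (xj == 0))      # both 0
    if symmetric:
        return (b + c) / (a + b + c + d)   # simple matching coefficient
    return (b + c) / (a + b + c)           # Jaccard: 0-0 matches disregarded
```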

Cluster Analysis of Flying Mileages Between 10 American Cities
Lower-triangular matrix of pairwise flying mileages among ATLANTA, CHICAGO, DENVER, HOUSTON, LOS ANGELES, MIAMI, NEW YORK, SAN FRANCISCO, SEATTLE, and WASHINGTON D.C. (diagonal entries 0; e.g., ATLANTA–CHICAGO = 587).

The CLUSTER Procedure ─ Average Linkage Cluster Analysis
Cluster History (NCL : clusters joined):
9 : NEW YORK, WASHINGTON D.C.
8 : LOS ANGELES, SAN FRANCISCO
7 : ATLANTA, CHICAGO
6 : CL7, CL9
5 : CL8, SEATTLE
4 : DENVER, HOUSTON
3 : CL6, MIAMI
2 : CL3, CL4
1 : CL2, CL5

Average Linkage Cluster Analysis

The CLUSTER Procedure ─ Centroid Hierarchical Cluster Analysis
Cluster History (NCL : clusters joined):
9 : NEW YORK, WASHINGTON D.C.
8 : LOS ANGELES, SAN FRANCISCO
7 : ATLANTA, CHICAGO
6 : CL7, CL9
5 : CL8, SEATTLE
4 : DENVER, CL5
3 : CL6, MIAMI
2 : CL3, HOUSTON
1 : CL2, CL4

Centroid Hierarchical Cluster Analysis

The CLUSTER Procedure ─ Single Linkage Cluster Analysis
Cluster History (NCL : clusters joined):
9 : NEW YORK, WASHINGTON D.C.
8 : LOS ANGELES, SAN FRANCISCO
7 : ATLANTA, CL9
6 : CL7, CHICAGO
5 : CL6, MIAMI
4 : CL8, SEATTLE
3 : CL5, HOUSTON
2 : DENVER, CL4
1 : CL3, CL2

Single Linkage Cluster Analysis

The CLUSTER Procedure ─ Ward's Minimum Variance Cluster Analysis
Cluster History (NCL : clusters joined):
9 : NEW YORK, WASHINGTON D.C.
8 : LOS ANGELES, SAN FRANCISCO
7 : ATLANTA, CHICAGO
6 : CL7, CL9
5 : DENVER, HOUSTON
4 : CL8, SEATTLE
3 : CL6, MIAMI
2 : CL3, CL5
1 : CL2, CL4

Ward's Minimum Variance Cluster Analysis

Fisher (1936) Iris Data ─ The FASTCLUS Procedure (Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02)
Initial Seeds table: Cluster, SepalLength, SepalWidth, PetalLength, PetalWidth; followed by the minimum distance between initial seeds.

Fisher (1936) Iris Data ─ The FASTCLUS Procedure (Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02)
Iteration History table: Iteration, Criterion, Relative Change in Cluster Seeds. Convergence criterion is satisfied. Criterion Based on Final Seeds = 5.0417

Fisher (1936) Iris Data ─ The FASTCLUS Procedure
Cluster Summary table: Cluster, Frequency, RMS Std Deviation, Maximum Distance from Seed to Observation, Radius Exceeded, Nearest Cluster, Distance Between Cluster Centroids

Fisher (1936) Iris Data ─ The FASTCLUS Procedure
Statistics for Variables table: Variable (SepalLength, SepalWidth, PetalLength, PetalWidth, OVER-ALL), Total STD, Within STD, R-Square, RSQ/(1−RSQ)
The Pseudo F Statistic, Approximate Expected Over-All R-Squared, and Cubic Clustering Criterion are also reported. WARNING: the latter two values are invalid for correlated variables.

Pseudo F Statistic
$$F = \frac{R^2 / (c - 1)}{(1 - R^2) / (n - c)}$$
where $c$ is the number of clusters and $n$ is the number of observations.
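A one-line sketch of this statistic (`pseudo_f` is an illustrative name), called as `pseudo_f(r_squared, c, n)`:

```python
# Pseudo F = (R^2 / (c - 1)) / ((1 - R^2) / (n - c)).
def pseudo_f(r_squared, c, n):
    return (r_squared / (c - 1)) / ((1 - r_squared) / (n - c))
```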

Fisher (1936) Iris Data ─ The FASTCLUS Procedure (Replace=FULL Radius=0 Maxclusters=2 Maxiter=10 Converge=0.02)
Cluster Means table: Cluster, SepalLength, SepalWidth, PetalLength, PetalWidth
Cluster Standard Deviations table: Cluster, SepalLength, SepalWidth, PetalLength, PetalWidth

Fisher (1936) Iris Data ─ The FREQ Procedure
Table of CLUSTER by Species (cell statistics: Frequency, Percent, Row Pct, Col Pct), with species columns Setosa, Versicolor, and Virginica, plus row and column totals.

Fisher (1936) Iris Data ─ The FASTCLUS Procedure (Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02)
Initial Seeds table: Cluster, SepalLength, SepalWidth, PetalLength, PetalWidth; followed by the minimum distance between initial seeds.

Fisher (1936) Iris Data ─ The FASTCLUS Procedure (Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02)
Iteration History table: Iteration, Criterion, Relative Change in Cluster Seeds. Convergence criterion is satisfied. Criterion Based on Final Seeds = 3.6289

Fisher (1936) Iris Data ─ The FASTCLUS Procedure
Cluster Summary table: Cluster, Frequency, RMS Std Deviation, Maximum Distance from Seed to Observation, Radius Exceeded, Nearest Cluster, Distance Between Cluster Centroids

Fisher (1936) Iris Data ─ The FASTCLUS Procedure
Statistics for Variables table: Variable (SepalLength, SepalWidth, PetalLength, PetalWidth, OVER-ALL), Total STD, Within STD, R-Square, RSQ/(1−RSQ)
Pseudo F Statistic = 561.63; the Approximate Expected Over-All R-Squared and Cubic Clustering Criterion are also reported. WARNING: the latter two values are invalid for correlated variables.

Fisher (1936) Iris Data ─ The FASTCLUS Procedure (Replace=FULL Radius=0 Maxclusters=3 Maxiter=10 Converge=0.02)
Cluster Means table: Cluster, SepalLength, SepalWidth, PetalLength, PetalWidth
Cluster Standard Deviations table: Cluster, SepalLength, SepalWidth, PetalLength, PetalWidth

Fisher (1936) Iris Data ─ The FREQ Procedure
Table of CLUSTER by Species (cell statistics: Frequency, Percent, Row Pct, Col Pct), with species columns Setosa, Versicolor, and Virginica, plus row and column totals.