Cluster Analysis
Charity Morgan
Functional Data Analysis
April 12, 2005
Sources
Everitt, B. S. (1979). Unresolved Problems in Cluster Analysis. Biometrics, 35.
Romesburg, H. C. (1984). Cluster Analysis for Researchers. Belmont: Lifetime Learning Publications.
Everitt, B. S., Landau, S., & Leese, M. (2001). Cluster Analysis. New York: Oxford University Press.
Outline
Motivation
Introduction
Method
  Measure Proximity
  Choose Clustering Method
    Hierarchical Clustering
    Optimization Clustering
  Select Best Clustering
Motivation – An Example
The dataset in this presentation comes from a paper on infant temperament [Stern, H. S., Arcus, D., Kagan, J., Rubin, D. B., & Snidman, N. (1995). Using Mixture Models in Temperament Research. International Journal of Behavioral Development, 18]. 76 infants were measured on 3 dimensions: motor activity (Motor), irritability (Cry), and fear response (Fear).
Motivation Given a data set, can we find natural groupings in the data? How can we decide how many groups exist? Could there be subgroups within the groups?
Introduction – What is Cluster Analysis?
Cluster analysis is a method to uncover groups in data. The group memberships of the data points are not known at the outset. Data points are placed into groups based on how “close” or “far apart” they are from each other.
Introduction – Examples of Cluster Analysis
Astronomy: Faundez-Abans et al. (1996) used cluster analysis to classify 192 planetary nebulae. Psychiatry: Pilowsky et al. (1969) clustered 200 patients, using their responses to a depression symptom questionnaire. Archaeology: Hodson (1971) used a clustering technique to group hand axes found in the British Isles.
Methods – Measurement of Proximity
Given $n$ individuals $X_1, \ldots, X_n$, where $X_i = (x_{i1}, \ldots, x_{ip})$, we will create a dissimilarity matrix $D$, where $d_{ij}$ is the distance between individual $i$ and individual $j$. There are many ways of defining distance.
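For example, with Euclidean distance the matrix D can be computed directly; below is a minimal sketch in Python (the data array X is hypothetical, standing in for any n × p dataset):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical data: n = 4 individuals measured on p = 3 variables.
X = np.array([[1.0, 2.0, 0.5],
              [1.1, 1.9, 0.4],
              [5.0, 6.2, 3.3],
              [4.8, 6.0, 3.1]])

# pdist returns the condensed (upper-triangle) distances; squareform
# expands them into the full n x n dissimilarity matrix D, where
# D[i, j] is the Euclidean distance between individuals i and j.
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 2))
```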
Methods – Hierarchical Clustering
Data is not partitioned into a set number of classes, but classification consists of a series of partitions. Results can be presented as a diagram known as a dendrogram. Can be agglomerative or divisive.
Agglomerative: first partition is n single member clusters; last partition is one cluster containing all n individuals. Divisive: first partition is one cluster containing all n individuals; last partition is n single member clusters.
Methods – Agglomerative Clustering Methods
Single Linkage (Nearest Neighbor) Distance between groups is defined as that of the closest pair of individuals. Only need proximity matrix, not the original data. Tends to produce unbalanced and straggly clusters, especially in large data sets.
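In symbols (a standard formulation consistent with the definition above, not taken verbatim from the slides), the single linkage distance between clusters $A$ and $B$ is

$$d_{AB} = \min_{i \in A,\; j \in B} d_{ij}.$$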
[Figures: single linkage worked example on a five-individual distance matrix]
Add individual 3 to the cluster containing individuals 4 and 5. Then merge the groups (1,2) and (3,4,5) into a single cluster.
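The distance matrix itself appeared only in the slide figures, so the following sketch uses an invented five-individual matrix, chosen so that single linkage reproduces the merge order just described:

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage

# Hypothetical distances for individuals 1-5 (the values from the
# original slides are not recoverable; these are chosen so the merges
# come out as in the text).
D = np.array([[0.0, 1.0, 6.0, 7.0, 8.0],
              [1.0, 0.0, 5.0, 7.5, 8.5],
              [6.0, 5.0, 0.0, 2.0, 2.5],
              [7.0, 7.5, 2.0, 0.0, 1.5],
              [8.0, 8.5, 2.5, 1.5, 0.0]])

# Single linkage merges (1,2), then (4,5), then adds 3 to (4,5),
# and finally joins (1,2) with (3,4,5).
Z = linkage(squareform(D), method="single")
print(Z)  # each row: the two clusters merged, merge height, new size
```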
Complete Linkage (Furthest Neighbor) Distance between groups is that of the furthest pair of individuals. Tends to find compact clusters with equal diameters. Centroid Clustering Distance between groups is the distance between their centers. Requires original data.
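In the same notation (again standard formulations, not from the original slides):

$$d_{AB}^{\text{complete}} = \max_{i \in A,\; j \in B} d_{ij}, \qquad d_{AB}^{\text{centroid}} = \lVert \bar{x}_A - \bar{x}_B \rVert^2,$$

where $\bar{x}_A$ and $\bar{x}_B$ are the group mean vectors (the centroid distance is sometimes taken unsquared).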
[Figures: worked example on the five-individual distance matrix]
Final step will merge clusters (1,2) with (3,4,5).
Ward’s Minimum Variance: at each stage, the objective is to fuse the two clusters whose merger keeps the increase in variance, or within-cluster sum of squares, as small as possible.
I.e., we want to minimize the increase in

$$E = \sum_{m=1}^{g} E_m, \qquad E_m = \sum_{l=1}^{n_m} \sum_{k=1}^{p} \left(x_{ml,k} - \bar{x}_{m,k}\right)^2,$$

where $\bar{x}_{m,k}$ is the mean of the mth cluster for the kth variable and $x_{ml,k}$ is the score on the kth variable for the lth object in the mth cluster.
Tends to find same-sized, spherical clusters. Sensitive to outliers. The most widely used agglomerative technique.
Methods – Divisive Clustering Methods
Can be computationally demanding if all $2^{k-1} - 1$ possible divisions of a cluster of k objects into two subclusters are considered at each stage. Less commonly used than agglomerative methods.
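For example, optimally splitting a single cluster of k = 20 objects already means examining $2^{19} - 1 = 524{,}287$ candidate divisions.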
Methods – Hierarchical Clustering of Motivating Example
Used a Euclidean distance matrix and Ward’s minimum variance technique.
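A minimal sketch of such an analysis in Python (the actual infant temperament measurements are not reproduced here, so a random 76 × 3 stand-in array is used):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Random stand-in for the 76 x 3 data (Motor, Cry, Fear);
# the real values from Stern et al. (1995) are not included here.
rng = np.random.default_rng(0)
X = rng.normal(size=(76, 3))

# scipy's Ward linkage works on the raw observations and uses
# Euclidean distances internally, matching the analysis described.
Z = linkage(X, method="ward")

dendrogram(Z)                                     # draw the hierarchy
plt.show()
labels = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 groups
```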
Methods – Optimization Clustering
Assumes the number of clusters has already been fixed by the investigator. Basic idea: associated with each partition of the n individuals into the required number of groups, g, is an adequacy index c(n, g). This index is used to compare partitions.
Concepts of homogeneity and separation can be used to develop the adequacy index. Homogeneity: objects within a group should have a cohesive structure. Separation: groups should be well isolated from each other.
Methods – Optimization Clustering Criteria
Decompose the total dispersion matrix T, given by

$$T = \sum_{m=1}^{g} \sum_{l=1}^{n_m} (x_{ml} - \bar{x})(x_{ml} - \bar{x})^T,$$

where $x_{ml}$ is the observation vector for the lth object in the mth group and $\bar{x}$ is the overall mean vector, into T = W + B.
W is the within-group dispersion matrix, given by

$$W = \sum_{m=1}^{g} \sum_{l=1}^{n_m} (x_{ml} - \bar{x}_m)(x_{ml} - \bar{x}_m)^T,$$

and B is the between-group dispersion matrix, given by

$$B = \sum_{m=1}^{g} n_m (\bar{x}_m - \bar{x})(\bar{x}_m - \bar{x})^T,$$

where $\bar{x}_m$ is the mean vector of the mth group.
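These definitions are easy to verify numerically; the sketch below uses arbitrary data and an arbitrary partition (both hypothetical) to confirm that T = W + B:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))              # 20 individuals, p = 3 variables
labels = rng.integers(0, 2, size=20)      # an arbitrary 2-group partition

xbar = X.mean(axis=0)                     # overall mean vector
T = (X - xbar).T @ (X - xbar)             # total dispersion matrix

W = np.zeros((3, 3))                      # within-group dispersion
B = np.zeros((3, 3))                      # between-group dispersion
for m in np.unique(labels):
    Xm = X[labels == m]
    xbar_m = Xm.mean(axis=0)
    W += (Xm - xbar_m).T @ (Xm - xbar_m)
    B += len(Xm) * np.outer(xbar_m - xbar, xbar_m - xbar)

print(np.allclose(T, W + B))              # True: the decomposition holds
```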
Minimize trace(W). Equivalent to maximizing trace(B). Minimizes the sum of the squared Euclidean distances between individuals and their group mean. Also known as the k-means algorithm. Not scale-invariant and tends to find spherical clusters.
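scikit-learn's KMeans minimizes exactly this criterion; a minimal sketch (again with a hypothetical stand-in for the data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(76, 3))     # hypothetical stand-in for the data

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)    # group membership of each individual
print(km.inertia_)   # trace(W): within-cluster sum of squares
```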
Minimize det(W) Actually want to maximize det(T)/det(W), but T is the same for all possible partitions of n individuals into g groups. Can identify elliptical clusters and is scale-invariant. Tends to produce clusters that have an equal number of objects and are the same shape.
[Figure: clusters found by (a) minimizing trace(W) and (b) minimizing det(W)]
Minimize a criterion based on the dispersion matrices of the individual groups, where $W_m$, the dispersion matrix within the mth group, is given by

$$W_m = \sum_{l=1}^{n_m} (x_{ml} - \bar{x}_m)(x_{ml} - \bar{x}_m)^T.$$

Can produce clusters of different shapes. Not often used.
Methods – Optimization Clustering of Motivating Example
[Figures: optimization clustering results for the infant temperament data]
Methods – Choosing the Optimal Number of Clusters
Plot the clustering criterion against the number of groups and look for large changes in the plot. Choose g to maximize

$$C(g) = \frac{\operatorname{trace}(B)/(g-1)}{\operatorname{trace}(W)/(n-g)}.$$

Choose g to minimize $g^2 \det(W)$.
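A sketch of scanning both criteria over g (scikit-learn's calinski_harabasz_score computes C(g); the data array is again a hypothetical stand-in):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score

rng = np.random.default_rng(3)
X = rng.normal(size=(76, 3))     # hypothetical stand-in for the data

for g in range(2, 7):
    labels = KMeans(n_clusters=g, n_init=10, random_state=0).fit_predict(X)
    C = calinski_harabasz_score(X, labels)    # C(g): choose g to maximize it
    W = np.zeros((3, 3))
    for m in range(g):
        Xm = X[labels == m]
        dev = Xm - Xm.mean(axis=0)
        W += dev.T @ dev                      # within-group dispersion
    print(g, round(C, 2), g**2 * np.linalg.det(W))  # g^2 det(W): minimize
```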
Hypothesis tests: let $J_1^2(m)$ be the within-cluster sum of squares of the mth cluster, and let $J_2^2(m)$ be the within-cluster sum of squares when the mth cluster is optimally divided in two. Reject the null hypothesis that the mth cluster is homogeneous if $L(m)$ exceeds the critical value of a standard normal, where

$$L(m) = \left(1 - \frac{J_2^2(m)}{J_1^2(m)} - \frac{2}{\pi p}\right)\left(\frac{n_m\, p}{2\left(1 - 8/(\pi^2 p)\right)}\right)^{1/2}$$

and $n_m$ is the number of objects in the mth cluster.
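A direct transcription of this statistic as reconstructed above (both the reconstruction and the numbers fed in are assumptions for illustration):

```python
import numpy as np

def L_statistic(J1_sq, J2_sq, n_m, p):
    """L(m) as given above: J1_sq is the within-cluster sum of squares
    of cluster m, J2_sq the sum of squares after its optimal two-way
    split, n_m the cluster size, p the number of variables."""
    term = 1.0 - J2_sq / J1_sq - 2.0 / (np.pi * p)
    scale = np.sqrt(n_m * p / (2.0 * (1.0 - 8.0 / (np.pi**2 * p))))
    return term * scale

# Hypothetical example: splitting drops the WSS from 100 to 60.
L = L_statistic(J1_sq=100.0, J2_sq=60.0, n_m=40, p=3)
print(L, L > 1.645)   # compare with the 5% standard normal critical value
```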
Let $S_g^2$ be the sum of squared deviations from cluster centroids when there are g clusters. A division of the n objects into $g_2$ clusters is significantly better than one into $g_1$ clusters ($g_2 > g_1$) if $F^*(g_1, g_2)$ exceeds the critical value of an F distribution with $p(g_2 - g_1)$ and $p(n - g_2)$ degrees of freedom, where

$$F^*(g_1, g_2) = \frac{\left(S_{g_1}^2 - S_{g_2}^2\right)/S_{g_2}^2}{\left(\frac{n - g_1}{n - g_2}\right)\left(\frac{g_2}{g_1}\right)^{2/p} - 1}.$$
The End