Cluster Analysis
POSTECH, Department of Industrial Engineering, Probability and Statistics Lab
Lee Jae-Hyun
POSTECH IE PASTA - CLUSTER ANALYSIS

Definition
Cluster analysis is a technique for combining observations into groups, or clusters, such that:
- each group or cluster is homogeneous, or compact, with respect to certain characteristics, and
- each group differs from the other groups with respect to those same characteristics.

Examples
- A marketing manager is interested in identifying similar cities that can be used for test marketing.
- The campaign manager for a political candidate is interested in identifying groups of voters who have similar views on important issues.
Objective of cluster analysis
The objective of cluster analysis is to group observations into clusters such that each cluster is as homogeneous as possible with respect to the clustering variables.

Overview of cluster analysis
Step 1: n objects measured on p variables
Step 2: transform to an n x n similarity (distance) matrix
Step 3: cluster formation (hierarchical or nonhierarchical clusters)
Step 4: cluster profiling
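The four steps above can be sketched with SciPy's hierarchical-clustering utilities; the six-subject income/education data used in the deck's later examples is assumed here:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Step 1: n = 6 objects measured on p = 2 variables (income, education).
X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)

# Step 2: transform to an n x n similarity (distance) matrix.
# The deck's matrices hold squared Euclidean distances.
D = squareform(pdist(X, metric="sqeuclidean"))

# Step 3: cluster formation (hierarchical; single linkage as one choice).
Z = linkage(pdist(X, metric="sqeuclidean"), method="single")

# Step 4: cluster profile -- cut the tree into 3 clusters, inspect centroids.
labels = fcluster(Z, t=3, criterion="maxclust")
for c in np.unique(labels):
    print(c, X[labels == c].mean(axis=0))
```

With this data the three clusters come out as {S1, S2}, {S3, S4}, {S5, S6}.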
Key problems

Measure of similarity
Fundamental to the use of any clustering technique is the computation of a measure of similarity or distance between the respective objects.
- Distance-type measures: Euclidean distance, Euclidean distance for standardized data, Mahalanobis distance
- Matching-type measures: association coefficients, correlation coefficients

A procedure for forming the clusters
- Hierarchical clustering: centroid method, single-linkage method, complete-linkage method, average-linkage method, Ward's method
- Nonhierarchical clustering: k-means clustering
Similarity measure: distance type

Minkowski metric
  d(i, j) = [ sum_k |x_ik - x_jk|^r ]^(1/r)
If r = 2, this is the Euclidean distance; if r = 1, it is the absolute (city-block) distance.

Consider the example below. The similarity matrix holds squared Euclidean distances.

Data
Subject  Income  Education
S1          5        5
S2          6        6
S3         15       14
S4         16       15
S5         25       20
S6         30       19

Similarity matrix (squared Euclidean distances)
        S1      S2      S3      S4      S5      S6
S1      0.00    2.00  181.00  221.00  625.00  821.00
S2      2.00    0.00  145.00  181.00  557.00  745.00
S3    181.00  145.00    0.00    2.00  136.00  250.00
S4    221.00  181.00    2.00    0.00  106.00  212.00
S5    625.00  557.00  136.00  106.00    0.00   26.00
S6    821.00  745.00  250.00  212.00   26.00    0.00

Distance is not scale invariant: measuring height in feet versus inches changes the distances and even their ordering. In feet, A is closest to B; in inches, A is closest to C.

Person  Weight (lb)  Height (ft)
A          160          5.5
B          163          6.2
C          165          6.0

Height in feet     Height in inches
d_AB = 3.08        d_AB = 8.92
d_AC = 5.02        d_AC = 7.81
d_BC = 2.01        d_BC = 3.12
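The distances above can be reproduced with a small Minkowski helper (the function name is illustrative). Person C's weight is taken as 165 lb, the value consistent with the printed distances:

```python
import numpy as np

def minkowski(x, y, r):
    """Minkowski metric: d(x, y) = (sum_k |x_k - y_k|**r) ** (1/r)."""
    return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** r) ** (1.0 / r))

# Six subjects measured on income and education (the deck's example data).
X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)

# The deck's similarity matrix holds *squared* Euclidean distances.
D2 = np.array([[minkowski(a, b, 2) ** 2 for b in X] for a in X])

# Scale dependence: heights in feet vs. inches change the distance ordering.
feet = np.array([[160, 5.5], [163, 6.2], [165, 6.0]])  # persons A, B, C
inches = feet * np.array([1, 12.0])                    # convert height only
d_ab_ft, d_ac_ft = minkowski(feet[0], feet[1], 2), minkowski(feet[0], feet[2], 2)
d_ab_in, d_ac_in = minkowski(inches[0], inches[1], 2), minkowski(inches[0], inches[2], 2)
```

In feet A is closer to B (3.08 < 5.02); in inches A is closer to C (8.92 > 7.81).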
Euclidean distance for standardized data
To make the measure scale invariant, each squared difference is weighted by the reciprocal of that variable's sample variance:
  d^2(i, j) = sum_k (x_ik - x_jk)^2 / s_k^2

Mahalanobis distance
  D^2(i, j) = (x_i - x_j)' S^-1 (x_i - x_j)
where x_i is a p x 1 vector and S is the p x p covariance matrix. It is designed to take into account the correlation among the variables and is also scale invariant.

Similarity matrix (squared Euclidean distances on standardized data)
        S1     S2     S3     S4     S5     S6
S1     0.00   0.035  3.00   3.68   9.55  11.09
S2     0.035  0.00   2.38   3.00   8.45   9.94
S3     3.00   2.38   0.00   0.035  1.89   2.87
S4     3.68   3.00   0.035  0.00   1.43   2.36
S5     9.55   8.45   1.89   1.43   0.00   0.28
S6    11.09   9.94   2.87   2.36   0.28   0.00
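A minimal sketch of both scale-invariant measures on the same data, assuming the sample variance (n - 1 divisor) used in the matrix above:

```python
import numpy as np

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)

# Squared Euclidean distance on standardized data: weight each squared
# difference by 1 / s_k^2, the reciprocal of variable k's sample variance.
var = X.var(axis=0, ddof=1)

def d2_std(x, y):
    return float(np.sum((x - y) ** 2 / var))

# Mahalanobis distance: (x - y)' S^-1 (x - y), with S the covariance matrix,
# which also accounts for correlation between income and education.
S_inv = np.linalg.inv(np.cov(X, rowvar=False))

def d2_mahalanobis(x, y):
    diff = x - y
    return float(diff @ S_inv @ diff)
```

`d2_std` reproduces the standardized matrix above, e.g. 0.035 for S1-S2 and 3.00 for S1-S3.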
Similarity measure: matching type

Association coefficients
This type of measure is used to represent similarity for binary variables.

               Attribute
Person   1  2  3  4  5  6
A        0  1  1  0  1  1
B        1  0  1  0  0  1

Counts of matches and mismatches (rows: person B, columns: person A)
        A+   A-
B+       2    1    3
B-       2    1    3
         4    2    6

Similarity coefficients are then computed from these counts; for example, the simple matching coefficient (2 + 1)/6 = 0.5 counts both 1-1 and 0-0 agreements.
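The slide leaves the coefficients unnamed; two standard choices computed from the 2 x 2 counts are the simple matching coefficient and the Jaccard coefficient:

```python
import numpy as np

A = np.array([0, 1, 1, 0, 1, 1])  # person A's six binary attributes
B = np.array([1, 0, 1, 0, 0, 1])  # person B's six binary attributes

# 2 x 2 contingency counts.
a = int(np.sum((A == 1) & (B == 1)))  # both attributes present
b = int(np.sum((A == 1) & (B == 0)))
c = int(np.sum((A == 0) & (B == 1)))
d = int(np.sum((A == 0) & (B == 0)))  # both attributes absent

simple_matching = (a + d) / (a + b + c + d)  # counts 0-0 matches as agreement
jaccard = a / (a + b + c)                    # ignores 0-0 matches
```

Which coefficient is appropriate depends on whether joint absence of an attribute should count as similarity.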
Correlation coefficient
The Pearson product-moment correlation coefficient can also be used as a measure of similarity.

Person  X1  X2  X3  X4
A        1   3   2   2
B        4  10   7   7
C        1   2   2   2

r_AB = 1, r_AC = 0.82
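The correlations quoted above can be checked directly:

```python
import numpy as np

A = np.array([1, 3, 2, 2], dtype=float)
B = np.array([4, 10, 7, 7], dtype=float)
C = np.array([1, 2, 2, 2], dtype=float)

def pearson(x, y):
    """Pearson product-moment correlation, used here as a similarity measure."""
    xd, yd = x - x.mean(), y - y.mean()
    return float(np.sum(xd * yd) / np.sqrt(np.sum(xd ** 2) * np.sum(yd ** 2)))

r_ab = pearson(A, B)  # B has the same profile shape as A, so r = 1
r_ac = pearson(A, C)
```

Note that correlation measures similarity of profile *shape*, not of level: A and B correlate perfectly even though B's values are much larger.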
Hierarchical clustering: centroid method
Each group is replaced by an average subject, the centroid of that group; at each step the two clusters whose centroids are closest (smallest squared Euclidean distance) are merged.

Data for five clusters
Cluster  Members  Income  Education
1        S1, S2     5.5      5.5
2        S3        15.0     14.0
3        S4        16.0     15.0
4        S5        25.0     20.0
5        S6        30.0     19.0

Similarity matrix
         S1&S2    S3      S4      S5      S6
S1&S2     0.00  162.50  200.50  590.50  782.50
S3      162.50    0.00    2.00  136.00  250.00
S4      200.50    2.00    0.00  106.00  212.00
S5      590.50  136.00  106.00    0.00   26.00
S6      782.50  250.00  212.00   26.00    0.00

Data for four clusters
Cluster  Members  Income  Education
1        S1, S2     5.5      5.5
2        S3, S4    15.5     14.5
3        S5        25.0     20.0
4        S6        30.0     19.0

Similarity matrix
         S1&S2   S3&S4    S5      S6
S1&S2     0.00  181.00  590.50  782.50
S3&S4   181.00    0.00  120.50  230.50
S5      590.50  120.50    0.00   26.00
S6      782.50  230.50   26.00    0.00
Hierarchical clustering

Data for three clusters
Cluster  Members  Income  Education
1        S1, S2     5.5      5.5
2        S3, S4    15.5     14.5
3        S5, S6    27.5     19.5

Similarity matrix
         S1&S2   S3&S4   S5&S6
S1&S2     0.00  181.00  680.00
S3&S4   181.00    0.00  169.00
S5&S6   680.00  169.00    0.00
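The centroid method's merge loop can be sketched as plain agglomeration over squared Euclidean centroid distances; the tie between {S1, S2} and {S3, S4} at distance 2 is broken here by scan order:

```python
import numpy as np

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)

clusters = [[i] for i in range(len(X))]   # start with six singleton clusters
merges = []                               # records (members_p, members_q, distance)
while len(clusters) > 1:
    cents = [X[m].mean(axis=0) for m in clusters]
    best, bi, bj = None, None, None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            # Squared Euclidean distance between the two cluster centroids.
            d2 = float(np.sum((cents[i] - cents[j]) ** 2))
            if best is None or d2 < best:
                best, bi, bj = d2, i, j
    merges.append((clusters[bi], clusters[bj], best))
    clusters[bi] = clusters[bi] + clusters[bj]   # replace pair by their union
    del clusters[bj]
```

The merge distances come out as 2, 2, 26, 169, 388.25; the fourth merge joins {S3, S4} with {S5, S6} at the centroid distance 169 shown in the matrix above.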
Hierarchical clustering: single-linkage method
The distance between two clusters is the minimum of the distances between all possible pairs of subjects, one from each cluster. For example:
  d(S1&S2, S3) = min(d(S1, S3), d(S2, S3)) = min(181, 145) = 145
  d(S1&S2, S4) = min(d(S1, S4), d(S2, S4)) = min(221, 181) = 181

Similarity matrix
         S1&S2    S3      S4      S5      S6
S1&S2     0.00  145.00  181.00  557.00  745.00
S3      145.00    0.00    2.00  136.00  250.00
S4      181.00    2.00    0.00  106.00  212.00
S5      557.00  136.00  106.00    0.00   26.00
S6      745.00  250.00  212.00   26.00    0.00
Hierarchical clustering: complete-linkage method
The distance between two clusters is defined as the maximum of the distances between all possible pairs of observations in the two clusters. For example:
  d(S1&S2, S3) = max(d(S1, S3), d(S2, S3)) = max(181, 145) = 181
  d(S1&S2, S5) = max(d(S1, S5), d(S2, S5)) = max(625, 557) = 625

Similarity matrix
         S1&S2    S3      S4      S5      S6
S1&S2     0.00  181.00  221.00  625.00  821.00
S3      181.00    0.00    2.00  136.00  250.00
S4      221.00    2.00    0.00  106.00  212.00
S5      625.00  136.00  106.00    0.00   26.00
S6      821.00  250.00  212.00   26.00    0.00
Hierarchical clustering: average-linkage method
The distance between two clusters is obtained by taking the average of the distances between all pairs of subjects in the two clusters. For example:
  d(S1&S2, S3) = (181 + 145) / 2 = 163

Similarity matrix
         S1&S2    S3      S4      S5      S6
S1&S2     0.00  163.00  201.00  591.00  783.00
S3      163.00    0.00    2.00  136.00  250.00
S4      201.00    2.00    0.00  106.00  212.00
S5      591.00  136.00  106.00    0.00   26.00
S6      783.00  250.00  212.00   26.00    0.00
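All three linkage rules are available in `scipy.cluster.hierarchy`; feeding it the squared Euclidean distances reproduces the merge heights implied by the matrices above. (SciPy's centroid and Ward methods are omitted here because they expect raw Euclidean input.)

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)
d = pdist(X, metric="sqeuclidean")   # condensed squared-Euclidean distances

# The three rules differ only in how the distance between two clusters is
# summarized: min (single), max (complete), or mean (average) of all pairs.
Z_single = linkage(d, method="single")
Z_complete = linkage(d, method="complete")
Z_average = linkage(d, method="average")
```

Column 2 of each result holds the merge heights: single linkage merges at 2, 2, 26, 106, 145, complete at 2, 2, 26, 221, 821, and average at 2, 2, 26, 176, 434.5, so the choice of linkage changes how strongly the chain {S1, S2} resists joining the rest.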
Hierarchical clustering: Ward's method
Ward's method forms clusters by maximizing within-cluster homogeneity, with the within-group sum of squares used as the measure of homogeneity. At each step it merges the pair of clusters whose union gives the smallest increase in the total within-group (within-cluster) sum of squares.
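A minimal sketch of Ward's greedy criterion on the deck's data, stopped at three clusters:

```python
import numpy as np

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)

def within_ss(members):
    """Within-cluster sum of squares about the cluster centroid."""
    pts = X[list(members)]
    return float(np.sum((pts - pts.mean(axis=0)) ** 2))

# Ward's method: repeatedly merge the pair of clusters whose union gives the
# smallest increase in the total within-cluster sum of squares.
clusters = [{i} for i in range(len(X))]
while len(clusters) > 3:                       # stop at three clusters
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: within_ss(clusters[ij[0]] | clusters[ij[1]])
                       - within_ss(clusters[ij[0]]) - within_ss(clusters[ij[1]]),
    )
    clusters[i] = clusters[i] | clusters[j]
    del clusters[j]

total_wss = sum(within_ss(c) for c in clusters)
```

For this data the three-cluster Ward solution is {S1, S2}, {S3, S4}, {S5, S6}, with total within-cluster sum of squares 1 + 1 + 13 = 15.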
Evaluating the cluster solution and determining the number of clusters

Root-mean-square standard deviation (RMSSTD) of the new cluster
RMSSTD is the pooled standard deviation of all the variables forming the cluster:
  pooled variance = pooled SS for all the variables / pooled degrees of freedom for all the variables
  RMSSTD = sqrt(pooled variance)

R-squared (RS)
RS is the ratio of SS_b to SS_t (SS_t = SS_b + SS_w). For example, the RS of CL2 is (701.667 - 184.000) / 701.667 = 0.7378.

Within-group sums of squares and degrees of freedom for the clusters formed in steps 1-5
                Within-group sum of squares    Degrees of freedom
Step  Cluster   Income   Education   Pooled    Income  Education  Pooled   RMSSTD
1     CL5        0.500     0.500      1.000      1        1          2      0.70
2     CL4        0.500     0.500      1.000      1        1          2      0.70
3     CL3       12.500     0.500     13.000      1        1          2      2.55
4     CL2      157.000    26.000    183.000      3        3          6      5.52
5     CL1      498.833   202.833    701.667      5        5         10      8.38
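RMSSTD and RS for CL2, the cluster {S3, S4, S5, S6} formed at step 4, can be computed directly (function names are illustrative):

```python
import numpy as np

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)

def rmsstd(members):
    """Pooled standard deviation over all variables in one cluster."""
    pts = X[members]
    pooled_ss = float(np.sum((pts - pts.mean(axis=0)) ** 2))
    pooled_df = (len(members) - 1) * pts.shape[1]   # (n - 1) df per variable
    return float(np.sqrt(pooled_ss / pooled_df))

def r_squared(partition):
    """RS = SS_b / SS_t = (SS_t - SS_w) / SS_t for a given partition."""
    ss_t = float(np.sum((X - X.mean(axis=0)) ** 2))
    ss_w = sum(float(np.sum((X[m] - X[m].mean(axis=0)) ** 2)) for m in partition)
    return (ss_t - ss_w) / ss_t

cl2 = [2, 3, 4, 5]   # S3..S6, the cluster formed at step 4
```

Here `rmsstd(cl2)` is sqrt(183/6) = 5.52 and `r_squared` of the two-cluster partition {S1, S2} versus {S3..S6} is 0.7378.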
Semipartial R-squared (SPR)
The amount by which the pooled SS_w of the new cluster exceeds the sum of the pooled SS_w's of the clusters joined to obtain it is called the loss of homogeneity. If the loss of homogeneity is large, then the new cluster is obtained by merging two heterogeneous clusters. SPR is this loss of homogeneity due to combining two groups or clusters to form a new group or cluster, expressed as a fraction of SS_t. For example, the SPR of CL2 is (183 - (1 + 13)) / 701.667 = 0.241.

Distance between clusters
This is simply the Euclidean distance between the centroids of the two clusters that are to be joined or merged, and it is termed the centroid distance (CD); squared, as in the similarity matrices used here, it is 169 for the merge of clusters 2 and 3.

Data for three clusters
Cluster  Members  Income  Education
1        S1, S2     5.5      5.5
2        S3, S4    15.5     14.5
3        S5, S6    27.5     19.5
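SPR and CD for the step-4 merge of {S3, S4} with {S5, S6} follow from the same data:

```python
import numpy as np

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)
SS_T = float(np.sum((X - X.mean(axis=0)) ** 2))   # total sum of squares

def wss(members):
    pts = X[members]
    return float(np.sum((pts - pts.mean(axis=0)) ** 2))

def spr(p, q):
    """Semipartial R-squared: loss of homogeneity from merging p and q, / SS_t."""
    return (wss(p + q) - wss(p) - wss(q)) / SS_T

def centroid_distance(p, q):
    """Squared Euclidean distance between the two cluster centroids."""
    return float(np.sum((X[p].mean(axis=0) - X[q].mean(axis=0)) ** 2))
```

For the merge that forms CL2, `spr([2, 3], [4, 5])` gives (183 - 14) / SS_t = 0.241 and `centroid_distance([2, 3], [4, 5])` gives the 169 appearing in the three-cluster similarity matrix.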
Summary of the statistics for evaluating a cluster solution
Statistic  Concept measured                 Comment
RMSSTD     Homogeneity of new clusters      Value should be small
SPR        Homogeneity of merged clusters   Value should be small
RS         Homogeneity of new clusters      Value should be high
CD         Homogeneity of merged clusters   Value should be small
Nonhierarchical clustering
The data are divided into k partitions or groups, with each partition representing a cluster. The number of clusters must be known a priori.

Steps
1. Select k initial cluster centroids or seeds, where k is the number of clusters desired.
2. Assign each observation to the cluster to which it is closest.
3. Reassign or reallocate each observation to one of the k clusters according to a predetermined stopping rule.
4. Stop if there is no reallocation of data points or if the reassignment satisfies the criteria set by the stopping rule; otherwise go to Step 2.

Nonhierarchical algorithms differ in:
- the method used for obtaining initial cluster centroids or seeds, and
- the rule used for reassigning observations.
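The generic loop above can be sketched as plain Lloyd-style k-means. The seeds S1, S3, S5 are chosen here for illustration, one from each eventual cluster; seeding with the first k observations, as Algorithm 1 below does, converges to a poorer partition:

```python
import numpy as np

def kmeans(X, seeds, max_iter=100):
    """Plain Lloyd iteration: assign each observation to the nearest centroid,
    recompute the centroids, and stop when no observation is reallocated."""
    centroids = X[list(seeds)].astype(float)
    labels = None
    for _ in range(max_iter):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                  # stopping rule: no reallocation
        labels = new_labels
        centroids = np.array([X[labels == k].mean(axis=0)
                              for k in range(len(centroids))])
    return labels, centroids

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)
labels, centroids = kmeans(X, seeds=[0, 2, 4])   # seeds S1, S3, S5
```

With these seeds the loop converges to {S1, S2}, {S3, S4}, {S5, S6} with centroids (5.5, 5.5), (15.5, 14.5), (27.5, 19.5). (This sketch assumes no cluster empties out, which holds for this data.)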
Nonhierarchical clustering: Algorithm 1
1. Select the first k observations as initial cluster centers.
2. Assign each observation to its closest cluster center, then compute the centroid of each cluster.
3. Reassign each observation by computing its distance from each cluster centroid; repeat until no observation changes cluster.

Initial cluster centroids
Variable    Cluster 1  Cluster 2  Cluster 3
Income          5          6         15
Education       5          6         14

Distance from cluster centroids (squared Euclidean)
Observation     1        2        3     Assigned to cluster
S1             0.00     2.00   181.00        1
S2             2.00     0.00   145.00        2
S3           181.00   145.00     0.00        3
S4           221.00   181.00     2.00        3
S5           625.00   557.00   136.00        3
S6           821.00   745.00   250.00        3

Centroids of the three clusters
Variable    Cluster 1  Cluster 2  Cluster 3
Income          5          6        21.5
Education       5          6        17.0

Reassignment of observations
Observation     1        2        3     Previous  Reassignment
S1             0.00     2.00   416.25       1         1
S2             2.00     0.00   361.25       2         2
S3           181.00   145.00    51.25       3         3
S4           221.00   181.00    34.25       3         3
S5           625.00   557.00    21.25       3         3
S6           821.00   745.00    76.25       3         3

No observation is reassigned, so the algorithm stops with the clusters {S1}, {S2}, {S3, S4, S5, S6}.
Nonhierarchical clustering: Algorithm 2
1. Select the first k observations as initial cluster centers (seeds).
2. As the remaining observations are processed, seeds may be replaced by them; the clusters evolve as
   1. {1}, {2}, {3}
   2. {1}, {2}, {3, 4}
   3. {1, 2}, {5}, {3, 4}
   4. {1, 2}, {5, 6}, {3, 4}
3. Observations are then reassigned by computing the distance of each observation from each cluster centroid.

Distance from cluster seeds (S1, S3, S5; squared Euclidean)
Observation     1        2        3     Assigned to cluster
S1             0.00   181.00   625.00        1
S2             2.00   145.00   557.00        1
S3           181.00     0.00   136.00        2
S4           221.00     2.00   106.00        2
S5           625.00   136.00     0.00        3
S6           821.00   250.00    26.00        3

Centroids of the three clusters
Variable    Cluster 1  Cluster 2  Cluster 3
Income         5.5       15.5       27.5
Education      5.5       14.5       19.5

Reassignment of observations
Observation     1        2        3     Previous  Reassignment
S1             0.50   200.50   716.50       1         1
S2             0.50   162.50   644.50       1         1
S3           162.50     0.50   186.50       2         2
S4           200.50     0.50   152.50       2         2
S5           590.50   120.50     6.50       3         3
S6           782.50   230.50     6.50       3         3
Nonhierarchical clustering: Algorithm 3
The initial seeds and the initial partition are selected so as to keep the error sum of squares (ESS) small: let Sum(i) be the sum of the values of the variables for observation i, and assign observations to clusters in order of Sum(i). An observation is then moved only if the move decreases the ESS.

Initial assignment
Subject  Income  Education  Sum(i)  Assigned to cluster
S1          5        5        10          1
S2          6        6        12          1
S3         15       14        29          2
S4         16       15        31          2
S5         25       20        45          3
S6         30       19        49          3

Centroids of the three clusters
Variable    Cluster 1  Cluster 2  Cluster 3
Income         5.5       15.5       27.5
Education      5.5       14.5       19.5

Change in ESS for moving S1 to cluster 3:
  3[(5 - 27.5)^2 + (5 - 19.5)^2]/2 (increase) - [(5 - 5.5)^2 + (5 - 5.5)^2]/2 (decrease) = 1074.75 - 0.25 = 1074.5

Change in ESS if an observation is moved (a dash marks the observation's current cluster)
Observation  Cluster 1  Cluster 2  Cluster 3  Previous  Reassignment
S1               -        300.50    1074.50      1          1
S2               -        243.75     966.50      1          1
S3             243.50       -        279.50      2          2
S4             300.75       -        228.50      2          2
S5             882.50     177.50       -         3          3
S6            1170.50     585.50       -         3          3

Every change is positive, so no observation is reassigned.
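The table's entries follow the slide's own arithmetic: increase (n_q + 1) * d^2 / n_q for the receiving cluster and decrease d^2 / n_p for the cluster being left. (Note the textbook-standard increase for moving a point into a cluster of size n_q is n_q * d^2 / (n_q + 1); the form below is what reproduces the printed numbers.)

```python
import numpy as np

X = np.array([[5, 5], [6, 6], [15, 14], [16, 15], [25, 20], [30, 19]], dtype=float)
clusters = [[0, 1], [2, 3], [4, 5]]             # initial assignment by Sum(i)
centroids = [X[c].mean(axis=0) for c in clusters]

def d2(x, c):
    """Squared Euclidean distance from observation x to centroid c."""
    return float(np.sum((x - c) ** 2))

def delta_ess(i, p, q):
    """Change in ESS for moving observation i from cluster p to cluster q,
    following the slide's arithmetic."""
    n_p, n_q = len(clusters[p]), len(clusters[q])
    increase = (n_q + 1) * d2(X[i], centroids[q]) / n_q
    decrease = d2(X[i], centroids[p]) / n_p
    return increase - decrease
```

For example `delta_ess(0, 0, 2)` reproduces the 1074.5 worked out on the slide for moving S1 to cluster 3; since every candidate move increases the ESS, the initial partition is already stable.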
Which clustering method is best?

Hierarchical methods
- Advantage: they do not require a priori knowledge of the number of clusters or of the starting partition.
- Disadvantage: once an observation is assigned to a cluster, it cannot be reassigned to another cluster.

Nonhierarchical methods
- The cluster centers or the initial partition must be identified before the technique can proceed to cluster the observations.
- Nonhierarchical clustering algorithms are, in general, very sensitive to the initial partition. The k-means algorithm and other nonhierarchical algorithms perform poorly when random initial partitions are used; their performance is much better when the results of a hierarchical method are used to form the initial partition.

Hierarchical and nonhierarchical techniques should therefore be viewed as complementary clustering techniques rather than as competing techniques.