Clustering Validity Adriano Joaquim de O Cruz ©2006 NCE/UFRJ
Adriano Cruz, NCE and IM - UFRJ
Clustering Validity
- The number of clusters is not always known in advance.
- In many problems the number of classes is known, but it is not the best clustering configuration.
- Methods are therefore needed to indicate and/or validate the number of clusters.
Clustering Validity: Example 1
- Consider the problem of handwritten digit recognition.
- It is known that there are 10 classes (the 10 digits).
- The number of clusters, however, may be greater than 10.
- This happens because the same digit can be handwritten in different ways.
Clustering Validity: Example 2
- Consider the problem of segmenting a thermal image of a room.
- It is known that there are 2 classes of temperature: body temperature and room temperature.
- This is a problem in which the number of classes is well defined.
Clustering Validity: Problem
- First, the data is partitioned into different numbers of clusters.
- It is also important to try different initial conditions for the same number of clusters.
- Validity measures are applied to these partitions to estimate their quality.
- Quality must be estimated both when the number of clusters changes and, for a fixed number of clusters, when the initial conditions differ.
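The procedure above can be sketched as a double loop over the number of clusters and the initial conditions. This is only an illustration: the data, the range `cRange`, the count `nInit` and the use of the mean silhouette as the validity measure are assumptions made here, and `kmeans` and `silhouette` are the Statistics Toolbox functions used in the examples later in these slides.

```matlab
rng(1);                                     % fixed seed, only so the run is repeatable
X = [randn(100,2) + 2; randn(100,2) - 2];   % hypothetical two-group sample
cRange = 2:5;                               % candidate numbers of clusters
nInit  = 5;                                 % initial conditions tried per value of c
quality = zeros(numel(cRange), nInit);
for a = 1:numel(cRange)
    for r = 1:nInit
        cidx = kmeans(X, cRange(a));        % each call draws new initial centres
        quality(a, r) = mean(silhouette(X, cidx));   % validity of this partition
    end
end
bestPerC = max(quality, [], 2);             % best initial condition for each c
[~, best] = max(bestPerC);                  % best number of clusters overall
fprintf('best number of clusters: %d\n', cRange(best));
```

Any of the validity measures discussed in the following sections could replace the mean silhouette in the inner loop.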
Clustering Validity L-Clusters
Initial Definitions
- d(e_i, e_k) is the dissimilarity between elements e_i and e_k.
- Euclidean distance is an example of a dissimilarity measure.
L-Cluster: Definition
- C is an L-cluster if, for each object e_i belonging to C:
  $\max_{e_k \in C} d(e_i, e_k) < \min_{e_h \notin C} d(e_i, e_h)$
- That is, the maximum distance between e_i and any other element e_k of C is smaller than the minimum distance between e_i and any element e_h from another cluster.
[Figure: L-cluster C]
L*-Cluster: Definition
- C is an L*-cluster if, for each object e_i belonging to C:
  $\max_{e_k \in C} d(e_i, e_k) < \min_{e_l \in C,\ e_h \notin C} d(e_l, e_h)$
- That is, every distance inside C is smaller than the smallest distance between any element of C and any element outside C.
[Figure: L*-cluster C]
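The two definitions can be checked directly from a dissimilarity matrix. The sketch below assumes Euclidean distance and uses a small hypothetical data set; the names `X`, `inC`, `isL` and `isLstar` are illustrative only.

```matlab
% Hypothetical data: rows are elements, inC marks the members of the candidate cluster C.
X   = [0 0; 1 0; 2 0; 1 1.8];
inC = logical([1 1 1 0]);

% Pairwise Euclidean dissimilarities d(e_i, e_k), base MATLAB only.
n = size(X, 1);
D = zeros(n);
for i = 1:n
    for k = 1:n
        D(i, k) = norm(X(i, :) - X(k, :));
    end
end

within  = D(inC, inC);     % distances between members of C
outside = D(inC, ~inC);    % distances from members of C to elements outside C

% L-cluster: the comparison is made separately for every e_i in C.
isL = all(max(within, [], 2) < min(outside, [], 2));

% L*-cluster: every within-cluster maximum is compared with one global minimum.
isLstar = all(max(within, [], 2) < min(outside(:)));

fprintf('L-cluster: %d, L*-cluster: %d\n', isL, isLstar);
```

For this data C is an L-cluster but not an L*-cluster: the outside point is farther from each end point of C than the far end of C is, yet its distance to the middle element (1.8) is smaller than the diameter of C (2). This shows that the L* condition is the stricter of the two.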
Clustering Validity Silhouettes
Introduction
- P.J. Rousseeuw, "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis", Journal of Computational and Applied Mathematics, 1987.
- Each cluster is represented by one silhouette, showing which objects lie well within the cluster.
- The user can compare the quality of the clusters.
Method - I
- Consider a cluster A.
- For each element e_i in A, calculate the average dissimilarity to all other objects of A: a(e_i) = d(e_i, A).
- Therefore, A cannot be a singleton.
- Euclidean distance is an example of dissimilarity.
Method - II
- Consider all clusters C_k different from A.
- Calculate d_k(e_i, C_k), the average dissimilarity of e_i to all elements of C_k.
- Select b(e_i) = min_k d_k(e_i, C_k).
- Let B be the cluster for which this minimum is attained.
- B is the second-best choice for e_i.
Method - III
- The silhouette s(e_i) is equal to:
  $s(e_i) = \begin{cases} 1 - a(e_i)/b(e_i), & a(e_i) < b(e_i) \\ 0, & a(e_i) = b(e_i) \\ b(e_i)/a(e_i) - 1, & a(e_i) > b(e_i) \end{cases}$
- or, equivalently:
  $s(e_i) = \dfrac{b(e_i) - a(e_i)}{\max(a(e_i), b(e_i))}$
- $-1 \le s(e_i) \le +1$
Understanding s(e_i)
- s(e_i) ≈ 1: the within dissimilarity a(e_i) << b(e_i); e_i is well classified.
- s(e_i) ≈ 0: a(e_i) ≈ b(e_i); e_i may belong to either cluster.
- s(e_i) ≈ -1: the within dissimilarity a(e_i) >> b(e_i); e_i is misclassified and should belong to B.
Silhouette
- The silhouette of cluster A is the plot of all s(e_i), for e_i in A, ranked in decreasing order.
- The average of s(e_i) over all elements of the cluster is called the average silhouette.
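The three steps of the method can be followed by hand in a few lines of base MATLAB. This is a minimal sketch assuming Euclidean distance, with a hypothetical data set and labelling (`X`, `cidx`); it is not the implementation used by the toolbox `silhouette` function.

```matlab
X    = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];   % hypothetical elements, one per row
cidx = [1 1 1 2 2 2]';                   % hypothetical cluster labels
n = size(X, 1);
labels = unique(cidx);
s = zeros(n, 1);
for i = 1:n
    d = sqrt(sum((X - repmat(X(i, :), n, 1)).^2, 2));   % d(e_i, e_k) for every k
    own = (cidx == cidx(i));
    own(i) = false;                                      % "all other objects of A"
    a = mean(d(own));                                    % Method I (A is not a singleton)
    b = inf;
    for k = labels'
        if k ~= cidx(i)
            b = min(b, mean(d(cidx == k)));              % Method II
        end
    end
    s(i) = (b - a) / max(a, b);                          % Method III
end
for k = labels'
    fprintf('cluster %d: average silhouette %.3f\n', k, mean(s(cidx == k)));
end
```

Sorting `s` within each cluster in decreasing order and plotting it as a bar chart gives the silhouette of that cluster.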
Example of use I
    QTY = 100;
    % Two overlapping Gaussian groups centred at (0.5, 0.5) and (-0.5, -0.5)
    X = [randn(QTY,2) + 0.5*ones(QTY,2); randn(QTY,2) - 0.5*ones(QTY,2)];
    opts = statset('Display','final');
    % k-means with 2 clusters, city-block distance and 5 replicates
    [cidx, ctrs] = kmeans(X, 2, 'Distance','city', ...
                          'Replicates',5, 'Options',opts);
    figure;
    plot(X(cidx==1,1), X(cidx==1,2), 'r.', ...
         X(cidx==2,1), X(cidx==2,2), 'b.', ctrs(:,1), ctrs(:,2), 'kx');
    figure;
    % Silhouette plot computed with squared Euclidean distance
    [s, h] = silhouette(X, cidx, 'sqeuclid');
[Figure: Ex Silhouette 1]
[Figure: Ex Silhouette 2]
Example of use II
    QTY = 100;
    % Two well-separated Gaussian groups centred at (2, 2) and (-2, -2)
    X = [randn(QTY,2) + 2*ones(QTY,2); randn(QTY,2) - 2*ones(QTY,2)];
    opts = statset('Display','final');
    % k-means with 2 clusters, city-block distance and 5 replicates
    [cidx, ctrs] = kmeans(X, 2, 'Distance','city', ...
                          'Replicates',5, 'Options',opts);
    figure;
    plot(X(cidx==1,1), X(cidx==1,2), 'r.', ...
         X(cidx==2,1), X(cidx==2,2), 'b.', ctrs(:,1), ctrs(:,2), 'kx');
    figure;
    % Silhouette plot computed with squared Euclidean distance
    [s, h] = silhouette(X, cidx, 'sqeuclid');
[Figure: Ex Silhouette 3]
[Figure: Ex Silhouette 4]
Cluster Validity Partition Coefficient
Partition Coefficient
- This coefficient is defined as: [equation not preserved in the source]
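Assuming the coefficient meant here is Bezdek's partition coefficient, which agrees with the limiting values F = 1/c and F = 1 discussed in this section, it can be written as follows. The symbols u_ik (membership of element k in cluster i), c (number of clusters) and n (number of elements) are notation assumed for this sketch.

```latex
F = \frac{1}{n} \sum_{i=1}^{c} \sum_{k=1}^{n} u_{ik}^{2},
\qquad \sum_{i=1}^{c} u_{ik} = 1 \ \text{for every } k,
\qquad \frac{1}{c} \le F \le 1
```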
Partition Coefficient: Comments
- F tends to decrease as the number of clusters increases.
- F is therefore not appropriate for finding the best number of clusters.
- F is best suited to validating the best partition among those with the same number of clusters.
Partition Coefficient
- When F = 1/c the partition is entirely fuzzy, since every element belongs to all clusters with the same degree of membership.
- When F = 1 the partition is rigid (crisp) and membership values are either 1 or 0.
- This measure can only be applied to fuzzy partitions.
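The two limiting cases can be verified numerically. The sketch below assumes the usual form of the partition coefficient, F = (1/n) Σ_i Σ_k u_ik², and a c-by-n partition matrix whose columns sum to 1; the matrices are hypothetical.

```matlab
% U is a c-by-n fuzzy partition matrix: U(i,k) is the membership of element k
% in cluster i, and every column sums to 1 (hypothetical values).
partitionCoefficient = @(U) sum(U(:).^2) / size(U, 2);

Ufuzzy = ones(3, 4) / 3;                 % every element belongs equally to 3 clusters
Ucrisp = [1 0 0 1; 0 1 0 0; 0 0 1 0];    % memberships are only 0 or 1
Umixed = [0.8 0.1 0.3 0.6; 0.1 0.7 0.3 0.2; 0.1 0.2 0.4 0.2];

fprintf('entirely fuzzy: F = %.4f\n', partitionCoefficient(Ufuzzy));  % equals 1/c
fprintf('rigid:          F = %.4f\n', partitionCoefficient(Ucrisp));  % equals 1
fprintf('in between:     F = %.4f\n', partitionCoefficient(Umixed));
```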
Partition Coefficient: Example
- The partition matrix is: [matrix not preserved in the source; clusters w1, w2, w3]
Partition Coefficient: Example
- The partition matrix is: [matrix not preserved in the source; clusters w1, w2, w3, w4]
Partition Coefficient: Example
- The partition matrix is: [matrix not preserved in the source; elements X1, X2, X3, X4, X5, X6]
Cluster Validity Partition Entropy
Partition Entropy
- Partition Entropy is defined as: [equation not preserved in the source]
- When H = 0 the partition is rigid.
- When H = log(c) the fuzziness is maximum.
- 0 <= 1 - F <= H
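A definition consistent with the properties listed on this slide is Bezdek's partition entropy, $H = -\frac{1}{n}\sum_{i=1}^{c}\sum_{k=1}^{n} u_{ik}\log u_{ik}$, with the convention 0·log 0 = 0. The sketch below assumes that form, the natural logarithm, and a hypothetical c-by-n partition matrix `U`; it also checks the inequality 0 <= 1 - F <= H numerically.

```matlab
U = [0.8 0.1 0.3 0.6; 0.1 0.7 0.3 0.2; 0.1 0.2 0.4 0.2];   % hypothetical memberships
[c, n] = size(U);

F = sum(U(:).^2) / n;          % partition coefficient, for comparison

terms = U .* log(U);           % u_ik * log(u_ik), natural logarithm
terms(U == 0) = 0;             % convention: 0 * log(0) = 0
H = -sum(terms(:)) / n;        % partition entropy

fprintf('1 - F = %.4f, H = %.4f, log(c) = %.4f\n', 1 - F, H, log(c));
```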
Partition Entropy: Comments
- Partition Entropy (H) tends to increase with the number of clusters.
- H is more appropriate for validating the best partition among several runs of an algorithm.
- H is strictly a fuzzy measure.
Cluster Validity Compactness and Separation
Compactness and Separation
- CS is defined as: [equation not preserved in the source]
- J_m is the objective function minimized by the FCM algorithm.
- n is the number of elements.
- d_min is the minimum Euclidean distance between the centers of two clusters.
Compactness and Separation
- The minimum distance is defined as: [equation not preserved in the source]
- The complete formula is: [equation not preserved in the source]
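The terms described above match the Xie-Beni compactness and separation index. Assuming that is the measure meant here, $CS = J_m / (n \, d_{min}^2)$ with $J_m = \sum_i \sum_k u_{ik}^m \lVert x_k - v_i \rVert^2$ and $d_{min} = \min_{i \ne j} \lVert v_i - v_j \rVert$. The sketch below computes it for hypothetical data `X`, centers `V` and memberships `U`, none of which come from the slides.

```matlab
X = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];              % hypothetical elements, one per row
V = [1/3 1/3; 16/3 16/3];                        % hypothetical cluster centers
U = [0.9 0.9 0.9 0.1 0.1 0.1; ...
     0.1 0.1 0.1 0.9 0.9 0.9];                   % hypothetical memberships (c-by-n)
m = 2;                                           % FCM fuzziness exponent
[c, n] = size(U);

Jm = 0;                                          % FCM objective function
for i = 1:c
    for k = 1:n
        Jm = Jm + U(i, k)^m * sum((X(k, :) - V(i, :)).^2);
    end
end

dmin = inf;                                      % minimum distance between two centers
for i = 1:c-1
    for j = i+1:c
        dmin = min(dmin, norm(V(i, :) - V(j, :)));
    end
end

CS = Jm / (n * dmin^2);
fprintf('J_m = %.4f, d_min = %.4f, CS = %.4f\n', Jm, dmin, CS);
```

In the Xie-Beni formulation a smaller value indicates a more compact and better separated partition.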
Compactness and Separation
- This is a very complete validation measure.
- It validates the number of clusters and checks the separation among clusters.
- In our experiments it works well even when the degree of overlap is high.
Cluster Validity Fuzzy Linear Discriminant
Fisher Linear Discriminant
- Fisher's Linear Discriminant (FLD) is an important technique used in pattern recognition problems to evaluate the compactness and separation of the partitions produced by crisp clustering techniques.
Fisher Linear Discriminant
- Classification problems are easier to handle when the sampled data has few characteristics.
- It is therefore important to reduce the dimensionality of the problem.
- When FLD is applied to a crisply partitioned space, it produces an operator W that maps the original space R^p into a new space R^k, where k < p.
Fisher Linear Discriminant
[Figure: projection of samples arranged in 2 classes onto a line by Fisher's Linear Discriminant; axes x1 and x2, projection W]
FLD
- FLD measures the compactness and separation of all categories when crisp partitions are created.
- FLD uses two matrices:
  - S_B: between-classes scatter matrix
  - S_W: within-classes scatter matrix
FLD: S_B Matrix
- Measures the quality of the separation between classes.
FLD: S_B Matrix
- m is the average of all samples.
- m_i is the average of all samples belonging to cluster i.
- n is the number of samples.
- n_i is the number of samples belonging to cluster i.
FLD: S_W Matrix
- Measures the compactness of all classes.
- It is the sum of all internal scattering.
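With the symbols m, m_i, n and n_i defined above, the usual textbook forms of the two scatter matrices are given below. They are offered as the standard definitions, not as a transcription of the slides; x_j (a sample), C_i (the set of samples in cluster i) and c (the number of clusters) are notation assumed here.

```latex
\begin{aligned}
m   &= \frac{1}{n} \sum_{j=1}^{n} x_j, \qquad
m_i  = \frac{1}{n_i} \sum_{x_j \in C_i} x_j \\
S_B &= \sum_{i=1}^{c} n_i \, (m_i - m)(m_i - m)^{T} \\
S_W &= \sum_{i=1}^{c} \sum_{x_j \in C_i} (x_j - m_i)(x_j - m_i)^{T}
\end{aligned}
```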
Total Scattering
- The total scattering is the sum of the internal scattering and the scattering between the classes: S_T = S_W + S_B.
- In an optimal partition the separation between classes (S_B) must be maximum and the scattering within the classes (S_W) minimum.
J Criterion
- Fisher defined the J criterion, which must be maximized.
- A simplified way to evaluate J is: [equation not preserved in the source]
J: Comments
- J may vary in the interval 0 <= J <= ∞.
- J is strictly rigid: it applies only to crisp partitions.
- J loses precision as the overlap between samples increases.
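The scatter matrices and a scalar form of the criterion can be computed as follows. This is a sketch under stated assumptions: the data and labels are hypothetical, the scatter matrices use the usual textbook definitions, and J is taken as trace(S_B)/trace(S_W), one common scalar form that may differ from the expression on the original slide.

```matlab
X    = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];   % hypothetical samples, one per row
cidx = [1 1 1 2 2 2]';                   % hypothetical crisp labels
[n, p] = size(X);
m  = mean(X, 1);                         % average of all samples
SB = zeros(p);
SW = zeros(p);
for i = unique(cidx)'
    Xi = X(cidx == i, :);                % samples of cluster i
    ni = size(Xi, 1);
    mi = mean(Xi, 1);                    % average of cluster i
    SB = SB + ni * (mi - m)' * (mi - m); % between-classes scatter
    Xc = Xi - repmat(mi, ni, 1);
    SW = SW + Xc' * Xc;                  % within-classes scatter
end
X0 = X - repmat(m, n, 1);
ST = X0' * X0;                           % total scatter

J = trace(SB) / trace(SW);               % assumed scalar form of the criterion
fprintf('||S_T - (S_W + S_B)|| = %.2e\n', norm(ST - (SW + SB)));
fprintf('J = %.4f\n', J);
```

The first printed value confirms numerically that S_T = S_W + S_B for this data.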
EFLD
- EFLD measures the compactness and separation of all categories when fuzzy partitions are created.
- EFLD uses two matrices:
  - S_Be: between-classes scatter matrix
  - S_We: within-classes scatter matrix
EFLD: S_Be Matrix
- Measures the quality of the separation between classes.
EFLD: S_We Matrix
- Measures the compactness of all classes.
- It is the sum of all internal scattering.
Total Scattering
- The total scattering is the sum of the internal scattering and the scattering between the classes: S_Te = S_We + S_Be.
- In an optimal partition the separation between classes (S_Be) must be maximum and the scattering within the classes (S_We) minimum.
J_e Criterion
- J_e is the criterion that must be maximised.
- A simplified way to evaluate J_e is: [equation not preserved in the source]
Simplifying the J_e Criterion
- A simplified way to evaluate J_e: [equation not preserved in the source]
- It can be proved that S_T is constant and equal to: [equation not preserved in the source]
J_e: Comments
- J_e may vary in the interval 0 <= J_e <= ∞.
- J_e is strictly rigid.
- J_e loses precision as the overlap between samples increases.
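A sketch of the fuzzy version follows. The definitions used here are assumptions chosen so that S_Te = S_We + S_Be holds and the total is constant, as stated above: membership-weighted centres m_ei = Σ_j u_ij x_j / Σ_j u_ij, S_Be = Σ_i Σ_j u_ij (m_ei - m)(m_ei - m)^T and S_We = Σ_i Σ_j u_ij (x_j - m_ei)(x_j - m_ei)^T. When every column of U sums to 1, the total equals the crisp total scatter, which does not depend on U, so J_e can be evaluated as s_Be / (s_T - s_Be), where lower-case s denotes the trace. The data and memberships are hypothetical.

```matlab
X = [0 0; 0 1; 1 0; 5 5; 5 6; 6 5];        % hypothetical samples, one per row
U = [0.9 0.8 0.9 0.1 0.2 0.1; ...
     0.1 0.2 0.1 0.9 0.8 0.9];             % hypothetical memberships, columns sum to 1
[n, p] = size(X);
c  = size(U, 1);
m  = mean(X, 1);
X0 = X - repmat(m, n, 1);
sT = trace(X0' * X0);                      % total scatter, independent of U

SBe = zeros(p);
SWe = zeros(p);
for i = 1:c
    w   = U(i, :)';                        % memberships in cluster i
    mei = (w' * X) / sum(w);               % membership-weighted centre (assumed form)
    SBe = SBe + sum(w) * (mei - m)' * (mei - m);
    Xc  = X - repmat(mei, n, 1);
    SWe = SWe + Xc' * (Xc .* repmat(w, 1, p));
end

sBe = trace(SBe);
Je  = sBe / (sT - sBe);                    % only s_Be has to be recomputed per partition
fprintf('s_We + s_Be - s_T = %.2e\n', trace(SWe) + sBe - sT);
fprintf('J_e = %.4f\n', Je);
```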
Applying EFLD
EFLD by number of categories (column headings not preserved in the source):
  Samples X1    4.6815   4.9136   0.2943   0.2559   0.3157
  Samples X2    0.3271   0.8589   0.8757   0.9608   1.0674
Cluster Validity Inter Class Contrast
Comments: EFLD
- Increases as the number of clusters rises.
- Increases when classes have a high degree of overlapping.
- Reaches its maximum for a wrong number of clusters.
ICC
- Evaluates both crisp and fuzzy clustering algorithms.
- Measures:
  - partition compactness
  - partition separation
- ICC must be maximized.
ICC
- s_Be estimates the quality of the placement of the centres.
- 1/n is a scale factor: it compensates for the influence of the number of points on s_Be.
ICC - 2
- D_min is the minimum Euclidean distance between all pairs of centres.
- It neutralizes the tendency of s_Be to grow, which prevents the maximum from being reached for a number of clusters greater than the ideal value.
- When 2 or more clusters represent a single class, D_min decreases abruptly.
ICC Fuzzy Application
- Five classes with 500 points each.
- No class overlapping.
- X1: (1, 2), (6, 2), (1, 6), (6, 6), (3.5, 9); Std 0.3.
- Apply FCM for m = 2 and c = [values not preserved in the source]
[Figure: ICC Fuzzy Application, Results]
ICC Fuzzy Application: Time
Time by validity measure and number of categories (units not stated; "(lost)" marks a cell garbled in the source):
  Measure      5        4        3        2
  FPI        (lost)    .0049   0.0045   0.0061
  F          (lost)    .0049   0.0045   0.0044
  NFI        (lost)    .0058   0.0056   0.0061
  CS         0.0476   0.0382   0.0261   0.0226
  EFLDDet    2.0160   1.5510   1.1392   0.7800
  EFLDTra    1.8982   1.4780   1.0870   0.7678
  EFLD       (no values preserved)
  ICCDet     0.0132   0.0110   0.0088   0.0110
  ICCTra     0.0110   0.0088   0.0060   0.0078
  ICC        (lost)    .0082   0.0069   0.0061
Application with Overlapping
- Five classes with 500 points each.
- High cluster overlapping.
- X1: (1, 2), (6, 2), (1, 6), (6, 6), (3.5, 9); Std 0.3.
- Apply FCM for m = 2 and c = [values not preserved in the source]
[Figure: Application with Overlapping, Results]
Application with Overlapping: Time Results
Time by validity measure and number of clusters (units not stated; "(lost)" marks a cell garbled in the source):
  Measure      5        4        3        2
  MPE        (lost)    .0319   0.0271   0.0167
  F          0.0164   0.0061   0.0121   0.0112
  CS         (lost)    .0362   0.0283   0.0220
  EFLDDet    1.8450   1.6090   1.2580   0.9720
  EFLDTra    2.2584   1.7598   2.1038   0.7930
  EFLD       (no values preserved)
  ICCDet     0.0120   0.0110   0.0078   0.0110
  ICCTra     0.0110   0.0098   0.0060   0.0066
  ICC        (lost)    .0077   0.0064   0.0060
ICC: Conclusions
- Fast and efficient.
- Works with fuzzy and crisp partitions.
- Efficient even with highly overlapping clusters.
- High rate of correct results.