Clustering
Clustering Techniques
  Partitioning methods
  Hierarchical methods
  Density-based methods
  Grid-based methods
  Model-based methods
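Types of Data

Data matrix (n objects, p variables):
$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

Dissimilarity matrix (n by n, lower triangular):
$$\begin{bmatrix} 0 \\ d(2,1) & 0 \\ d(3,1) & d(3,2) & 0 \\ \vdots & \vdots & \vdots & \ddots \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}$$

d(i, j): difference or dissimilarity between objects i and j.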
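Interval-Scaled Variables

Standardization
  Mean absolute deviation s_f, where m_f is the mean of variable f:
  $$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f|\right)$$
  Standardized measurement:
  $$z_{if} = \frac{x_{if} - m_f}{s_f}$$

Compute dissimilarity between objects
  Euclidean distance:
  $$d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}$$
  Manhattan (city-block) distance:
  $$d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|$$
  Minkowski distance of order q (q ≥ 1; the order is distinct from the number of variables p):
  $$d(i, j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \cdots + |x_{ip} - x_{jp}|^q\right)^{1/q}$$
  Setting q = 1 gives the Manhattan distance and q = 2 the Euclidean distance.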
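The standardization and distance formulas on this slide can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function names `standardize` and `minkowski` and the sample values are assumptions made for the example, not part of the slides.

```python
def standardize(values):
    """Return z-scores of one variable using the mean absolute deviation s_f."""
    n = len(values)
    m_f = sum(values) / n                        # mean of the variable
    s_f = sum(abs(x - m_f) for x in values) / n  # mean absolute deviation
    return [(x - m_f) / s_f for x in values]


def minkowski(x, y, order):
    """Minkowski distance; order=1 is Manhattan, order=2 is Euclidean."""
    return sum(abs(a - b) ** order for a, b in zip(x, y)) ** (1.0 / order)


# Hypothetical sample data
z = standardize([2.0, 4.0, 6.0, 8.0])
manhattan = minkowski((0, 0), (3, 4), 1)
euclidean = minkowski((0, 0), (3, 4), 2)
```

The mean absolute deviation is used instead of the standard deviation because deviations are not squared, so an outlier inflates s_f less.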
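Binary Variables

There are only two states: 0 (absent) or 1 (present). Ex. smoker: yes or no.

Computing dissimilarity between binary variables
Contingency table (if all attributes have the same weight):

                 Object j
                 1      0      Sum
Object i   1     q      r      q+r
           0     s      t      s+t
           Sum   q+s    r+t    p

Here q counts the attributes equal to 1 for both objects, r those equal to 1 for i and 0 for j, s those equal to 0 for i and 1 for j, and t those equal to 0 for both.

Asymmetric attributes:
$$d(i, j) = \frac{r + s}{q + r + s}$$
Symmetric attributes:
$$d(i, j) = \frac{r + s}{q + r + s + t}$$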
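As an illustration of the contingency-table counts, the sketch below computes both forms of the binary dissimilarity. The function name `binary_dissimilarity`, the `asymmetric` flag, and the two sample objects are assumptions made for the example.

```python
def binary_dissimilarity(x, y, asymmetric=False):
    """Dissimilarity of two 0/1 vectors from the q, r, s, t counts."""
    q = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    # Asymmetric attributes ignore the 0/0 matches counted by t.
    denominator = q + r + s if asymmetric else q + r + s + t
    return (r + s) / denominator if denominator else 0.0


# Hypothetical objects described by six binary attributes
obj_i = [1, 0, 1, 0, 0, 0]
obj_j = [1, 0, 1, 0, 1, 0]
d_symmetric = binary_dissimilarity(obj_i, obj_j)
d_asymmetric = binary_dissimilarity(obj_i, obj_j, asymmetric=True)
```

For these two objects q = 2, r = 0, s = 1, and t = 3, so the symmetric value is 1/6 and the asymmetric value is 1/3.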
Nominal and Ordinal Variables

Nominal Variables
A generalization of the binary variable that can take more than two states. Ex. color: red, green, blue.
$$d(i, j) = \frac{p - m}{p}$$
  m: number of matches
  p: total number of attributes
Weights can be used: assign greater weight to the matches in variables having a larger number of states.

Ordinal Variables
Resemble nominal variables except that the states are ordered in a meaningful sequence. Ex. medal: gold, silver, bronze.
The value of f for the ith object is x_{if}, and f has M_f ordered states, representing the ranking 1, …, M_f.
Replace each x_{if} by its corresponding rank $r_{if} \in \{1, \ldots, M_f\}$.
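A short sketch of both ideas follows: the mismatch ratio for nominal attributes and the replacement of an ordinal value by its rank. The function names, the sample attribute values, and the ordering bronze < silver < gold are assumptions made for the example.

```python
def nominal_dissimilarity(x, y):
    """(p - m) / p, where m is the number of matching attributes."""
    p = len(x)
    m = sum(1 for a, b in zip(x, y) if a == b)
    return (p - m) / p


def to_rank(value, ordered_states):
    """Replace an ordinal value by its rank in 1..M_f."""
    return ordered_states.index(value) + 1


# Hypothetical objects with three nominal attributes
d_nominal = nominal_dissimilarity(
    ["red", "round", "small"], ["red", "square", "small"]
)

# Assumed ordering of the medal states, lowest first
medals = ["bronze", "silver", "gold"]
rank_silver = to_rank("silver", medals)
```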
Variables of Mixed Types

$$d(i, j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

where the indicator $\delta_{ij}^{(f)} = 0$ if either x_{if} or x_{jf} is missing, or if x_{if} = x_{jf} = 0 and variable f is asymmetric binary; otherwise $\delta_{ij}^{(f)} = 1$.

The contribution of variable f to the dissimilarity depends on its type:
  If f is binary or nominal: $d_{ij}^{(f)} = 0$ if x_{if} = x_{jf}; otherwise $d_{ij}^{(f)} = 1$.
  If f is interval-based: $d_{ij}^{(f)} = \frac{|x_{if} - x_{jf}|}{\max_h x_{hf} - \min_h x_{hf}}$, where h runs over all non-missing objects for variable f.
  If f is ordinal or ratio-scaled: compute the ranks r_{if} and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat z_{if} as interval-scaled.
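The combined formula can be sketched as a single loop over the variables. In this illustration the names `mixed_dissimilarity`, `kinds`, and `ranges` are hypothetical, missing values are represented by `None`, and ordinal variables are assumed to have been converted to z_{if} beforehand so that they can be passed as interval variables with range 1.

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Weighted average of per-variable dissimilarities.

    kinds[f] is "nominal", "binary", "asymmetric", or "interval".
    ranges[f] is max - min of variable f (needed for interval variables).
    """
    numerator = 0.0
    denominator = 0.0
    for f, kind in enumerate(kinds):
        a, b = x[f], y[f]
        if a is None or b is None:
            continue  # indicator is 0: a value is missing
        if kind == "asymmetric" and a == 0 and b == 0:
            continue  # indicator is 0: 0/0 match on an asymmetric binary
        if kind == "interval":
            d_f = abs(a - b) / ranges[f]
        else:  # binary or nominal
            d_f = 0.0 if a == b else 1.0
        numerator += d_f
        denominator += 1.0
    return numerator / denominator if denominator else 0.0


# Hypothetical objects: one nominal, one asymmetric binary, one interval variable
kinds = ["nominal", "asymmetric", "interval"]
ranges = {2: 40.0}
d_mixed = mixed_dissimilarity(["red", 0, 10.0], ["blue", 0, 30.0], kinds, ranges)
```

Here the nominal variable contributes 1, the asymmetric binary variable is skipped, and the interval variable contributes 20/40, giving (1 + 0.5) / 2 = 0.75.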
Clustering Methods
  1. Partitioning (k: number of clusters)
  2. Hierarchical (hierarchical decomposition of objects)

TV-trees of order k
Given: a set of N vectors.
Goal: divide these points into at most k disjoint clusters so that the points in each cluster are similar with respect to a maximal number of coordinates (called active dimensions).
TV-tree of order 2 (two clusters per node)
Procedure: divide the set of N points into 2 clusters, maximizing the total number of active dimensions. Repeat the same procedure for each cluster.

Density-based methods
Can find clusters of arbitrary shape.
A given cluster keeps growing as long as the density in its neighborhood exceeds some threshold: for each point, the neighborhood of a given radius must contain at least a minimum number of points.
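The density criterion above can be illustrated with a simplified region-growing sketch. This is not a complete density-based algorithm; the function name `grow_cluster`, the parameters `radius` and `min_pts`, and the sample points are assumptions made for the example. A point's neighborhood is taken to include the point itself.

```python
import math


def grow_cluster(points, seed, radius, min_pts):
    """Grow a cluster of point indices outward from the index `seed`."""

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= radius]

    cluster = set()
    frontier = [seed]
    while frontier:
        i = frontier.pop()
        if i in cluster:
            continue
        cluster.add(i)
        near = neighbors(i)
        # Only points whose neighborhood is dense enough extend the cluster.
        if len(near) >= min_pts:
            frontier.extend(j for j in near if j not in cluster)
    return cluster


# Hypothetical points: a chain of four close points and one distant point
points = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0)]
cluster = grow_cluster(points, seed=0, radius=1.0, min_pts=2)
```

Because growth follows chains of dense neighborhoods rather than distance to a center, the resulting cluster can take an arbitrary shape.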
Partitioning methods

1. K-means method (n objects into k clusters)
Cluster similarity is measured with respect to the mean value of the objects in a cluster (the cluster's center of gravity).
  Select k points at random (call them means).
  Assign each object to the nearest mean.
  Compute the new mean for each cluster.
  Repeat until the criterion function converges.
Squared error criterion (the quantity we try to minimize), where m_i is the mean of cluster C_i:
$$E = \sum_{i=1}^{k} \sum_{p \in C_i} |p - m_i|^2$$
This method is sensitive to outliers.

2. K-medoids method
Instead of the mean, take a medoid (the most centrally located object in a cluster).
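The k-means steps listed above can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function name `k_means`, the `max_iter` and `seed` parameters, and the sample points are assumptions, and convergence is detected here by the means no longer changing.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Return (means, clusters, squared_error) for a list of point tuples."""
    rng = random.Random(seed)
    means = rng.sample(points, k)  # select k points at random as initial means
    clusters = []
    for _ in range(max_iter):
        # Assign each object to the nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, means[i]))
            clusters[nearest].append(p)
        # Compute the new mean of each cluster (an empty cluster keeps its mean).
        new_means = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else means[i]
            for i, cl in enumerate(clusters)
        ]
        if new_means == means:
            break  # converged: the assignment can no longer change
        means = new_means
    # Squared error criterion E for the final means and clusters.
    error = sum(
        math.dist(p, means[i]) ** 2 for i, cl in enumerate(clusters) for p in cl
    )
    return means, clusters, error


# Hypothetical data: two well-separated pairs of points
data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
means, clusters, error = k_means(data, k=2)
```

Replacing the mean update with a search for the object that minimizes the total distance to the other cluster members would turn this sketch into a k-medoids variant.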
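Hierarchical Methods

Agglomerative hierarchical clustering (bottom-up strategy)
Each object is placed in a separate cluster, and these clusters are then merged until certain termination conditions are satisfied.

Divisive hierarchical clustering (top-down strategy)
All objects start in one cluster, which is split into smaller clusters until certain termination conditions are satisfied.

Distance between clusters (m_i is the mean of cluster C_i and n_i its number of objects):
  Minimum distance: $d_{min}(C_i, C_j) = \min_{p \in C_i,\, p' \in C_j} |p - p'|$
  Maximum distance: $d_{max}(C_i, C_j) = \max_{p \in C_i,\, p' \in C_j} |p - p'|$
  Mean distance: $d_{mean}(C_i, C_j) = |m_i - m_j|$
  Average distance: $d_{avg}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{p' \in C_j} |p - p'|$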
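The four inter-cluster distances can be compared on a small example. The sketch below assumes Euclidean distance for |p - p'|; the function names and the two sample clusters are hypothetical.

```python
import math
from itertools import product


def d_min(ci, cj):
    return min(math.dist(p, q) for p, q in product(ci, cj))


def d_max(ci, cj):
    return max(math.dist(p, q) for p, q in product(ci, cj))


def cluster_mean(c):
    return tuple(sum(coords) / len(c) for coords in zip(*c))


def d_mean(ci, cj):
    return math.dist(cluster_mean(ci), cluster_mean(cj))


def d_avg(ci, cj):
    return sum(math.dist(p, q) for p, q in product(ci, cj)) / (len(ci) * len(cj))


# Hypothetical clusters on a line
c1 = [(0, 0), (1, 0)]
c2 = [(3, 0), (5, 0)]
distances = (d_min(c1, c2), d_max(c1, c2), d_mean(c1, c2), d_avg(c1, c2))
```

For these clusters the minimum distance is 2, the maximum is 5, and both the mean and average distances are 3.5; in general the last two differ.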
Distance Between Clusters

Cluster: $K_m = \{t_{m1}, \ldots, t_{mN}\}$
Centroid: $C_m = \frac{\sum_{i=1}^{N} t_{mi}}{N}$
Radius: $R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{mi} - C_m)^2}{N}}$
Diameter: $D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{mi} - t_{mj})^2}{N(N-1)}}$

Single link (smallest distance): $Dis(K_i, K_j) = \min\{Dis(t_i, t_j) : t_i \in K_i,\ t_j \in K_j\}$
Complete link (largest distance): $Dis(K_i, K_j) = \max\{Dis(t_i, t_j) : t_i \in K_i,\ t_j \in K_j\}$
Average: $Dis(K_i, K_j) = \mathrm{mean}\{Dis(t_i, t_j) : t_i \in K_i,\ t_j \in K_j\}$
Centroid distance: $Dis(K_i, K_j) = Dis(C_i, C_j)$
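The centroid, radius, and diameter of a single cluster can be computed directly from their definitions. The sketch below assumes the square-root form of the radius and diameter and Euclidean distance between items; the function names and the sample cluster are hypothetical.

```python
import math


def centroid(cluster):
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))


def radius(cluster):
    """Root of the mean squared distance from the items to the centroid."""
    c = centroid(cluster)
    return math.sqrt(sum(math.dist(t, c) ** 2 for t in cluster) / len(cluster))


def diameter(cluster):
    """Root of the mean squared distance over all ordered pairs of items."""
    n = len(cluster)
    total = sum(math.dist(a, b) ** 2 for a in cluster for b in cluster)
    return math.sqrt(total / (n * (n - 1)))


# Hypothetical cluster of two points
cluster = [(0.0, 0.0), (2.0, 0.0)]
summary = (centroid(cluster), radius(cluster), diameter(cluster))
```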
Hierarchical Algorithms

Single Link Technique (find maximal connected components in a graph)

[Figure: example graph on points A, B, C, D, E with edge distances; the graphs obtained at distance thresholds 1 and 2; and the resulting dendrogram with threshold levels 1, 2, 3.]
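The single link idea, taking the connected components of the graph whose edges are the distances at or below a threshold, can be sketched as follows. The function name `single_link`, the point positions, and the thresholds are assumptions made for the example and do not reproduce the distances in the slide's figure.

```python
def single_link(items, dist, threshold):
    """Clusters are the connected components of the graph with edges <= threshold."""
    unvisited = set(items)
    clusters = []
    while unvisited:
        start = unvisited.pop()
        component = {start}
        stack = [start]
        while stack:
            a = stack.pop()
            linked = {b for b in unvisited if dist(a, b) <= threshold}
            unvisited -= linked
            component |= linked
            stack.extend(linked)
        clusters.append(component)
    return clusters


# Hypothetical positions on a line
position = {"A": 0, "B": 1, "C": 3, "D": 4, "E": 8}


def line_distance(a, b):
    return abs(position[a] - position[b])


level_1 = single_link(position, line_distance, threshold=1)
level_2 = single_link(position, line_distance, threshold=2)
```

Raising the threshold step by step and recording the components at each level yields the sequence of merges shown in a dendrogram.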
Complete Link Technique (looks for cliques: maximal subgraphs in which there is an edge between any two vertices)

[Figure: example graph on points A, B, C, D, E at distance thresholds 1 and 3, and the resulting dendrogram with leaf order E, A, B, D, C and threshold levels 1, 3, 5.]

Clusters at each threshold level:
  (5, {EABCD})
  (3, {EAB}, {DC})
  (1, {AB}, {DC}, {E})
  (0, {E}, {A}, {B}, {C}, {D})
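A complete link clustering can also be obtained agglomeratively, by repeatedly merging the two clusters whose largest pairwise distance is smallest. The sketch below is a simple illustration of that approach; the function name `complete_link`, the `stop_distance` parameter, and the sample points are assumptions, not the data from the figure.

```python
import math
from itertools import combinations


def complete_link(points, stop_distance):
    """Merge clusters while the complete-link distance stays <= stop_distance."""
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        # Complete-link distance: the largest distance between members.
        (i, j), d = min(
            (
                ((i, j), max(math.dist(p, q) for p in clusters[i] for q in clusters[j]))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda pair: pair[1],
        )
        if d > stop_distance:
            break
        clusters[i] += clusters[j]  # i < j, so deleting j keeps i valid
        del clusters[j]
    return clusters


# Hypothetical one-dimensional points
pts = [(0,), (1,), (3,), (4,), (8,)]
at_1 = complete_link(pts, stop_distance=1)
at_4 = complete_link(pts, stop_distance=4)
```

With these points the two pairs only merge once the threshold reaches 4, the largest distance between their members, whereas a single link criterion would join them at distance 2.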
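Partitioning Algorithms

Minimum Spanning Tree (MST)
Given:
  n: number of points
  k: number of clusters
Algorithm:
  Start with the complete graph on the n points and build its minimum spanning tree.
  Remove the largest inconsistent edge (an edge whose weight is much larger than the average weight of all adjacent edges).
  Repeat until k clusters remain.

[Figure: example with edge weights 5, 6, 10, 12, and 80.]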
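The sketch below illustrates MST-based partitioning under a simplifying assumption: instead of testing each edge for inconsistency against its adjacent edges, it removes the k - 1 heaviest edges of the tree. The function names `mst_edges` and `mst_clusters` and the sample points are hypothetical, and the tree is built with Prim's algorithm.

```python
import math


def mst_edges(points):
    """Prim's algorithm on the complete graph; returns (weight, i, j) tuples."""
    n = len(points)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        w, i, j = min(
            (math.dist(points[i], points[j]), i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        in_tree.add(j)
        edges.append((w, i, j))
    return edges


def mst_clusters(points, k):
    """Drop the k - 1 heaviest MST edges (a simplification) and return index groups."""
    kept = sorted(mst_edges(points))[: len(points) - k]
    label = list(range(len(points)))
    for _, i, j in kept:
        old, new = label[j], label[i]
        label = [new if current == old else current for current in label]
    groups = {}
    for index, current in enumerate(label):
        groups.setdefault(current, []).append(index)
    return list(groups.values())


# Hypothetical points: two groups separated by one long edge
pts = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
groups = mst_clusters(pts, k=2)
```

A fuller version would compare each edge with the average weight of its adjacent edges, as the slide describes, so that a long edge inside a sparse region is not cut by mistake.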
Squared Error

Cluster: $K_i = \{t_{i1}, \ldots, t_{iN}\}$
Center of cluster: C_i
Squared error of a cluster: $SE_{K_i} = \sum_{j=1}^{N} \|t_{ij} - C_i\|^2$
Collection of clusters: $K = \{K_1, \ldots, K_k\}$
Squared error for K: $SE_K = \sum_{i=1}^{k} SE_{K_i}$

Given:
  k: number of clusters
  threshold
Algorithm:
  Choose k points randomly (called centers).
  Repeat
    Assign each item to the cluster that has the closest center.
    Calculate the new center for each cluster.
    Calculate the squared error.
  Until the difference between the old error and the new one is below the specified threshold.
CURE (Clustering Using Representatives)

Idea: handle clusters of different shapes.
Algorithm:
  A constant number of points is chosen from each cluster.
  These points are shrunk toward the cluster's centroid.
  The clusters with the closest pair of representative points are merged.
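The representative-point idea can be sketched as below. This is only an illustration of the three steps, not a full CURE implementation: the function names, the shrink factor `alpha`, and the farthest-point rule used to choose scattered points are assumptions, and the cluster is assumed to contain distinct points.

```python
import math


def representatives(cluster, count, alpha):
    """Choose `count` scattered points and shrink them toward the centroid by alpha."""
    center = tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
    # Assumed selection rule: start with the point farthest from the centroid,
    # then repeatedly add the point farthest from those already chosen.
    chosen = [max(cluster, key=lambda p: math.dist(p, center))]
    while len(chosen) < min(count, len(cluster)):
        chosen.append(
            max(
                (p for p in cluster if p not in chosen),
                key=lambda p: min(math.dist(p, r) for r in chosen),
            )
        )
    # Shrinking: move each chosen point a fraction alpha of the way to the centroid.
    return [tuple(x + alpha * (m - x) for x, m in zip(p, center)) for p in chosen]


def representative_distance(reps_a, reps_b):
    """Distance between clusters: the closest pair of representative points."""
    return min(math.dist(p, q) for p in reps_a for q in reps_b)


# Hypothetical clusters
reps_a = representatives([(0.0, 0.0), (4.0, 0.0), (2.0, 0.0)], count=2, alpha=0.5)
reps_b = representatives([(6.0, 0.0)], count=2, alpha=0.5)
gap = representative_distance(reps_a, reps_b)
```

At each merge step, the pair of clusters with the smallest `representative_distance` would be merged and new representatives chosen for the result. Shrinking reduces the influence of outliers on that distance.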
Examples related to clustering