Cluster Analysis
Adriano Joaquim de O Cruz ©2002 NCE/UFRJ adriano@nce.ufrj.br
What is cluster analysis?
- The process of grouping a set of physical or abstract objects into classes of similar objects.
- The class label of each class is unknown.
- Classification, in contrast, separates objects into classes whose labels are known.

What is cluster analysis? cont.
- Clustering is a form of learning by observation.
- Neural networks learn by examples.
- Clustering is unsupervised learning.
Applications
- In business, it helps to discover distinct groups of customers.
- In data mining, it is used to gain insight into the distribution of data and to observe the characteristics of each cluster.
- Pre-processing step for classification.
- Pattern recognition.

Requirements
- Scalability: work with large databases.
- Ability to deal with different types of attributes (not only interval-based data).
- Clusters of arbitrary shape, not only spherical.
- Minimal requirements for domain knowledge.
- Ability to deal with noisy data.

Requirements cont.
- Insensitivity to the order of input records.
- Work with samples of high dimensionality.
- Constraint-based clustering.
- Interpretability and usability: results should be easily interpretable.
Heuristic Clustering Techniques
- Incomplete or heuristic clustering: geometrical methods or projection techniques.
- Dimension-reduction techniques (e.g. PCA) are used to obtain a graphical representation in two or three dimensions.
- Heuristic methods based on visualisation are then used to determine the clusters.

Deterministic Crisp Clustering
- Each datum is assigned to exactly one cluster.
- Each cluster partition defines an ordinary partition of the data set.

Overlapping Crisp Clustering
- Each datum is assigned to at least one cluster.
- Elements may belong to more than one cluster, possibly to various degrees.
Probabilistic Clustering
- For each element, a probability distribution over the clusters is determined.
- The distribution specifies the probability with which a datum is assigned to a cluster.
- If the probabilities are interpreted as degrees of membership, then these are fuzzy clustering techniques.

Possibilistic Clustering
- Degrees of membership or possibility indicate to what extent a datum belongs to the clusters.
- Possibilistic cluster analysis drops the constraint that the sum of memberships of each datum over all clusters is equal to one.

Hierarchical Clustering
- Descending techniques: they divide the data into more fine-grained classes.
- Ascending techniques: they combine small classes into more coarse-grained ones.

Objective Function Clustering
- An objective function assigns to each cluster partition a value that has to be optimised.
- This is strictly an optimisation problem.
Data Matrices
- Data matrix: represents n objects with p characteristics. Ex. person = {age, sex, income, ...}
- Dissimilarity matrix: represents a collection of dissimilarities between all pairs of objects.

Dissimilarities
- Dissimilarity measures some form of distance between objects.
- Clustering algorithms use dissimilarities to cluster data.
- How can dissimilarities be measured?

Data Types
- Interval-scaled variables are continuous measurements on a linear scale. Ex. height, weight, temperature.
- Binary variables have only two states. Ex. smoker, fever, client, owner.
- Nominal variables are a generalisation of binary variables with m states. Ex. map colour, marital status.

Data Types cont.
- Ordinal variables are ordered nominal variables. Ex. Olympic medals, professional ranks.
- Ratio-scaled variables have a non-linear scale. Ex. growth of a bacterial population.
Interval-scaled variables
- Interval-scaled variables are continuous measurements on a linear scale. Ex. height, weight, temperature.
- Interval-scaled variables depend on the units used.
- The measurement unit can affect the analysis, so standardisation may be used.

Standardisation
- Converting original measurements to unitless values.
- Attempts to give all variables equal weight.
- Useful when there is no prior knowledge of the data.

Standardisation algorithm
- Z-scores indicate how far, and in what direction, an item deviates from its distribution's mean, expressed in units of its distribution's standard deviation.
- The transformed scores have a mean of zero and a standard deviation of one.
- Useful when comparing the relative standings of items from distributions with different means and/or different standard deviations.

Standardisation algorithm
- Consider n values of a variable x.
- Calculate the mean value: mean(x) = (1/n) Σ_i x_i.
- Calculate the standard deviation: s = sqrt((1/n) Σ_i (x_i - mean(x))²).
- Calculate the z-score: z_i = (x_i - mean(x)) / s.
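A minimal sketch of these steps in Python (assuming NumPy is available; the function name zscore and the sample data are illustrative, not from the slides):

import numpy as np

def zscore(x):
    """Standardise a 1-D array of interval-scaled values.

    Returns unitless values with mean 0 and standard deviation 1.
    """
    x = np.asarray(x, dtype=float)
    mean = x.mean()                # mean value of the variable
    std = x.std()                  # standard deviation (population form)
    return (x - mean) / std

# Example: heights in cm and ages in years become directly comparable.
heights = [150.0, 160.0, 170.0, 180.0]
ages = [10.0, 20.0, 30.0, 40.0]
print(zscore(heights))             # [-1.34 -0.45  0.45  1.34]
print(zscore(ages))                # same z-scores: identical relative standings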
Z-scores example (figure)

Real heights and ages charts (figure)

Z-scores for heights and ages (figure)

Data chart (figure)
How to calculate dissimilarities?
- The most popular methods are based on the distance between pairs of objects.
- Minkowski distance: d(i,j) = (Σ_{f=1..p} |x_if - x_jf|^q)^(1/q)
- p is the number of characteristics.
- q is the distance type.
- q = 2 gives the Euclidean distance, q = 1 the Manhattan distance.
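A small sketch of the Minkowski distance in Python (assuming NumPy; minkowski is an illustrative name):

import numpy as np

def minkowski(xi, xj, q=2):
    """Minkowski distance between two objects with p characteristics.

    q = 2 gives the Euclidean distance, q = 1 the Manhattan distance.
    """
    diff = np.abs(np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float))
    return (diff ** q).sum() ** (1.0 / q)

a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski(a, b, q=2))   # 5.0  (Euclidean)
print(minkowski(a, b, q=1))   # 7.0  (Manhattan)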
Distances (figure)

Binary Variables
- Binary variables have only two states.
- States can be symmetric or asymmetric.
- Binary variables are symmetric if both states are equally valuable. Ex. gender.
- When the states are not equally important the variable is asymmetric. Ex. disease tests (1 = positive; 0 = negative).

Contingency tables
- Consider two objects i and j described by p binary variables.
- a = number of variables equal to 1 for both objects; b = number equal to 1 for i and 0 for j; c = number equal to 0 for i and 1 for j; d = number equal to 0 for both. a + b + c + d = p.

Symmetric Variables
- Similarity based on symmetric variables is invariant (it does not change if the coding of the two states is swapped).
- Simple matching coefficient: d(i,j) = (b + c) / (a + b + c + d)

Asymmetric Variables
- Similarity based on asymmetric variables is not invariant.
- Two ones are more important than two zeros.
- Jaccard coefficient: d(i,j) = (b + c) / (a + b + c)
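A sketch of both coefficients in Python (the function name and the two patient vectors are made-up illustrations, not the slide's table):

def binary_dissimilarity(x, y, asymmetric=False):
    """Simple matching (symmetric) or Jaccard (asymmetric) dissimilarity.

    x, y: sequences of 0/1 values describing two objects.
    """
    a = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    b = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 0)
    c = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 1)
    d = sum(1 for xi, yi in zip(x, y) if xi == 0 and yi == 0)
    if asymmetric:
        return (b + c) / (a + b + c)       # Jaccard: ignores the 0-0 matches
    return (b + c) / (a + b + c + d)       # simple matching coefficient

# Two patients described by the results of six tests (1 = positive).
patient_1 = [1, 0, 1, 0, 0, 0]
patient_2 = [1, 0, 0, 0, 0, 0]
print(binary_dissimilarity(patient_1, patient_2, asymmetric=True))   # 0.5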
Computing dissimilarities (figure)

Computing dissimilarities
- Jim and Mary have the highest dissimilarity value, so they have a low probability of having the same disease.

Nominal Variables
- A nominal variable is a generalisation of the binary variable.
- A nominal variable can take more than two states.
- Ex. marital status: married, single, divorced.
- Each state can be represented by a number or a letter.
- There is no specific ordering.

Computing dissimilarities
- Consider two objects i and j, described by nominal variables.
- Each object has p characteristics.
- m is the number of matches (variables for which i and j are in the same state).
- d(i,j) = (p - m) / p
Binarising nominal variables
- A nominal variable can be encoded by creating a new binary variable for each state.
- Example:
- Marital status = {married, single, divorced}
- Married: 1 = yes, 0 = no
- Single: 1 = yes, 0 = no
- Divorced: 1 = yes, 0 = no
- Ex. marital status = {married}: married = 1, single = 0, divorced = 0.

Ordinal variables
- A discrete ordinal variable is similar to a nominal variable, except that the states are ordered in a meaningful sequence.
- Ex. bronze, silver and gold medals.
- Ex. assistant, associate, full member.

Computing dissimilarities
- Consider n objects defined by a set of ordinal variables.
- f is one of these ordinal variables and has M_f states. These states define the ranking r_if ∈ {1, …, M_f}.

Steps to calculate dissimilarities
- Assume that the value of f for the ith object is x_if. Replace each x_if by its corresponding rank r_if ∈ {1, …, M_f}.
- Since the number of states of each variable differs, it is often necessary to map the range onto [0.0, 1.0] using z_if = (r_if - 1) / (M_f - 1).
- Dissimilarity can then be computed using the distance measures for interval-scaled variables.
Ratio-scaled variables
- Variables on a non-linear scale, such as an exponential scale.
- To compute dissimilarities there are three methods:
- Treat them as interval-scaled. Not always good.
- Apply a transformation such as y = log(x) and treat the result as interval-scaled.
- Treat them as ordinal data and treat the ranks as interval-scaled.

Variables of mixed types
- One technique is to bring all variables onto a common scale in the interval [0.0, 1.0].
- Suppose that the data set contains p variables of mixed type.

Variables of mixed types
- The dissimilarity between i and j is d(i,j) = Σ_{f=1..p} δ_ij^(f) d_ij^(f) / Σ_{f=1..p} δ_ij^(f), where δ_ij^(f) indicates whether variable f can be used for the pair (it is 0 when x_if or x_jf is missing) and d_ij^(f) is the contribution of variable f.

Variables of mixed types cont.
- The contribution d_ij^(f) of each variable depends on its type:
- f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, and 1 otherwise.
- f is interval-based: d_ij^(f) = |x_if - x_jf| / (max_h x_hf - min_h x_hf).
- f is ordinal or ratio-scaled: compute the ranks and treat them as interval-based.
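A rough sketch of this weighted combination in Python; the function name, the dictionary-free argument layout and the example objects are illustrative assumptions, not the slides' notation:

def mixed_dissimilarity(xi, xj, var_types, value_ranges):
    """Dissimilarity for objects described by variables of mixed types.

    var_types[f]    : 'binary', 'nominal', 'interval' or 'ordinal'
    value_ranges[f] : (min, max) for interval variables, number of states M_f
                      for ordinal variables (ordinal values are ranks 1..M_f).
    """
    num, den = 0.0, 0.0
    for f, kind in enumerate(var_types):
        a, b = xi[f], xj[f]
        if a is None or b is None:          # delta_ij^(f) = 0: skip missing values
            continue
        if kind in ('binary', 'nominal'):
            d = 0.0 if a == b else 1.0
        elif kind == 'interval':
            lo, hi = value_ranges[f]
            d = abs(a - b) / (hi - lo)
        else:                                # ordinal: map ranks onto [0, 1] first
            M = value_ranges[f]
            za, zb = (a - 1) / (M - 1), (b - 1) / (M - 1)
            d = abs(za - zb)
        num += d
        den += 1.0
    return num / den

# Illustrative objects: (smoker, colour, height in cm, medal rank out of 3)
types = ['binary', 'nominal', 'interval', 'ordinal']
ranges = [None, None, (150.0, 200.0), 3]
print(mixed_dissimilarity([1, 'red', 180.0, 3], [0, 'red', 170.0, 1], types, ranges))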
Clustering Methods
- Partitioning
- Hierarchical
- Density-based
- Grid-based
- Model-based

Partitioning Methods
- Given n objects, create k partitions.
- Each partition must contain at least one element.
- An iterative relocation technique is used to improve the partitioning.
- Distance is the usual criterion.

Partitioning Methods cont.
- They work well for finding spherical-shaped clusters.
- They are not efficient on very large databases.
- K-means: each cluster is represented by the mean value of the objects in the cluster.
- K-medoids: each cluster is represented by an object near the centre of the cluster.

Hierarchical Methods
- Create a hierarchical decomposition of the set.
- Agglomerative approaches start with each object forming a separate group.
- Objects or groups are merged until all objects belong to one group or a termination condition occurs.
- Divisive approaches start with all objects in the same cluster.
- Each successive iteration splits a cluster until all objects are in separate clusters or a termination condition occurs.

Density-based methods
- A cluster keeps growing as long as the density (number of objects) in its neighbourhood exceeds some threshold.
- Able to find clusters of arbitrary shape.

Grid-based methods
- Grid methods divide the object space into a finite number of cells, forming a grid-like structure.
- Fast processing time, independent of the number of objects.

Model-based methods
- Model-based methods hypothesise a model for each cluster and find the best fit of the data to the given model.
Partition methods
- Given a database of n objects, a partition method organises them into k clusters (k ≤ n).
- The methods try to minimise an objective function, such as a distance-based criterion.
- Similar objects are "close" to each other.

K-means algorithm
- Based on the Euclidean distances among elements of the cluster.
- The centre of a cluster is the mean value of the objects in the cluster.
- Classifies objects in a hard way: each object belongs to a single cluster.

K-means algorithm
- Consider n objects X = {x_1, x_2, ..., x_n} and k clusters.
- Each object x_i is defined by l characteristics, x_i = (x_i1, x_i2, ..., x_il).
- Consider A a set of k clusters, A = {A_1, A_2, ..., A_k}.

K-means properties
- The union of all clusters makes the universe.
- No element belongs to more than one cluster.
- There is no empty cluster.

Membership function
- The membership (characteristic) function takes only the values 0 or 1: it is 1 if x_e belongs to cluster A_i and 0 otherwise.

Membership matrix U
- Matrix containing the value of inclusion of each element into each cluster (0 or 1).
- The matrix has k lines (clusters) and n columns (elements).
- The sum of each column must be equal to one: an element belongs to exactly one cluster.
- The sum of each line must be greater than 0 and less than n: no empty cluster and no cluster containing all elements.
Matrix examples
- (Figure: two example membership matrices over elements X1 … X6.)
- Two examples of clustering. What do the clusters represent?

Matrix examples cont.
- (Figure: membership matrices over elements X1 … X6.)
- U1 and U2 are the same matrices.

How many clusters?
- The cardinality of any hard k-partition of n elements (the number of distinct ways of grouping the n elements into k non-empty clusters) is |M_k| = (1/k!) Σ_{j=1..k} (k choose j) (-1)^(k-j) j^n.

How many clusters (example)?
- Consider the matrix U2 (k = 3, n = 6): the formula gives 90 distinct hard 3-partitions of the 6 elements.
K-means inputs and outputs
- Inputs: the number of clusters k and a database containing n objects with l characteristics each.
- Output: a set of k clusters that minimises the square-error criterion.

Number of Clusters (figure)

K-means algorithm v1
Arbitrarily assign each object to a cluster (matrix U).
Repeat
  Update the cluster centres;
  Reassign objects to the clusters to which they are most similar;
Until no change;

K-means algorithm v2
Arbitrarily choose k objects as the initial cluster centres.
Repeat
  Reassign objects to the clusters to which they are most similar;
  Update the cluster centres;
Until no change;
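A compact sketch of version v2 in Python (assuming NumPy; the function and variable names and the sample data are illustrative):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Hard k-means: returns the cluster centres and the label of each object."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centres = X[rng.choice(len(X), size=k, replace=False)]   # arbitrary initial centres
    labels = None
    for _ in range(max_iter):
        # Reassign each object to the cluster whose centre is nearest (Euclidean).
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                            # no change: stop
        labels = new_labels
        # Update each centre as the mean of the objects assigned to it.
        for i in range(k):
            members = X[labels == i]
            if len(members) > 0:
                centres[i] = members.mean(axis=0)
    return centres, labels

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centres, labels = kmeans(X, k=2)
print(labels)        # e.g. [0 0 1 1]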
Algorithm details
- The algorithm tries to minimise the function J = Σ_{i=1..k} Σ_{x_e ∈ A_i} d_ie².
- d_ie is the distance between the element x_e (l characteristics) and the centre v_i of cluster i.

Cluster Centre
- The centre v_i of cluster i is an l-characteristics vector.
- Its jth co-ordinate is calculated as v_ij = (1/|A_i|) Σ_{x_e ∈ A_i} x_ej.

Detailed Algorithm
- Choose k (the number of clusters). Set the error ε (ε > 0) and the step r = 0.
- Arbitrarily set the matrix U(r). Do not forget: each element belongs to a single cluster, there is no empty cluster and no cluster has all elements.

Detailed Algorithm cont.
Repeat
  Calculate the centres of the clusters v_i(r)
  Calculate the distance d_ie(r) of each point to the centre of each cluster
  Generate U(r+1), recalculating all characteristic functions
Until ||U(r+1) - U(r)|| < ε

Matrix norms
- Consider a matrix U of n lines and n columns:
- Column norm: ||U|| = max_j Σ_i |u_ij|
- Line norm: ||U|| = max_i Σ_j |u_ij|

K-means problems?
- Suitable when clusters are compact clouds, well separated.
- Scalable, because the computational complexity is O(nkr), where r is the number of iterations.
- The necessity of choosing k is a disadvantage.
- Not suitable for non-convex shapes.
- Sensitive to noise and outliers, because they influence the means.
- Depends on the initial allocation.
Examples of results (figure)

K-medoids methods
- K-means is sensitive to outliers, since an object with an extremely large value may distort the distribution of data.
- Instead of taking the mean value, the most centrally located object (the medoid) is used as the reference point.
- The algorithm minimises the sum of dissimilarities between each object and the medoid of its cluster (similar to k-means).

K-medoids strategies
- Find k medoids arbitrarily.
- Each remaining object is clustered with the medoid to which it is most similar.
- Then iteratively replace one of the medoids by a non-medoid as long as the quality of the clustering is improved.
- The quality is measured using a cost function based on the average dissimilarity between the objects and the medoid of their cluster.

Reassignment costs
- Each time a reassignment occurs, a difference in the square error J is contributed.
- The cost function J calculates the total cost of replacing a current medoid m_j by a non-medoid m_random.
- If the total cost is negative, then m_j is replaced by m_random; otherwise the replacement is not accepted.
Replacing medoids case 1
- Object p belongs to medoid m_j. If m_j is replaced by m_random and p is closest to one of the m_i (i ≠ j), then p is reassigned to m_i.

Replacing medoids case 2
- Object p belongs to medoid m_j. If m_j is replaced by m_random and p is closest to m_random, then p is reassigned to m_random.

Replacing medoids case 3
- Object p belongs to medoid m_i (i ≠ j). If m_j is replaced by m_random and p is still closest to m_i, then nothing changes.

Replacing medoids case 4
- Object p belongs to medoid m_i (i ≠ j). If m_j is replaced by m_random and p is closest to m_random, then p is reassigned to m_random.

K-medoid algorithm
Arbitrarily choose k objects as the initial medoids.
Repeat
  Assign each remaining object to the cluster with the nearest medoid;
  Randomly select a non-medoid object m_random;
  Compute the total cost J of swapping m_j with m_random;
  If J < 0 then swap m_j with m_random;
Until no change
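A simplified sketch of this swap-based procedure in Python (assuming NumPy and Euclidean dissimilarities; this is a compact PAM-style illustration with a fixed iteration budget, not the exact formulation from the slides):

import numpy as np

def total_cost(X, medoid_idx):
    """Sum of dissimilarities between each object and its nearest medoid."""
    dists = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return dists.min(axis=1).sum()

def kmedoids(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    medoids = list(rng.choice(len(X), size=k, replace=False))   # initial medoids
    for _ in range(max_iter):
        current = total_cost(X, medoids)
        m_j = rng.integers(k)                                    # medoid to replace
        m_random = rng.integers(len(X))                          # candidate non-medoid
        if m_random in medoids:
            continue
        candidate = medoids.copy()
        candidate[m_j] = m_random
        # J < 0 means the swap lowers the total dissimilarity: accept it.
        if total_cost(X, candidate) - current < 0:
            medoids = candidate
    labels = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2).argmin(axis=1)
    return medoids, labels

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [100.0, 100.0]])
print(kmedoids(X, k=2))   # the outlier does not drag a medoid away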
Comparisons?
- K-medoids is more robust than k-means in the presence of noise and outliers.
- K-means is less costly in terms of processing time.

Fuzzy C-means
- Fuzzy version of k-means.
- Elements may belong to more than one cluster.
- The values of the characteristic function range from 0 to 1.

Fuzzy C-means setup
- Consider n objects X = {x_1, x_2, ..., x_n} and c clusters.
- Each object x_i is defined by l characteristics, x_i = (x_i1, x_i2, ..., x_il).
- Consider A a set of c clusters, A = {A_1, A_2, ..., A_c}.

Fuzzy C-means properties
- The union of all clusters makes the universe.
- There is no empty cluster.

Membership function
- The membership function u_ie takes values in [0, 1], and the memberships of each element over all clusters sum to one.

Membership matrix U
- Matrix containing the values of inclusion of each element into each cluster, in [0, 1].
- The matrix has c lines (clusters) and n columns (elements).
- The sum of each column must be equal to one.
- The sum of each line must be greater than 0 and less than n: no empty cluster and no cluster containing all elements.

Matrix examples
- (Figure: two example fuzzy membership matrices over elements X1 … X6.)
- Two examples of clustering. What do the clusters represent?
Fuzzy C-means algorithm v1
Arbitrarily assign each object to a cluster (matrix U).
Repeat
  Update the cluster centres;
  Reassign objects to the clusters to which they are most similar;
Until no change;

Fuzzy C-means algorithm v2
Arbitrarily choose c objects as the initial cluster centres.
Repeat
  Reassign objects to the clusters to which they are most similar;
  Update the cluster centres;
Until no change;

Algorithm details
- The algorithm tries to minimise the function J_m = Σ_{i=1..c} Σ_{e=1..n} (u_ie)^m d_ie², where m is the nebulisation factor.
- d_ie is the distance between the element x_e (l characteristics) and the centre v_i of cluster i.

Nebulisation factor
- m is the nebulisation factor.
- Its range is [1, ∞).
- If m = 1 the system is crisp.
- If m → ∞ all the membership values tend to 1/c.
- The most common values are 1.25 and 2.0.

Cluster Centre
- The centre v_i of cluster i is an l-characteristics vector.
- Its jth co-ordinate is calculated as v_ij = Σ_{e=1..n} (u_ie)^m x_ej / Σ_{e=1..n} (u_ie)^m.

Detailed Algorithm
- Choose c (the number of clusters). Set the error ε (ε > 0), the nebulisation factor m and the step r = 0.
- Arbitrarily set the matrix U(r). Do not forget: the memberships in each column sum to one, there is no empty cluster and no cluster contains all elements.

Detailed Algorithm cont.
Repeat
  Calculate the centres of the clusters v_i(r)
  Calculate the distance d_ie(r) of each point to the centre of each cluster
  Generate U(r+1), recalculating all membership functions (how?)
Until ||U(r+1) - U(r)|| < ε

How to recalculate?
- If some distance d_ie is equal to zero, then the element belongs to that cluster (membership 1) and to no other.
- Otherwise the membership grade is a weighted average of the distances to all centres: u_ie = 1 / Σ_{j=1..c} (d_ie / d_je)^(2/(m-1)).
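A minimal sketch of this iteration loop in Python (assuming NumPy; fuzzy_c_means is an illustrative name, the initial matrix U is random, and the zero-distance case is handled by clamping rather than the special rule above):

import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=1e-4, max_iter=100, seed=0):
    """Returns the fuzzy membership matrix U (c x n) and the cluster centres."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = len(X)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                                  # each column sums to one
    for _ in range(max_iter):
        W = U ** m                                      # weighted memberships
        centres = (W @ X) / W.sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[None, :, :] - centres[:, None, :], axis=2)   # d_ie
        d = np.fmax(d, 1e-12)                           # avoid division by zero
        # u_ie = 1 / sum_j (d_ie / d_je)^(2/(m-1))
        ratio = (d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1.0))
        U_new = 1.0 / ratio.sum(axis=1)
        if np.abs(U_new - U).max() < eps:               # ||U(r+1) - U(r)|| < eps
            return U_new, centres
        U = U_new
    return U, centres

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
U, centres = fuzzy_c_means(X, c=2)
print(np.round(U, 2))      # memberships of each element in each cluster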
Example of clustering result (figure)

Crisp K-NN
- A supervised clustering method (a classification method).
- The classes are defined beforehand.
- Each class is characterised by a set of elements.
- The number of elements may differ among classes.
- The main idea is to assign the sample to the class that contains more of its neighbours.

Crisp K-NN
- (Figure: labelled patterns w_1 … w_14 grouped into classes 1 to 5, and an unclassified sample s.)
- With 3 nearest neighbours, sample s is closest to pattern w_6 in class 5.

Crisp K-NN algorithm
- Consider X = {x_1, x_2, ..., x_t} a set of t labelled data.
- Each object x_i is defined by l characteristics, x_i = (x_i1, x_i2, ..., x_il).
- Input: y, an unclassified element.
- k is the number of closest neighbours of y.
- E is the set of the k nearest neighbours (NN).

Crisp K-NN algorithm
set k
{Calculating the NN}
for i = 1 to t
  Calculate the distance from y to x_i
  if i <= k then add x_i to E
  else if x_i is closer to y than any previous NN
    then delete the farthest neighbour and include x_i in the set E

K-NN algorithm cont.
Determine the majority class represented in the set E and include y in this class.
if there is a draw, then
  calculate the sum of distances from y to all neighbours in each class involved in the draw
  if the sums are different
    then add y to the class with the smallest sum
    else add y to the class where the last minimum was found
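A small sketch of the crisp rule in Python (assuming NumPy; the names and data are illustrative, ties are broken by the smallest distance sum as on the slide, and the final "last minimum" fallback is simplified here):

import numpy as np

def crisp_knn(X, labels, y, k=3):
    """Classify sample y by majority vote among its k nearest labelled neighbours."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X - np.asarray(y, dtype=float), axis=1)
    nn = np.argsort(d)[:k]                            # indices of the k nearest neighbours
    classes, votes = np.unique([labels[i] for i in nn], return_counts=True)
    best = classes[votes == votes.max()]
    if len(best) == 1:
        return best[0]
    # Draw: pick the drawn class whose neighbours have the smallest distance sum.
    sums = {c: d[[i for i in nn if labels[i] == c]].sum() for c in best}
    return min(sums, key=sums.get)

X = [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 0.9]]
labels = ['A', 'A', 'B', 'B']
print(crisp_knn(X, labels, y=[0.2, 0.0], k=3))        # 'A'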
Fuzzy K-NN
- The basis of the algorithm is to assign memberships as a function of the sample's distance from its k nearest neighbours and of the memberships of those neighbours in the possible classes.

Fuzzy K-NN
- Consider X = {x_1, x_2, ..., x_t} a set of t labelled data.
- Each object x_i is defined by l characteristics, x_i = (x_i1, x_i2, ..., x_il).
- Input: y, an unclassified element.
- k is the number of closest neighbours of y.
- E is the set of the k nearest neighbours (NN).
- μ_i(y) is the membership of y in class i.
- μ_ij is the membership in the ith class of the jth vector of the labelled set.

Fuzzy K-NN algorithm
set k
{Calculating the NN}
for i = 1 to t
  Calculate the distance from y to x_i
  if i <= k then add x_i to E
  else if x_i is closer to y than any previous NN
    then delete the farthest neighbour and include x_i in the set E

Fuzzy K-NN algorithm
for each class i
  Calculate μ_i(y) using μ_i(y) = Σ_{x_j ∈ E} μ_ij (1 / ||y - x_j||^(2/(m-1))) / Σ_{x_j ∈ E} (1 / ||y - x_j||^(2/(m-1)))
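A sketch of that membership computation in Python, assuming NumPy and a matrix mu holding the class memberships of the labelled vectors (names and data are illustrative):

import numpy as np

def fuzzy_knn_memberships(X, mu, y, k=3, m=2.0):
    """Membership of sample y in each class, weighted by the k nearest neighbours.

    X  : (t, l) labelled vectors
    mu : (t, n_classes) memberships of each labelled vector in each class
    """
    X, mu = np.asarray(X, dtype=float), np.asarray(mu, dtype=float)
    d = np.linalg.norm(X - np.asarray(y, dtype=float), axis=1)
    nn = np.argsort(d)[:k]
    w = 1.0 / np.fmax(d[nn], 1e-12) ** (2.0 / (m - 1.0))   # inverse-distance weights
    return (w[:, None] * mu[nn]).sum(axis=0) / w.sum()

X = [[0.0, 0.0], [0.1, 0.1], [1.0, 1.0], [1.1, 0.9]]
mu = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]      # crisp labels as memberships
print(np.round(fuzzy_knn_memberships(X, mu, y=[0.2, 0.0], k=3), 2))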
KNN + Fuzzy C-Means algorithm
- The idea is a two-layer clustering algorithm.
- First, an unsupervised tracking of cluster centres is made using K-NN rules.
- The second layer involves one iteration of fuzzy c-means to compute the membership degrees and the new fuzzy centres.
- Ref. N. Zahit et al., Fuzzy Sets and Systems 120 (2001) 239-247.

First Layer (K-NN)
- Let X = {x_1, ..., x_n} be a set of n unlabelled objects.
- c is the number of clusters.
- The first layer consists of partitioning X into c cells using the first part of K-NN.
- Each cell E_i (1 <= i <= c) is represented as E_i = (y_i, the K-NN of y_i, G_i).
- G_i is the centre of cell E_i, defined as the mean of the objects in E_i.

KNN-1FCMA settings
- Let X = {x_1, ..., x_n} be a set of n unlabelled objects.
- Fix c, the number of clusters.
- Choose m > 1 (nebulisation factor).
- Set k = Integer(n/c - 1).
- Let I = {1, 2, ..., n} be the set of all indexes of X.

KNN-1FCMA algorithm step 1
Calculate G_0
for i = 1 to c
  Search in I for the index of the farthest object y_i from G_{i-1}
  for j = 1 to n
    Calculate the distance from y_i to x_j
    if j <= k then add x_j to E_i
    else if x_j is closer to y_i than any previous NN
      then delete the farthest neighbour and include x_j in the set E_i

KNN-1FCMA algorithm cont.
Include y_i in the set E_i. Calculate G_i.
Delete the index of y_i and the indexes of the K-NN of y_i from I.
if I ≠ ∅ then
  for each remaining object x, determine the minimum distance to any centre G_i of E_i
  classify x to the nearest centre
  update all centres

KNN-1FCMA algorithm step 2
Compute the matrix U according to the fuzzy c-means membership equation.
Calculate all fuzzy centres using the fuzzy c-means centre equation.
Results KNN-1FCMA

Data     Elem   c   Misclassification rate     Number of iterations (avg)
                    FCMA     KNN-1FCMA         FCMA
S1        20    2      0             0           11
S2        60    3      1             0            8
S3        80    4      2             0           10
S4       120    6     13            13           19
IRIS23   150    2     14            13           10
IRIS     100    3     16            17           12
Clustering based on Equivalence Relations
- A crisp relation R on a universe X can be thought of as a relation from X to X.
- R is an equivalence relation if it has the following three properties:
- Reflexivity: (x_i, x_i) ∈ R
- Symmetry: (x_i, x_j) ∈ R ⇒ (x_j, x_i) ∈ R
- Transitivity: (x_i, x_j) ∈ R and (x_j, x_k) ∈ R ⇒ (x_i, x_k) ∈ R

Crisp tolerance relation
- R is a tolerance relation if it has the following two properties:
- Reflexivity: (x_i, x_i) ∈ R
- Symmetry: (x_i, x_j) ∈ R ⇒ (x_j, x_i) ∈ R

Composition of Relations
- (Figure: R is a relation from X to Y, S is a relation from Y to Z, and T = R ∘ S is their composition from X to Z.)

Composition of Crisp Relations
- The composition ∘ is similar to matrix multiplication, with multiplication replaced by min and addition by max: T(x, z) = max_y min(R(x, y), S(y, z)).

Transforming Relations
- A tolerance relation can be transformed into an equivalence relation by at most (n - 1) compositions with itself.
- n is the cardinality of the set on which R is defined.

Example of crisp classification
- Let X = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}.
- Let R be the relation "having the identical remainder after dividing each element by 3".
- This relation is an equivalence relation.
Relation Matrix

       1  2  3  4  5  6  7  8  9  10
   1   1  0  0  1  0  0  1  0  0  1
   2   0  1  0  0  1  0  0  1  0  0
   3   0  0  1  0  0  1  0  0  1  0
   4   1  0  0  1  0  0  1  0  0  1
   5   0  1  0  0  1  0  0  1  0  0
   6   0  0  1  0  0  1  0  0  1  0
   7   1  0  0  1  0  0  1  0  0  1
   8   0  1  0  0  1  0  0  1  0  0
   9   0  0  1  0  0  1  0  0  1  0
  10   1  0  0  1  0  0  1  0  0  1

Crisp Classification
- Consider equivalent columns.
- It is possible to group the elements into the following classes:
- R_0 = {3, 6, 9}
- R_1 = {1, 4, 7, 10}
- R_2 = {2, 5, 8}
Clustering and Fuzzy Equivalence Relations
- A fuzzy relation R on a universe X can be thought of as a relation from X to X.
- R is an equivalence relation if it has the following three properties:
- Reflexivity: μ_R(x_i, x_i) = 1
- Symmetry: μ_R(x_i, x_j) = μ_R(x_j, x_i)
- Transitivity: if μ_R(x_i, x_j) = λ_1 and μ_R(x_j, x_k) = λ_2, then μ_R(x_i, x_k) = λ with λ ≥ min(λ_1, λ_2)

Fuzzy tolerance relation
- R is a tolerance relation if it has only the first two properties:
- Reflexivity: μ_R(x_i, x_i) = 1
- Symmetry: μ_R(x_i, x_j) = μ_R(x_j, x_i)

Composition of Fuzzy Relations
- The composition ∘ is similar to matrix multiplication, with max-min in place of sum-product: μ_T(x, z) = max_y min(μ_R(x, y), μ_S(y, z)).

Distance Relation
- Let X be a set of data in ℝ^l.
- The distance function is a tolerance relation that can be transformed into an equivalence relation.
- The relation R can be defined from the Minkowski distance formula as μ_R(x_i, x_j) = 1 - δ d(x_i, x_j), where δ is a constant that ensures μ_R ∈ [0, 1] and is equal to the inverse of the largest distance in X.

Example of Fuzzy classification
- Let X = {(0,0), (1,1), (2,3), (3,1), (4,0)} be a set of points in ℝ².
- Set q = 2 (Euclidean distances). The largest distance is 4 (between x_1 and x_5), so δ = 0.25.
- The relation R can be calculated with the equation above.

Points to be classified (figure)

Tolerance matrix
- The matrix calculated by the equation is (values rounded to two decimals)

        x1    x2    x3    x4    x5
  x1   1.00  0.65  0.10  0.21  0.00
  x2   0.65  1.00  0.44  0.50  0.21
  x3   0.10  0.44  1.00  0.44  0.10
  x4   0.21  0.50  0.44  1.00  0.65
  x5   0.00  0.21  0.10  0.65  1.00

- This is a tolerance relation that needs to be transformed into an equivalence relation.

Equivalence matrix
- The transformed matrix (the max-min transitive closure) is

        x1    x2    x3    x4    x5
  x1   1.00  0.65  0.44  0.50  0.50
  x2   0.65  1.00  0.44  0.50  0.50
  x3   0.44  0.44  1.00  0.44  0.44
  x4   0.50  0.50  0.44  1.00  0.65
  x5   0.50  0.50  0.44  0.65  1.00

Results of clustering
- Taking λ-cuts of the fuzzy equivalence relation at the values λ = 0.44, 0.5, 0.65 and 1.0 we get the following classes:
- R_0.44 = [{x_1, x_2, x_3, x_4, x_5}]
- R_0.5 = [{x_1, x_2, x_4, x_5}, {x_3}]
- R_0.65 = [{x_1, x_2}, {x_3}, {x_4, x_5}]
- R_1.0 = [{x_1}, {x_2}, {x_3}, {x_4}, {x_5}]
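A sketch in Python that reproduces this example: it builds the distance-based tolerance relation, closes it with max-min composition, and takes a λ-cut (assuming NumPy; the function names are illustrative):

import numpy as np

def tolerance_relation(points):
    """mu_R(x_i, x_j) = 1 - delta * d(x_i, x_j), with delta = 1 / largest distance."""
    P = np.asarray(points, dtype=float)
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    return 1.0 - d / d.max()

def transitive_closure(R):
    """Max-min transitive closure: compose R with itself until it stops changing."""
    while True:
        R2 = np.maximum(R, np.max(np.minimum(R[:, :, None], R[None, :, :]), axis=1))
        if np.allclose(R2, R):
            return R2
        R = R2

def lambda_cut(R, lam):
    """Group indices whose closed relation value is at least lam."""
    classes, seen = [], set()
    for i in range(len(R)):
        if i not in seen:
            cls = {j for j in range(len(R)) if R[i, j] >= lam}
            classes.append(sorted(cls))
            seen |= cls
    return classes

X = [(0, 0), (1, 1), (2, 3), (3, 1), (4, 0)]
E = transitive_closure(tolerance_relation(X))
print(np.round(E, 2))
print(lambda_cut(E, 0.5))      # [[0, 1, 3, 4], [2]]  i.e. {x1, x2, x4, x5} and {x3}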
Gustafson-Kessel method
- This method (GK) is a fuzzy clustering method similar to fuzzy c-means (FCM).
- The difference is the way the distance is calculated.
- FCM uses Euclidean distances.
- GK uses Mahalanobis distances.

Gustafson-Kessel method
- The Mahalanobis distance is calculated as d_ie² = (x_e - v_i)^T A_i (x_e - v_i).
- The matrices A_i are given by A_i = (ρ_i det(S_i))^(1/l) S_i^(-1), where S_i is the fuzzy covariance matrix of cluster i and ρ_i is its (fixed) volume.

Gustafson-Kessel method
- The fuzzy covariance matrix is S_i = Σ_{e=1..n} (u_ie)^m (x_e - v_i)(x_e - v_i)^T / Σ_{e=1..n} (u_ie)^m.

GK comments
- The clusters are hyperellipsoids in ℝ^l.
- The hyperellipsoids have approximately the same size.
- In order for S_i^(-1) to be computable, the number of samples n must be at least equal to the number of dimensions l plus 1.
Gath-Geva method
- It is also known as Gaussian Mixture Decomposition.
- It is similar to the FCM method.
- The Gauss distance is used instead of the Euclidean distance.
- The clusters no longer have a definite shape and may have various sizes.
Gath-Geva Method
- The Gauss distance is given by d_ie² = (sqrt(det(S_i)) / P_i) exp((x_e - v_i)^T S_i^(-1) (x_e - v_i) / 2), with A_i = S_i^(-1).
Gath-Geva Method
- The term P_i is the a priori probability that a sample belongs to cluster i.

Gath-Geva Comments
- P_i is a parameter that influences the size of a cluster.
- Bigger clusters attract more elements.
- The exponential term makes it more difficult to avoid local minima.
- Usually another clustering method is used to initialise the partition matrix U.
Cluster Validation
- The number of clusters is not always known in advance.
- In many problems the number of classes is known, but it may not be the best configuration.
- It is necessary to study methods that indicate and/or validate the number of classes.

Partition Coefficient
- This coefficient is defined as F = (1/n) Σ_{e=1..n} Σ_{i=1..c} (u_ie)².

Partition Coefficient
- When F = 1/c the system is entirely fuzzy, since every element belongs to all clusters with the same degree of membership.
- When F = 1 the system is rigid and the membership values are either 1 or 0.

Partition Coefficient Example
- (Figure: a partition matrix over w1, w2, w3 and its partition coefficient.)

Partition Coefficient Example
- (Figure: a partition matrix over w1, w2, w3, w4 and its partition coefficient.)

Partition Coefficient comments
- F is inversely proportional to the number of clusters.
- F is more appropriate for validating the best partition among several runs of an algorithm.

Partition Entropy
- Partition Entropy is defined as H = -(1/n) Σ_{e=1..n} Σ_{i=1..c} u_ie log(u_ie).
- When H = 0 the partition is rigid.
- When H = log(c) the fuzziness is maximum.
- 0 ≤ 1 - F ≤ H.
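A short sketch computing both validity measures for a membership matrix U with c lines and n columns (assuming NumPy; the example matrices are illustrative):

import numpy as np

def partition_coefficient(U):
    """F = (1/n) * sum of squared memberships; 1/c (fuzzy) <= F <= 1 (rigid)."""
    U = np.asarray(U, dtype=float)
    return (U ** 2).sum() / U.shape[1]

def partition_entropy(U):
    """H = -(1/n) * sum of u*log(u); 0 (rigid) <= H <= log(c) (maximum fuzziness)."""
    U = np.asarray(U, dtype=float)
    return -(U * np.log(np.where(U > 0, U, 1.0))).sum() / U.shape[1]

# A rigid partition and an entirely fuzzy one over c = 2 clusters, n = 4 elements.
rigid = [[1, 1, 0, 0], [0, 0, 1, 1]]
fuzzy = [[0.5] * 4, [0.5] * 4]
print(partition_coefficient(rigid), partition_entropy(rigid))   # 1.0 0.0
print(partition_coefficient(fuzzy), partition_entropy(fuzzy))   # 0.5 0.693... (= log 2)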
Partition Entropy comments
- Partition Entropy (H) is directly proportional to the number of partitions.
- H is more appropriate for validating the best partition among several runs of an algorithm.

Compactness and Separation
- CS is defined as CS = J / (n · d_min²).
- J is the objective function minimised by the FCM algorithm.
- m is the fuzzy factor.
- d_min is the minimum Euclidean distance between the centres of two clusters.

Compactness and Separation
- The minimum distance is defined as d_min = min_{i ≠ j} ||v_i - v_j||.
- The complete formula is CS = Σ_{i=1..c} Σ_{e=1..n} (u_ie)^m ||x_e - v_i||² / (n · min_{i ≠ j} ||v_i - v_j||²).

Compactness and Separation
- This is a very complete validation measure.
- It validates the number of clusters and checks the separation among the clusters.
- From our experiments it works well even when the degree of superposition is high.