Cluster Analysis Adriano Joaquim de O Cruz ©2002 NCE/UFRJ

Adriano Cruz *NCE e IM - UFRJ Cluster 2 What is cluster analysis? = The process of grouping a set of physical or abstract objects into classes of similar objects. = The class label of each class is unknown. = Classification separates objects into classes when the labels are known.

Adriano Cruz *NCE e IM - UFRJ Cluster 3 What is cluster analysis? cont. = Clustering is a form of learning by observations. = Neural Networks learn by examples. = Unsupervised learning.

Adriano Cruz *NCE e IM - UFRJ Cluster 4 Applications = In business it helps to discover distinct groups of customers. = In data mining it is used to gain insight into the distribution of data and to observe the characteristics of each cluster. = Pre-processing step for classification. = Pattern recognition.

Adriano Cruz *NCE e IM - UFRJ Cluster 5 Requirements = Scalability: work with large databases. = Ability to deal with different types of attributes (not only interval-based data). = Clusters of arbitrary shape, not only spherical. = Minimal requirements for domain knowledge. = Ability to deal with noisy data.

Adriano Cruz *NCE e IM - UFRJ Cluster 6 Requirements cont. = Insensitivity to the order of input records. = Work with samples of high dimensionality. = Constraint-based clustering = Interpretability and usability: results should be easily interpretable.

Adriano Cruz *NCE e IM - UFRJ Cluster 7 Heuristic Clustering Techniques = Incomplete or heuristic clustering: geometrical methods or projection techniques. = Dimension reduction techniques (e.g. PCA) are used to obtain a graphical representation in two or three dimensions. = Heuristic methods based on visualisation are used to determine the clusters.

Adriano Cruz *NCE e IM - UFRJ Cluster 8 Deterministic Crisp Clustering = Each datum will be assigned to only one cluster. = Each cluster partition defines an ordinary partition of the data set.

Adriano Cruz *NCE e IM - UFRJ Cluster 9 Overlapping Crisp Clustering = Each datum will be assigned to at least one cluster. = Elements may belong to more than one cluster.

Adriano Cruz *NCE e IM - UFRJ Cluster 10 Probabilistic Clustering = For each element, a probability distribution over the clusters is determined. = The distribution specifies the probability with which a datum is assigned to a cluster. = If the probabilities are interpreted as degrees of membership then these are fuzzy clustering techniques.

Adriano Cruz *NCE e IM - UFRJ Cluster 11 Possibilistic Clustering = Degrees of membership or possibility indicate to what extent a datum belongs to the clusters. = Possibilistic cluster analysis drops the constraint that the sum of memberships of each datum to all clusters is equal to one.

Adriano Cruz *NCE e IM - UFRJ Cluster 12 Hierarchical Clustering = Descending techniques: they divide the data into more fine-grained classes. = Ascending techniques: they combine small classes into more coarse-grained ones.

Adriano Cruz *NCE e IM - UFRJ Cluster 13 Objective Function Clustering = An objective function assigns to each cluster partition a value that has to be optimised. = This is strictly an optimisation problem.

Adriano Cruz *NCE e IM - UFRJ Cluster 14 Data Matrices = Data matrix: represents n objects with p characteristics. Ex. person = {age, sex, income,...} = Dissimilarity matrix: represents a collection of dissimilarities between all pairs of objects.

Adriano Cruz *NCE e IM - UFRJ Cluster 15 Dissimilarities = Dissimilarity measures some form of distance between objects. = Clustering algorithms use dissimilarities to cluster data. = How can dissimilarities be measured?

Adriano Cruz *NCE e IM - UFRJ Cluster 16 Data Types = Interval-scaled variables are continuous measurements of a linear scale. Ex. height, weight, temperature. = Binary variables have only two states. Ex. smoker, fever, client, owner. = Nominal variables are a generalisation of a binary variable with m states. Ex. map colour, marital status.

Adriano Cruz *NCE e IM - UFRJ Cluster 17 Data Types cont. = Ordinal variables are ordered nominal variables. Ex. Olympic medals, professional ranks. = Ratio-scaled variables have a non-linear scale. Ex. growth of a bacterial population.

Adriano Cruz *NCE e IM - UFRJ Cluster 18 Interval-scaled variables = Interval-scaled variables are continuous measurements of a linear scale. Ex. height, weight, temperature. = Interval-scaled variables are dependent on the units used. = Measurement unit can affect analysis, so standardisation may be used.

Adriano Cruz *NCE e IM - UFRJ Cluster 19 Standardisation = Converting original measurements to unitless values. = Attempts to give all variables equal weight. = Useful when there is no prior knowledge of the data.

Adriano Cruz *NCE e IM - UFRJ Cluster 20 Standardisation algorithm = Z-scores indicate how far and in what direction an item deviates from its distribution's mean, expressed in units of its distribution's standard deviation. = The transformed scores will have a mean of zero and a standard deviation of one. = It is useful when comparing relative standings of items from distributions with different means and/or different standard deviations.

Adriano Cruz *NCE e IM - UFRJ Cluster 21 Standardisation algorithm = Consider n values of a variable x. = Calculate the mean value. = Calculate the standard deviation. = Calculate the z-score.
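The formulas for these three steps were figures on the slide. A minimal sketch in Python, assuming the usual z-score definition z = (x − mean) / standard deviation; the population standard deviation is used here, and the sample values are made up for illustration (the sample form, or the mean absolute deviation, are equally common choices).

```python
import numpy as np

# Hypothetical values of one variable (e.g. heights in cm).
x = np.array([150.0, 160.0, 170.0, 180.0, 190.0])

mean = x.mean()               # step 1: mean value
std = x.std()                 # step 2: (population) standard deviation
z = (x - mean) / std          # step 3: z-scores, mean 0 and standard deviation 1

print(z)                      # [-1.414..., -0.707..., 0.0, 0.707..., 1.414...]
```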

Adriano Cruz *NCE e IM - UFRJ Cluster 22 Z-scores example

Adriano Cruz *NCE e IM - UFRJ Cluster 23 Real heights and ages charts

Adriano Cruz *NCE e IM - UFRJ Cluster 24 Z-scores for heights and ages

Adriano Cruz *NCE e IM - UFRJ Cluster 25 Data chart

Adriano Cruz *NCE e IM - UFRJ Cluster 26 How to calculate dissimilarities? = The most popular methods are based on the distance between pairs of objects. = Minkowski distance: = p is the number of characteristics = q is the distance type = q=2 (Euclidean distance), q=1 (Manhattan distance)
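The Minkowski formula itself was a figure on the slide; the sketch below assumes the standard form d(i, j) = (Σ_f |x_if − x_jf|^q)^(1/q) taken over the p characteristics, which matches the definitions of p and q given above.

```python
import numpy as np

def minkowski(xi, xj, q=2):
    """Minkowski distance of order q between two objects
    described by the same p characteristics."""
    xi, xj = np.asarray(xi, float), np.asarray(xj, float)
    return float((np.abs(xi - xj) ** q).sum() ** (1.0 / q))

a, b = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(minkowski(a, b, q=2))   # 5.0  (Euclidean)
print(minkowski(a, b, q=1))   # 7.0  (Manhattan)
```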

Adriano Cruz *NCE e IM - UFRJ Cluster 27 Distances

Adriano Cruz *NCE e IM - UFRJ Cluster 28 Binary Variables = Binary variables have only two states. = States can be symmetric or asymmetric. = Binary variables are symmetric if both states are equally valuable. Ex. gender = When the states are not equally important the variable is asymmetric. Ex. disease tests (1-positive; 0-negative)

Adriano Cruz *NCE e IM - UFRJ Cluster 29 Contingency tables = Consider objects described by p binary variables

Adriano Cruz *NCE e IM - UFRJ Cluster 30 Symmetric Variables = Similarity based on symmetric variables is invariant. = Simple matching coefficient:

Adriano Cruz *NCE e IM - UFRJ Cluster 31 Asymmetric Variables = Similarity based on asymmetric variables is not invariant. = Two ones are more important than two zeros = Jaccard coefficient:
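The two coefficients were figures on the last two slides. The sketch below assumes the usual contingency counts a (both objects 1), b and c (the two kinds of mismatch) and d (both 0), and the common dissimilarity forms (b + c) / (a + b + c + d) for simple matching and (b + c) / (a + b + c) for Jaccard; the binary vectors are made up, not the example data of the next slides.

```python
import numpy as np

def binary_dissimilarities(x, y):
    """Simple matching and Jaccard dissimilarities between two binary
    vectors, computed from the 2x2 contingency counts."""
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    a = np.sum(x & y)        # 1-1 matches
    b = np.sum(x & ~y)       # 1-0 mismatches
    c = np.sum(~x & y)       # 0-1 mismatches
    d = np.sum(~x & ~y)      # 0-0 matches
    matching = (b + c) / (a + b + c + d)   # symmetric variables
    jaccard = (b + c) / (a + b + c)        # asymmetric variables (ignores d)
    return matching, jaccard

obj_a = [1, 0, 1, 0, 0, 0]
obj_b = [1, 0, 1, 0, 1, 0]
print(binary_dissimilarities(obj_a, obj_b))   # (0.1666..., 0.3333...)
```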

Adriano Cruz *NCE e IM - UFRJ Cluster 32 Computing dissimilarities

Adriano Cruz *NCE e IM - UFRJ Cluster 33 Computing dissimilarities Jim and Mary have the highest dissimilarity value, so they have a low probability of having the same disease.

Adriano Cruz *NCE e IM - UFRJ Cluster 34 Nominal Variables = A nominal variable is a generalisation of the binary variable. = A nominal variable can take more than two states = Ex. Marital status: married, single, divorced = Each state can be represented by a number or a letter = There is no specific ordering

Adriano Cruz *NCE e IM - UFRJ Cluster 35 Computing dissimilarities = Consider two objects i and j, described by nominal variables = Each object has p characteristics = m is the number of matches = The dissimilarity is d(i, j) = (p − m) / p

Adriano Cruz *NCE e IM - UFRJ Cluster 36 Binarising nominal variables = A nominal variable can be encoded by creating a new binary variable for each state = Example: = Marital status = {married, single, divorced} = Married: 1=yes – 0=no = Single: 1=yes – 0=no = Divorced: 1=yes – 0=no = Ex. Marital status = {married} = married = 1, single = 0, divorced = 0

Adriano Cruz *NCE e IM - UFRJ Cluster 37 Ordinal variables = A discrete ordinal variable is similar to a nominal variable, except that the states are ordered in a meaningful sequence = Ex. Bronze, silver and gold medals = Ex. Assistant, associate, full member

Adriano Cruz *NCE e IM - UFRJ Cluster 38 Computing dissimilarities = Consider n objects defined by a set of ordinal variables = f is one of these ordinal variables and has M_f states. These states define the ranking r_if ∈ {1, …, M_f}.

Adriano Cruz *NCE e IM - UFRJ Cluster 39 Steps to calculate dissimilarities = Assume that the value of f for the ith object is x_if. Replace each x_if by its corresponding rank r_if ∈ {1, …, M_f}. = Since the number of states of each variable differs, it is often necessary to map the ranks onto [0.0, 1.0] using the equation = Dissimilarity can then be computed using the distance measures of interval-scaled variables
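The mapping equation referenced above was a figure; the sketch below assumes the usual form z_if = (r_if − 1) / (M_f − 1), which sends the smallest rank to 0.0 and the largest to 1.0.

```python
def ordinal_to_unit_interval(rank, num_states):
    """Map a rank r in {1, ..., M_f} onto the interval [0.0, 1.0]."""
    return (rank - 1) / (num_states - 1)

# Hypothetical medal variable with M_f = 3 states: bronze=1, silver=2, gold=3.
for r in (1, 2, 3):
    print(r, ordinal_to_unit_interval(r, 3))   # 0.0, 0.5, 1.0
```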

Adriano Cruz *NCE e IM - UFRJ Cluster 40 Ratio-scaled variables = Variables on a non-linear scale, such as exponential = To compute dissimilarities there are three methods –Treat as interval-scaled. Not always good. –Apply a transformation like y=log(x) and treat as interval-scaled –Treat as ordinal data and assume ranks as interval-scaled

Adriano Cruz *NCE e IM - UFRJ Cluster 41 Variables of mixed types = One technique is to bring all variables onto a common scale in the interval [0.0, 1.0] = Suppose that the data set contains p variables of mixed type. The dissimilarity between i and j is

Adriano Cruz *NCE e IM - UFRJ Cluster 42 Variables of mixed types = The dissimilarity between i and j is

Adriano Cruz *NCE e IM - UFRJ Cluster 43 Variables of mixed types cont. = The contribution of each variable depends on its type = f is binary or nominal: = f is interval-based: = f is ordinal or ratio-scaled: compute ranks and treat as interval-based
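A small sketch of how the per-variable contributions can be combined, assuming the common weighted form d(i, j) = Σ_f δ_ij(f) d_ij(f) / Σ_f δ_ij(f) (the slide's own formulas were figures): a nominal or binary variable contributes 0 on a match and 1 otherwise, and an interval-based variable contributes the absolute difference normalised by the variable's range. The objects, types and ranges below are illustrative.

```python
def mixed_dissimilarity(xi, xj, types, ranges):
    """Dissimilarity between two objects described by variables of mixed types.

    types[f]  -- 'nominal' (also covers binary) or 'interval' for variable f
    ranges[f] -- max minus min of variable f over the data set (interval only)
    """
    num, den = 0.0, 0.0
    for f, kind in enumerate(types):
        if kind == 'nominal':
            d_f = 0.0 if xi[f] == xj[f] else 1.0
        else:  # interval-based: range-normalised absolute difference
            d_f = abs(xi[f] - xj[f]) / ranges[f]
        num += d_f
        den += 1.0       # delta = 1 when both measurements are present
    return num / den

# Hypothetical objects: (marital status, height in cm, smoker)
types = ['nominal', 'interval', 'nominal']
ranges = [None, 50.0, None]
print(mixed_dissimilarity(('single', 170.0, 1), ('married', 180.0, 1), types, ranges))  # 0.4
```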

Adriano Cruz *NCE e IM - UFRJ Cluster 44 Clustering Methods = Partitioning = Hierarchical = Density-based = Grid-based = Model-based

Adriano Cruz *NCE e IM - UFRJ Cluster 45 Partitioning Methods = Given n objects, these methods create k partitions. = Each partition must contain at least one element. = An iterative relocation technique is used to improve the partitioning. = Distance is the usual criterion.

Adriano Cruz *NCE e IM - UFRJ Cluster 46 Partitioning Methods cont. = They work well for finding spherical-shaped clusters. = They are not efficient on very large databases. = K-means where each cluster is represented by the mean value of the objects in the cluster. = K-medoids where each cluster is represented by an object near the centre of the cluster.

Adriano Cruz *NCE e IM - UFRJ Cluster 47 Hierarchical Methods = Create a hierarchical decomposition of the set = Agglomerative approaches start with each object forming a separate group = They merge objects or groups until all objects belong to one group or a termination condition occurs = Divisive approaches start with all objects in the same cluster = Each successive iteration splits a cluster until all objects are in separate clusters or a termination condition occurs

Adriano Cruz *NCE e IM - UFRJ Cluster 48 Density-based methods = The method keeps growing a cluster as long as the density in its neighbourhood exceeds some threshold = Able to find clusters of arbitrary shapes

Adriano Cruz *NCE e IM - UFRJ Cluster 49 Grid-based methods = Grid methods divide the object space into a finite number of cells, forming a grid-like structure = Fast processing time, independent of the number of objects

Adriano Cruz *NCE e IM - UFRJ Cluster 50 Model-based methods = Model-based methods hypothesise a model for each cluster and find the best fit of the data to the given model

Adriano Cruz *NCE e IM - UFRJ Cluster 51 Partition methods = Given a database of n objects, a partition method organises them into k clusters (k <= n) = The methods try to minimise an objective function such as distance = Similar objects are “close” to each other

Adriano Cruz *NCE e IM - UFRJ Cluster 52 K-means algorithm = Based on the Euclidean distances among elements of the cluster = Centre of the cluster is the mean value of the objects in the cluster. = Classifies objects in a hard way. Each object belongs to a single cluster.

Adriano Cruz *NCE e IM - UFRJ Cluster 53 K-means algorithm = Consider n objects (X = {x_1, x_2, ..., x_n}) and k clusters. = Each object x_i is defined by l characteristics, x_i = (x_i1, x_i2, ..., x_il). = Consider A a set of k clusters (A = {A_1, A_2, ..., A_k}).

Adriano Cruz *NCE e IM - UFRJ Cluster 54 K-means properties = The union of all clusters makes the Universe = No element belongs to more than one cluster = There is no empty cluster

Adriano Cruz *NCE e IM - UFRJ Cluster 55 Membership function

Adriano Cruz *NCE e IM - UFRJ Cluster 56 Membership matrix U = Matrix containing the values of inclusion of each element into each cluster (0 or 1). = The matrix has k (clusters) lines and n (elements) columns. = The sum of all elements in a column must be equal to one (each element belongs to exactly one cluster). = The sum of each line must be less than n and greater than 0: no empty cluster and no cluster containing all elements.

Adriano Cruz *NCE e IM - UFRJ Cluster 57 Matrix examples = Two example membership matrices for the objects X1, …, X6. What do the clusters represent?

Adriano Cruz *NCE e IM - UFRJ Cluster 58 Matrix examples cont. = U1 and U2 are the same matrices.

Adriano Cruz *NCE e IM - UFRJ Cluster 59 How many clusters? = The cardinality of any hard k-partition of n elements is

Adriano Cruz *NCE e IM - UFRJ Cluster 60 How many clusters (example)? = Consider the matrix U2 (k=3, n=6)
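The count referred to on the last two slides is presumably the number of distinct hard k-partitions of n elements, i.e. the Stirling number of the second kind S(n, k) = (1/k!) Σ_{j=0..k} (−1)^(k−j) C(k, j) j^n; a small sketch under that assumption.

```python
from math import comb, factorial

def hard_partitions(n, k):
    """Number of ways to split n elements into exactly k non-empty clusters
    (Stirling number of the second kind)."""
    return sum((-1) ** (k - j) * comb(k, j) * j ** n for j in range(k + 1)) // factorial(k)

print(hard_partitions(6, 3))   # 90 hard 3-partitions of 6 elements (the k=3, n=6 case above)
```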

Adriano Cruz *NCE e IM - UFRJ Cluster 61 K-means inputs and outputs = Inputs: the number of clusters k and a database containing n objects with l characteristics each. = Output: A set of k clusters that minimises the square-error criterion.

Adriano Cruz *NCE e IM - UFRJ Cluster 62 Number of Clusters

Adriano Cruz *NCE e IM - UFRJ Cluster 63 K-means algorithm v1 Arbitrarily assign each object to a cluster (matrix U). Repeat Update the cluster centres; Reassign objects to the clusters to which the objects are most similar; Until no change;

Adriano Cruz *NCE e IM - UFRJ Cluster 64 K-means algorithm v2 Arbitrarily choose k objects as the initial cluster centres. Repeat Reassign objects to the clusters to which the objects are most similar. Update the cluster centres. Until no change

Adriano Cruz *NCE e IM - UFRJ Cluster 65 Algorithm details = The algorithm tries to minimise the objective function = d_ie is the distance between the element x_e (l characteristics) and the centre of cluster i (v_i)

Adriano Cruz *NCE e IM - UFRJ Cluster 66 Cluster Centre = The centre of cluster i (v_i) is a vector of l characteristics. = Its jth co-ordinate is calculated as

Adriano Cruz *NCE e IM - UFRJ Cluster 67 Detailed Algorithm = Choose k (number of clusters). = Set the error tolerance (ε > 0) and the step counter (r = 0). = Arbitrarily set the matrix U(r). Do not forget: each element belongs to a single cluster, there is no empty cluster and no cluster has all elements.

Adriano Cruz *NCE e IM - UFRJ Cluster 68 Detailed Algorithm cont. Repeat Calculate the centres of the clusters v_i(r) Calculate the distance d_i(r) of each point to the centre of the clusters Generate U(r+1) by recalculating all characteristic functions using the equations Until ||U(r+1) − U(r)|| < ε
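A compact sketch of the loop described on the last few slides, under the usual assumptions: Euclidean distance, centres updated as the mean of the objects currently assigned to them, and iteration until the assignment stops changing. The data and the function name are illustrative.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Hard k-means: returns the cluster label of each object and the centres."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].copy()  # arbitrary initial centres
    labels = None
    for _ in range(max_iter):
        # Reassign each object to the closest centre (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                   # no change: stop
        labels = new_labels
        # Update each centre as the mean of the objects assigned to it.
        for i in range(k):
            if np.any(labels == i):
                centres[i] = X[labels == i].mean(axis=0)
    return labels, centres

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
print(k_means(X, k=2))
```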

Adriano Cruz *NCE e IM - UFRJ Cluster 69 Matrix norms = Consider a matrix U of n lines and n columns: = Column norm = Line norm

Adriano Cruz *NCE e IM - UFRJ Cluster 70 K-means problems? = Suitable when clusters are compact clouds that are well separated. = Scalable, because the computational complexity is O(nkr), where r is the number of iterations. = The necessity of choosing k is a disadvantage. = Not suitable for non-convex shapes. = It is sensitive to noise and outliers because they influence the means. = Depends on the initial allocation.

Adriano Cruz *NCE e IM - UFRJ Cluster 71 Examples of results

Adriano Cruz *NCE e IM - UFRJ Cluster 72 K-medoids methods = K-means is sensitive to outliers since an object with an extremely large value may distort the distribution of data. = Instead of taking the mean value, the most centrally located object (medoid) is used as the reference point. = The algorithm minimises the sum of dissimilarities between each object and the medoid (similar to k-means)

Adriano Cruz *NCE e IM - UFRJ Cluster 73 K-medoids strategies = Find k medoids arbitrarily. = Each remaining object is clustered with the medoid to which it is most similar. = Then one of the medoids is iteratively replaced by a non-medoid as long as the quality of the clustering is improved. = The quality is measured using a cost function that measures the average dissimilarity between the objects and the medoid of their cluster.

Adriano Cruz *NCE e IM - UFRJ Cluster 74 Reassignment costs = Each time a reassignment occurs, a difference in the square-error J is contributed. = The cost function J calculates the total cost of replacing a current medoid by a non-medoid. = If the total cost is negative then m_j is replaced by m_random, otherwise the replacement is not accepted.

Adriano Cruz *NCE e IM - UFRJ Cluster 75 Replacing medoids case 1 = Object p belongs to medoid m_j. If m_j is replaced by m_random and p is closest to one of the m_i (i ≠ j), then p is reassigned to m_i

Adriano Cruz *NCE e IM - UFRJ Cluster 76 Replacing medoids case 2 = Object p belongs to medoid m_j. If m_j is replaced by m_random and p is closest to m_random, then p is reassigned to m_random

Adriano Cruz *NCE e IM - UFRJ Cluster 77 Replacing medoids case 3 = Object p belongs to medoid m_i (i ≠ j). If m_j is replaced by m_random and p is still closest to m_i, then the assignment does not change.

Adriano Cruz *NCE e IM - UFRJ Cluster 78 Replacing medoids case 4 = Object p belongs to medoid m_i (i ≠ j). If m_j is replaced by m_random and p is closest to m_random, then p is reassigned to m_random.

Adriano Cruz *NCE e IM - UFRJ Cluster 79 K-medoid algorithm Arbitrarily choose k objects as the initial medoids. Repeat Assign each remaining object to the cluster with the nearest medoid; Randomly select a non-medoid object, m_random; Compute the total cost J of swapping m_j with m_random; If J < 0 then swap m_j with m_random; Until no change
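A sketch of the medoid-swapping idea in the PAM style: the cost J of a clustering is the sum of dissimilarities from each object to its nearest medoid, and a random medoid/non-medoid swap is accepted only when it lowers that cost. This illustrates the principle rather than reproducing the exact procedure of the slides.

```python
import numpy as np

def total_cost(X, medoids):
    """Sum of Euclidean distances from each object to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return d.min(axis=1).sum()

def k_medoids(X, k, n_trials=200, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))   # arbitrary initial medoids
    cost = total_cost(X, medoids)
    for _ in range(n_trials):
        j = rng.integers(k)                    # medoid position to replace
        candidate = rng.integers(len(X))       # randomly selected object
        if candidate in medoids:
            continue
        trial = medoids.copy()
        trial[j] = candidate
        trial_cost = total_cost(X, trial)
        if trial_cost < cost:                  # negative swap cost: accept the replacement
            medoids, cost = trial, trial_cost
    labels = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2).argmin(axis=1)
    return medoids, labels

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8], [20.0, 0.0]])
print(k_medoids(X, k=2))
```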

Adriano Cruz *NCE e IM - UFRJ Cluster 80 Comparisons? = K-medoids is more robust than k-means in the presence of noise and outliers. = K-means is less costly in terms of processing time.

Adriano Cruz *NCE e IM - UFRJ Cluster 81 Fuzzy C-means = Fuzzy version of K-means = Elements may belong to more than one cluster = Values of characteristic function range from 0 to 1.

Adriano Cruz *NCE e IM - UFRJ Cluster 82 Fuzzy C-means setup = Consider n objects (X = {x_1, x_2, ..., x_n}) and c clusters. = Each object x_i is defined by l characteristics, x_i = (x_i1, x_i2, ..., x_il). = Consider A a set of c clusters (A = {A_1, A_2, ..., A_c}).

Adriano Cruz *NCE e IM - UFRJ Cluster 83 Fuzzy C-means properties = The union of all clusters makes the Universe = There is no empty cluster

Adriano Cruz *NCE e IM - UFRJ Cluster 84 Membership function

Adriano Cruz *NCE e IM - UFRJ Cluster 85 Membership matrix U = Matrix containing the values of inclusion of each element into each cluster, in the interval [0, 1]. = The matrix has c (clusters) lines and n (elements) columns. = The sum of all elements in a column must be equal to one. = The sum of each line must be less than n and greater than 0: no empty cluster and no cluster containing all elements.

Adriano Cruz *NCE e IM - UFRJ Cluster 86 Matrix examples = Two example membership matrices for the objects X1, …, X6. What do the clusters represent?

Adriano Cruz *NCE e IM - UFRJ Cluster 87 Fuzzy C-means algorithm v1 Arbitrarily assign each object to a cluster (matrix U). Repeat Update the cluster centres; Reassign objects to the clusters to which the objects are most similar; Until no change;

Adriano Cruz *NCE e IM - UFRJ Cluster 88 Fuzzy C-means algorithm v2 Arbitrarily choose c objects as the initial cluster centres. Repeat Reassign objects to the clusters to which the objects are most similar. Update the cluster centres. Until no change

Adriano Cruz *NCE e IM - UFRJ Cluster 89 Algorithm details = The algorithm tries to minimise the objective function, where m is the nebulisation factor. = d_ie is the distance between the element x_e (l characteristics) and the centre of cluster i (v_i)

Adriano Cruz *NCE e IM - UFRJ Cluster 90 Nebulisation factor = m is the nebulisation factor. = Its value lies in the range [1, ∞). = If m = 1 then the system is crisp. = If m → ∞ then all the membership values tend to 1/c. = The most common values are 1.25 and 2.0

Adriano Cruz *NCE e IM - UFRJ Cluster 91 Cluster Centre = The centre of cluster i (v_i) is a vector of l characteristics. = Its jth co-ordinate is calculated as

Adriano Cruz *NCE e IM - UFRJ Cluster 92 Detailed Algorithm = Choose c (number of clusters). = Set the error tolerance (ε > 0), the nebulisation factor (m) and the step counter (r = 0). = Arbitrarily set the matrix U(r). Do not forget: each column of U must sum to one, there is no empty cluster and no cluster has all elements.

Adriano Cruz *NCE e IM - UFRJ Cluster 93 Detailed Algorithm cont. Repeat Calculate the centres of the clusters v_i(r) Calculate the distance d_i(r) of each point to the centre of the clusters Generate U(r+1) by recalculating all characteristic functions (How?) Until ||U(r+1) − U(r)|| < ε

Adriano Cruz *NCE e IM - UFRJ Cluster 94 How to recalculate? If any distance is equal to zero then the element belongs to that cluster and to no other. Otherwise the membership grade is a weighted average of the distances to all centres.
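A sketch of the fuzzy c-means loop, assuming the usual update equations: centres v_i = Σ_e u_ie^m x_e / Σ_e u_ie^m and memberships u_ie = 1 / Σ_j (d_ie / d_je)^(2/(m−1)); a small floor on the distances stands in for the zero-distance special case mentioned above, and the stopping test uses the maximum absolute change in U. The data are illustrative.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Fuzzy c-means: returns the membership matrix U (c x n) and the centres."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                              # each column sums to one
    for _ in range(max_iter):
        Um = U ** m
        centres = (Um @ X) / Um.sum(axis=1, keepdims=True)
        d = np.linalg.norm(X[None, :, :] - centres[:, None, :], axis=2)   # c x n distances
        d = np.fmax(d, 1e-12)                       # avoid division by zero
        U_new = 1.0 / ((d[:, None, :] / d[None, :, :]) ** (2.0 / (m - 1))).sum(axis=1)
        if np.max(np.abs(U_new - U)) < eps:         # ||U(r+1) - U(r)|| < eps
            U = U_new
            break
        U = U_new
    return U, centres

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.8]])
U, centres = fuzzy_c_means(X, c=2)
print(np.round(U, 2))
```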

Adriano Cruz *NCE e IM - UFRJ Cluster 95 Example of clustering result

Adriano Cruz *NCE e IM - UFRJ Cluster 96 Crisp K-NN = Supervised clustering method (classification method). = Classes are defined beforehand. = Classes are characterised by sets of elements. = The number of elements may differ among classes. = The main idea is to assign the sample to the class containing most of its neighbours.

Adriano Cruz *NCE e IM - UFRJ Cluster 97 Crisp K-NN = (Figure: patterns w_1 to w_14 grouped into classes 1 to 5, with a sample s.) = With 3 nearest neighbours, sample s is closest to pattern w_6 in class 5.

Adriano Cruz *NCE e IM - UFRJ Cluster 98 Crisp K-NN algorithm = Consider X = {x_1, x_2, ..., x_t} a set of t labelled data. = Each object x_i is defined by l characteristics, x_i = (x_i1, x_i2, ..., x_il). = Input: y, an unclassified element. = k is the number of closest neighbours of y. = E is the set of k nearest neighbours (NN).

Adriano Cruz *NCE e IM - UFRJ Cluster 99 Crisp K-NN algorithm set k {Calculating the NN} for i = 1 to t Calculate the distance from y to x_i; if i <= k then add x_i to E; else if x_i is closer to y than any previous NN then delete the farthest neighbour and include x_i in the set E

Adriano Cruz *NCE e IM - UFRJ Cluster 100 K-NN algorithm cont. Determine the majority class represented in the set E and include y in this class. If there is a draw, then calculate the sum of distances from y to all neighbours in each class involved in the draw; if the sums are different then add y to the class with the smallest sum, else add y to the class where the last minimum was found
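A sketch of the crisp K-NN classification of the last two slides: a majority vote among the k nearest labelled objects, with a draw broken by the smallest sum of distances. Data, labels and the function name are illustrative.

```python
import numpy as np
from collections import Counter

def knn_classify(X, labels, y, k=3):
    """Assign y to the class with the most members among its k nearest neighbours;
    ties are broken by the smallest sum of distances to the tied class's neighbours."""
    d = np.linalg.norm(X - y, axis=1)
    nn = np.argsort(d)[:k]                       # indices of the k nearest neighbours
    votes = Counter(labels[i] for i in nn)
    best = max(votes.values())
    tied = [cls for cls, v in votes.items() if v == best]
    if len(tied) == 1:
        return tied[0]
    sums = {cls: sum(d[i] for i in nn if labels[i] == cls) for cls in tied}
    return min(sums, key=sums.get)               # draw: smallest total distance wins

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [5.3, 5.2]])
labels = np.array([1, 1, 2, 2, 2])
print(knn_classify(X, labels, np.array([4.0, 4.0]), k=3))   # -> 2
```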

Adriano Cruz *NCE e IM - UFRJ Cluster 101 Fuzzy K-NN = The basis of the algorithm is to assign membership as a function of the object’s distance from its K-nearest neighbours and the memberships in the possible classes.

Adriano Cruz *NCE e IM - UFRJ Cluster 102 Fuzzy K-NN = Consider X = {x_1, x_2, ..., x_t} a set of t labelled data. = Each object x_i is defined by l characteristics, x_i = (x_i1, x_i2, ..., x_il). = Input: y, an unclassified element. = k is the number of closest neighbours of y. = E is the set of k nearest neighbours (NN). = μ_i(y) is the membership of y in class i = μ_ij is the membership in the ith class of the jth vector of the labelled set.

Adriano Cruz *NCE e IM - UFRJ Cluster 103 Fuzzy K-NN algorithm set k {Calculating the NN} for i = 1 to t Calculate the distance from y to x_i; if i <= k then add x_i to E; else if x_i is closer to y than any previous NN then delete the farthest neighbour and include x_i in the set E

Adriano Cruz *NCE e IM - UFRJ Cluster 104 Fuzzy K-NN algorithm for i = 1 to c Calculate μ_i(y) using

Adriano Cruz *NCE e IM - UFRJ Cluster 105 KNN+Fuzzy C-Means algorithm = The idea is a two-layer clustering algorithm = First, an unsupervised tracking of cluster centres is made using K-NN rules = The second layer involves one iteration of the fuzzy c-means to compute the membership degrees and the new fuzzy centres. = Ref. N. Zahit et al., Fuzzy Sets and Systems 120 (2001)

Adriano Cruz *NCE e IM - UFRJ Cluster 106 First Layer (K-NN) = Let X = {x_1, …, x_n} be a set of n unlabelled objects. = c is the number of clusters. = The first layer consists of partitioning X into c cells using the first part of K-NN. = Each cell i (1 <= i <= c) is represented as E_i(y_i, K-NN of y_i, G_i) = G_i is the centre of cell E_i and is defined as

Adriano Cruz *NCE e IM - UFRJ Cluster 107 KNN-1FCMA settings = Let X = {x_1, …, x_n} be a set of n unlabelled objects. = Fix c, the number of clusters. = Choose m > 1 (nebulisation factor). = Set k = Integer(n/c − 1). = Let I = {1, 2, …, n} be the set of all indexes of X.

Adriano Cruz *NCE e IM - UFRJ Cluster 108 KNN-1FCMA algorithm step 1 Calculate G_0 for i = 1 to c Search in I for the index of the farthest object y_i from G_(i−1) For j = 1 to n Calculate the distance from y_i to x_j if j <= k then add x_j to E_i else if x_j is closer to y_i than any previous NN then delete the farthest neighbour and include x_j in the set E_i

Adriano Cruz *NCE e IM - UFRJ Cluster 109 KNN-1FCMA algorithm cont. Include y_i in the set E_i. Calculate G_i. Delete the index of y_i and the indexes of the K-NN of y_i from I. if I ≠ ∅ then for each remaining object x determine the minimum distance to any centre G_i of E_i; classify x to the nearest centre; update all centres.

Adriano Cruz *NCE e IM - UFRJ Cluster 110 KNN-1FCMA algorithm step 2 = Compute the matrix U according to = Calculate all fuzzy centres using

Adriano Cruz *NCE e IM - UFRJ Cluster 111 Results KNN-1FCMA = (Table: misclassification rate and average number of iterations of FCMA and KNN-1FCMA on several synthetic data sets (S) and on IRIS; numeric values omitted.)

Adriano Cruz *NCE e IM - UFRJ Cluster 112 Clustering based on Equivalence Relations = A crisp relation R on a universe X can be thought of as a relation from X to X = R is an equivalence relation if it has the following three properties: –Reflexivity: (x_i, x_i) ∈ R –Symmetry: (x_i, x_j) ∈ R implies (x_j, x_i) ∈ R –Transitivity: (x_i, x_j) ∈ R and (x_j, x_k) ∈ R imply (x_i, x_k) ∈ R

Adriano Cruz *NCE e IM - UFRJ Cluster 113 Crisp tolerance relation = R is a tolerance relation if it has the following two properties: –Reflexivity: (x_i, x_i) ∈ R –Symmetry: (x_i, x_j) ∈ R implies (x_j, x_i) ∈ R

Adriano Cruz *NCE e IM - UFRJ Cluster 114 Composition of Relations = Given a relation R from X to Y and a relation S from Y to Z, the composed relation is T = R ° S (from X to Z)

Adriano Cruz *NCE e IM - UFRJ Cluster 115 Composition of Crisp Relations = The operation ° is similar to matrix multiplication.

Adriano Cruz *NCE e IM - UFRJ Cluster 116 Transforming Relations = A tolerance relation can be transformed into an equivalence relation by at most (n − 1) compositions with itself. = n is the cardinality of the set on which R is defined.

Adriano Cruz *NCE e IM - UFRJ Cluster 117 Example of crisp classification = Let X={1,2,3,4,5,6,7,8,9,10} = Let R be defined as the relation “for the identical remainder after dividing each element by 3”. = This relation is an equivalence relation

Adriano Cruz *NCE e IM - UFRJ Cluster 118 Relation Matrix

Adriano Cruz *NCE e IM - UFRJ Cluster 119 Crisp Classification = Consider equivalent columns. = It is possible to group the elements in the following classes = R 0 = {3, 6, 9} = R 1 = {1, 4, 7, 10} = R 2 = {2, 5, 8}

Adriano Cruz *NCE e IM - UFRJ Cluster 120 Clustering and Fuzzy Equivalence Relations = A fuzzy relation R on a universe X can be thought of as a relation from X to X = R is an equivalence relation if it has the following three properties: –Reflexivity: μ_R(x_i, x_i) = 1 –Symmetry: μ_R(x_i, x_j) = μ_R(x_j, x_i) –Transitivity: if μ_R(x_i, x_j) = λ_1 and μ_R(x_j, x_k) = λ_2 then μ_R(x_i, x_k) = λ, where λ >= min(λ_1, λ_2)

Adriano Cruz *NCE e IM - UFRJ Cluster 121 Fuzzy tolerance relation = R is a tolerance relation if it has the following two properties: –Reflexivity: (x_i, x_i) ∈ R –Symmetry: (x_i, x_j) ∈ R implies (x_j, x_i) ∈ R

Adriano Cruz *NCE e IM - UFRJ Cluster 122 Composition of Fuzzy Relations = The operation ° is similar to matrix multiplication.

Adriano Cruz *NCE e IM - UFRJ Cluster 123 Distance Relation = Let X be a set of data on ℝ^l. = The distance function is a tolerance relation that can be transformed into an equivalence relation. = The relation R can be defined by the Minkowski distance formula. = δ is a constant that ensures that R ∈ [0, 1]; it is equal to the inverse of the largest distance in X.

Adriano Cruz *NCE e IM - UFRJ Cluster 124 Example of Fuzzy classification = Let X = {(0,0), (1,1), (2,3), (3,1), (4,0)} be a set of points in ℝ². = Set q = 2 (Euclidean distances). = The largest distance is 4 (between x_1 and x_5), so δ = 0.25. = The relation R can be calculated by the equation
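A sketch of the whole procedure on these five points, assuming the relation is μ_R(x_i, x_j) = 1 − δ·d(x_i, x_j) with δ the inverse of the largest distance (as described above), max-min composition for the transitive closure, and a λ-cut to read off crisp classes. The relation values are computed here rather than copied from the slides, so the cut thresholds may differ slightly from those quoted on the results slide because of rounding.

```python
import numpy as np

X = np.array([[0, 0], [1, 1], [2, 3], [3, 1], [4, 0]], dtype=float)

# Tolerance relation: mu(i, j) = 1 - delta * Euclidean distance, delta = 1 / max distance.
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
delta = 1.0 / d.max()
R = 1.0 - delta * d

def max_min(A, B):
    """Max-min composition of two fuzzy relations."""
    return np.max(np.minimum(A[:, :, None], B[None, :, :]), axis=1)

# Transitive closure: compose R with itself until it stops changing
# (at most n - 1 compositions are needed).
closure = R.copy()
for _ in range(len(X) - 1):
    nxt = max_min(closure, closure)
    if np.allclose(nxt, closure):
        break
    closure = nxt

# A lambda-cut of the equivalence relation yields a crisp partition;
# at lambda = 0.5 this groups x1, x2, x4, x5 together and leaves x3 alone.
lam = 0.5
print((closure >= lam).astype(int))
```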

Adriano Cruz *NCE e IM - UFRJ Cluster 125 Points to be classified

Adriano Cruz *NCE e IM - UFRJ Cluster 126 Tolerance matrix = The matrix calculated by the equation is = This is a tolerance relation that needs to be transformed into an equivalence relation

Adriano Cruz *NCE e IM - UFRJ Cluster 127 Equivalence matrix = The transformed matrix is

Adriano Cruz *NCE e IM - UFRJ Cluster 128 Results of clustering = Taking λ-cuts of the fuzzy equivalence relation at the values λ = 0.44, 0.55, 0.65 and 1.0 we get the following classes: = R_0.44 = [{x_1, x_2, x_3, x_4, x_5}] = R_0.55 = [{x_1, x_2, x_4, x_5}, {x_3}] = R_0.65 = [{x_1, x_2}, {x_3}, {x_4, x_5}] = R_1.0 = [{x_1}, {x_2}, {x_3}, {x_4}, {x_5}]

Adriano Cruz *NCE e IM - UFRJ Cluster 129 Gustafson-Kessel method = This method (GK) is a fuzzy clustering method similar to Fuzzy C-means (FCM). = The difference is the way the distance is calculated. = FCM uses Euclidean distances = GK uses Mahalanobis distances

Adriano Cruz *NCE e IM - UFRJ Cluster 130 Gustafson-Kessel method = The Mahalanobis distance is calculated as = The matrices A_i are given by

Adriano Cruz *NCE e IM - UFRJ Cluster 131 Gustafson-Kessel method = The Fuzzy Covariance Matrix is
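The GK equations on these slides were figures; the sketch below follows the standard Gustafson-Kessel formulation: fuzzy covariance S_i = Σ_e u_ie^m (x_e − v_i)(x_e − v_i)^T / Σ_e u_ie^m, norm-inducing matrix A_i = (ρ_i det S_i)^(1/l) S_i^(−1) with cluster volume ρ_i = 1, and squared distance d_ie² = (x_e − v_i)^T A_i (x_e − v_i). The data and memberships are illustrative; as the next slide notes, S_i must be invertible, which requires enough samples.

```python
import numpy as np

def gk_distances(X, U, m=2.0, rho=1.0):
    """Squared Mahalanobis-like distances used by Gustafson-Kessel.
    X is an n x l data matrix, U a c x n membership matrix."""
    Um = U ** m
    centres = (Um @ X) / Um.sum(axis=1, keepdims=True)     # c x l fuzzy centres
    c = U.shape[0]
    n, l = X.shape
    d2 = np.empty((c, n))
    for i in range(c):
        diff = X - centres[i]                               # n x l
        S = (Um[i][:, None, None] * (diff[:, :, None] * diff[:, None, :])).sum(axis=0)
        S /= Um[i].sum()                                    # fuzzy covariance matrix S_i
        A = (rho * np.linalg.det(S)) ** (1.0 / l) * np.linalg.inv(S)
        d2[i] = np.einsum('nj,jk,nk->n', diff, A, diff)     # (x - v_i)^T A_i (x - v_i)
    return d2, centres

X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [5.0, 5.0], [5.1, 4.8], [4.9, 5.2]])
U = np.array([[0.9, 0.9, 0.9, 0.1, 0.1, 0.1],
              [0.1, 0.1, 0.1, 0.9, 0.9, 0.9]])
print(gk_distances(X, U)[0].round(2))
```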

Adriano Cruz *NCE e IM - UFRJ Cluster 132 GK comments = The clusters are hyperellipsoids in ℝ^l. = The hyperellipsoids have approximately the same size. = In order for S^(−1) to be computable, the number of samples n must be at least equal to the number of dimensions l plus 1.

Adriano Cruz *NCE e IM - UFRJ Cluster 133 Gath-Geva method = It is also known as Gaussian Mixture Decomposition. = It is similar to the FCM method = The Gauss distance is used instead of Euclidean distance. = The clusters do not have a definite shape anymore and have various sizes.

Adriano Cruz *NCE e IM - UFRJ Cluster 134 Gath-Geva Method = The Gauss distance is given by = A_i = S_i^(−1)

Adriano Cruz *NCE e IM - UFRJ Cluster 135 Gath-Geva Method = The term P_i is the probability of a sample belonging to a cluster.

Adriano Cruz *NCE e IM - UFRJ Cluster 136 Gath-Geva Comments = P_i is a parameter that influences the size of a cluster. = Bigger clusters attract more elements. = The exponential term makes it more difficult to avoid local minima. = Usually another clustering method is used to initialise the partition matrix U.

Adriano Cruz *NCE e IM - UFRJ Cluster 137 Cluster Validation = The number of clusters is not always previously known. = In many problems the number of classes is known but it is not the best configuration. = It is necessary to study methods to indicate and/or validate the number of classes.

Adriano Cruz *NCE e IM - UFRJ Cluster 138 Partition Coefficient = This coefficient is defined as

Adriano Cruz *NCE e IM - UFRJ Cluster 139 Partition Coefficient = When F=1/c the system is entirely fuzzy, since every element belongs to all clusters with the same degree of membership = When F=1 the system is rigid and membership values are either 1 or 0.
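The coefficient defined above is presumably Bezdek's partition coefficient F = (1/n) Σ_i Σ_e u_ie², which is 1 for a rigid partition and 1/c for an entirely fuzzy one; a small sketch under that assumption.

```python
import numpy as np

def partition_coefficient(U):
    """Bezdek's partition coefficient of a c x n membership matrix."""
    return float((U ** 2).sum() / U.shape[1])

crisp = np.array([[1, 1, 0, 0],
                  [0, 0, 1, 1]], dtype=float)
fuzzy = np.full((2, 4), 0.5)
print(partition_coefficient(crisp))   # 1.0  (rigid partition)
print(partition_coefficient(fuzzy))   # 0.5  (= 1/c, entirely fuzzy)
```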

Adriano Cruz *NCE e IM - UFRJ Cluster 140 Partition Coefficient Example = The partition matrix for the elements w1, w2, w3 is

Adriano Cruz *NCE e IM - UFRJ Cluster 141 Partition Coefficient Example = The partition matrix for the elements w1, w2, w3, w4 is

Adriano Cruz *NCE e IM - UFRJ Cluster 142 Partition Coefficient comments = F is inversely proportional to the number of clusters. = F is more appropriate for validating the best partition among several runs of an algorithm

Adriano Cruz *NCE e IM - UFRJ Cluster 143 Partition Entropy = Partition Entropy is defined as = When H=0 the partition is rigid. = When H=log(c) the fuzziness is maximum. = 0 <= 1-F <= H
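Assuming the usual definition H = −(1/n) Σ_i Σ_e u_ie log(u_ie) (the formula on the slide was a figure), a small sketch; with the natural logarithm the maximum is log(c).

```python
import numpy as np

def partition_entropy(U, eps=1e-12):
    """Partition entropy of a c x n membership matrix (natural logarithm)."""
    return float(-(U * np.log(U + eps)).sum() / U.shape[1])

crisp = np.array([[1, 1, 0, 0],
                  [0, 0, 1, 1]], dtype=float)
fuzzy = np.full((2, 4), 0.5)
print(partition_entropy(crisp))   # ~0.0            (rigid partition)
print(partition_entropy(fuzzy))   # ~0.693 = log(2) (maximum fuzziness for c = 2)
```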

Adriano Cruz *NCE e IM - UFRJ Cluster 144 Partition Entropy comments = Partition Entropy (H) is directly proportional to the number of partitions. = H is more appropriate for validating the best partition among several runs of an algorithm.

Adriano Cruz *NCE e IM - UFRJ Cluster 145 Compactness and Separation = CS is defined as = J is the objective function minimised by the FCM algorithm. = m is the fuzzy factor. = d_min is the minimum Euclidean distance between the centres of two clusters.

Adriano Cruz *NCE e IM - UFRJ Cluster 146 Compactness and Separation = The minimum distance is defined as = The complete formula is
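The formulas on these two slides were figures; the sketch below follows the Xie-Beni style index CS = Σ_i Σ_e u_ie^m ||x_e − v_i||² / (n · min_{i≠j} ||v_i − v_j||²), i.e. the FCM objective J divided by n times the squared minimum distance between cluster centres (lower values indicate compact, well separated clusters). The exact exponent and normalisation used on the slides may differ.

```python
import numpy as np

def compactness_separation(X, U, centres, m=2.0):
    """Xie-Beni style compactness-and-separation index (lower is better)."""
    d2 = np.linalg.norm(X[None, :, :] - centres[:, None, :], axis=2) ** 2   # c x n
    J = ((U ** m) * d2).sum()                     # FCM objective function
    c = len(centres)
    d_min2 = min(np.sum((centres[i] - centres[j]) ** 2)
                 for i in range(c) for j in range(c) if i != j)
    return J / (len(X) * d_min2)

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centres = np.array([[0.1, 0.05], [5.05, 4.95]])
U = np.array([[0.95, 0.95, 0.05, 0.05],
              [0.05, 0.05, 0.95, 0.95]])
print(compactness_separation(X, U, centres))
```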

Adriano Cruz *NCE e IM - UFRJ Cluster 147 Compactness and Separation = This is a very complete validation measure. = It validates the number of clusters and checks the separation among clusters. = From our experiments it works well even when the degree of superposition is high.