Clustering: Introduction
Adriano Joaquim de O Cruz, NCE e IM, UFRJ, ©2002
Introduction
What is cluster analysis?
The process of grouping a set of physical or abstract objects into classes of similar objects. The class label of each class is unknown. Classification, in contrast, separates objects into classes whose labels are known.
What is cluster analysis? (cont.)
Clustering is a form of learning by observation, whereas neural networks learn by example. Clustering is unsupervised learning.
Applications
In business, clustering helps discover distinct groups of customers. In data mining, it is used to gain insight into the distribution of data and to observe the characteristics of each cluster. It also serves as a pre-processing step for classification and is used in pattern recognition.
Requirements
Scalability: work with large databases.
Ability to deal with different types of attributes (not only interval-based data).
Clusters of arbitrary shape, not only spherical.
Minimal requirements for domain knowledge.
Ability to deal with noisy data.
Requirements (cont.)
Insensitivity to the order of input records.
Ability to work with samples of high dimensionality.
Support for constraint-based clustering.
Interpretability and usability: results should be easily interpretable.
Sensitivity to Input Order
Some algorithms are sensitive to the order in which the input data are presented; the Leader algorithm is an example. [Figure: the same data set clustered in two different ways, marked with ellipses and triangles, depending on the input order.]
Clustering Techniques
Heuristic Clustering Techniques
Incomplete or heuristic clustering: geometrical methods or projection techniques. Dimension-reduction techniques (e.g. PCA) are used to obtain a graphical representation in two or three dimensions. Heuristic methods based on visualisation are then used to determine the clusters.
Deterministic Crisp Clustering
Each datum is assigned to exactly one cluster. Each cluster partition therefore defines an ordinary partition of the data set.
Overlapping Crisp Clustering
Each datum is assigned to at least one cluster, so elements may belong to more than one cluster at the same time.
Probabilistic Clustering
For each element, a probability distribution over the clusters is determined; the distribution specifies the probability with which a datum is assigned to each cluster. If the values are interpreted as degrees of membership, these become fuzzy clustering techniques.
Possibilistic Clustering
Degrees of membership or possibility indicate to what extent a datum belongs to each cluster. Possibilistic cluster analysis drops the constraint that the memberships of each datum across all clusters must sum to one.
Hierarchical Clustering
Descending techniques divide the data into ever finer-grained classes. Ascending techniques combine small classes into coarser-grained ones.
Objective Function Clustering
An objective function assigns to each cluster partition a value that has to be optimised. This turns clustering into a strict optimisation problem.
Data Types
Data Types
Interval-scaled variables are continuous measurements on a linear scale, e.g. height, weight, temperature. Binary variables have only two states, e.g. smoker, fever, client, owner. Nominal variables are a generalisation of binary variables to m states, e.g. map colour, marital status.
Data Types (cont.)
Ordinal variables are nominal variables with an ordering, e.g. Olympic medals, professional ranks. Ratio-scaled variables have a non-linear scale, e.g. the growth of a bacterial population.
Interval-scaled Variables
Interval-scaled variables are continuous measurements on a linear scale, e.g. height, weight, temperature. Their values depend on the units used, and the choice of measurement unit can affect the analysis, so standardisation should be applied.
Problems
Person  Age (yr)  Height (cm)
A       35        190
B       40        190
C       35        160
D       40        160
Standardisation
Standardisation converts the original measurements to unitless values and attempts to give all variables equal weight. It is useful when there is no prior knowledge of the data.
Standardisation Algorithm
Z-scores indicate how far, and in what direction, an item deviates from its distribution's mean, expressed in units of the distribution's standard deviation. The transformed scores have a mean of zero and a standard deviation of one. Z-scores are useful when comparing the relative standing of items from distributions with different means and/or standard deviations.
Standardisation Algorithm
Consider n values of a variable x.
1. Calculate the mean value: m = (1/n) * sum_i x_i
2. Calculate the standard deviation: s = sqrt( (1/n) * sum_i (x_i - m)^2 )
3. Calculate the z-score of each value: z_i = (x_i - m) / s
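The three steps can be sketched in plain Python, here applied to the age/height example from these slides (note this uses the population standard deviation; some texts divide by n-1 instead):

```python
import math

def z_scores(values):
    """Standardise a list of numbers to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n                                      # step 1: mean
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)   # step 2: std (population)
    return [(v - mean) / std for v in values]                   # step 3: z-scores

heights = [190, 190, 160, 160]   # persons A, B, C, D
ages = [35, 40, 35, 40]
print(z_scores(heights))  # [1.0, 1.0, -1.0, -1.0]
print(z_scores(ages))     # [-1.0, 1.0, -1.0, 1.0]
```

After standardisation both variables vary over the same range, so neither dominates the distance computation.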
Z-scores Example
[Figure: worked z-score example]
Real Heights and Ages Charts
[Figure: charts of the raw heights and ages]
Z-scores for Heights and Ages
[Figure: the same data after z-score standardisation]
Data Chart
[Figure: data chart]
Similarities
Data Matrices
A data matrix represents n objects with p characteristics each, e.g. person = {age, sex, income, ...}. A dissimilarity matrix represents the collection of dissimilarities between all pairs of objects.
Dissimilarities
A dissimilarity measures some form of distance between objects, and clustering algorithms use dissimilarities to group the data. How can dissimilarities be measured?
How to Calculate Dissimilarities?
The most popular methods are based on the distance between pairs of objects. The Minkowski distance is
d(x_i, x_j) = ( sum_{k=1..p} |x_ik - x_jk|^q )^(1/q)
where p is the number of characteristics and q is the distance order: q = 2 gives the Euclidean distance, q = 1 the Manhattan distance.
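The Minkowski distance translates directly into Python (a sketch; function names are illustrative):

```python
def minkowski(x, y, q):
    """Minkowski distance of order q between two p-dimensional points."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def manhattan(x, y):   # q = 1
    return minkowski(x, y, 1)

def euclidean(x, y):   # q = 2
    return minkowski(x, y, 2)

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7.0
```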
Similarities
It is also possible to work with similarities s(x_i, x_j), where
0 <= s(x_i, x_j) <= 1, s(x_i, x_i) = 1, s(x_i, x_j) = s(x_j, x_i).
A dissimilarity can then be defined as d(x_i, x_j) = 1 - s(x_i, x_j).
Distances
Dissimilarities
There are other ways to obtain dissimilarities, so we no longer speak of distances. Basically, dissimilarities are non-negative numbers d(i, j) that are small (close to 0) when i and j are similar.
Pearson
The Pearson product-moment correlation between variables f and g is
R(f, g) = sum_i (x_if - m_f)(x_ig - m_g) / sqrt( sum_i (x_if - m_f)^2 * sum_i (x_ig - m_g)^2 )
where m_f and m_g are the means of f and g. The coefficients lie between -1 and +1.
Pearson (cont.)
A correlation of +1 means a perfect positive linear relationship between the variables; a correlation of -1 means a perfect negative linear relationship; a correlation of 0 means there is no linear relationship between the two variables.
Pearson Example
r_yz = ; r_yw = ; r_yr =
Correlation and Dissimilarities 1
d(f, g) = (1 - R(f, g)) / 2    (1)
Variables with a high positive correlation (+1) receive a dissimilarity close to 0; variables with a strongly negative correlation are considered very dissimilar.
Correlation and Dissimilarities 2
d(f, g) = 1 - |R(f, g)|    (2)
Variables with a high positive correlation (+1) and variables with a strongly negative correlation (-1) both receive a dissimilarity close to 0.
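Both correlation-based dissimilarities, equations (1) and (2), can be sketched in Python (the sample vectors are illustrative):

```python
import math

def pearson(f, g):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(f)
    mf, mg = sum(f) / n, sum(g) / n
    cov = sum((a - mf) * (b - mg) for a, b in zip(f, g))
    sf = math.sqrt(sum((a - mf) ** 2 for a in f))
    sg = math.sqrt(sum((b - mg) ** 2 for b in g))
    return cov / (sf * sg)

def diss1(f, g):
    """Equation (1): only positive correlation counts as similar."""
    return (1 - pearson(f, g)) / 2

def diss2(f, g):
    """Equation (2): any strong linear relationship counts as similar."""
    return 1 - abs(pearson(f, g))

y = [1, 2, 3, 4]
z = [2, 4, 6, 8]   # perfectly positively correlated with y
w = [8, 6, 4, 2]   # perfectly negatively correlated with y
print(round(pearson(y, z), 3), round(pearson(y, w), 3))  # 1.0 -1.0
print(round(diss1(y, w), 3))  # 1.0  -> very dissimilar under (1)
print(round(diss2(y, w), 3))  # 0.0  -> very similar under (2)
```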
Numerical Example
[Table: Weight, Height, birth Month and Year for Ilan, Jack, Kim, Lieve, Leon, Peter, Talia and Tina (numeric values missing)]
Numerical Example 1
[Table: correlation matrix of Weight, Height, Month and Year (diagonal = 1), and the corresponding dissimilarity matrices using equations (1) and (2) (diagonal = 0); off-diagonal values missing]
Binary Variables
Binary variables have only two states, which can be symmetric or asymmetric. A binary variable is symmetric if both states are equally valuable, e.g. gender. When the states are not equally important, the variable is asymmetric, e.g. disease tests (1 = positive, 0 = negative).
Contingency Tables
Consider objects described by p binary variables. For a pair of objects i and j:
q variables are 1 on both i and j;
r variables are 1 on i and 0 on j;
s variables are 0 on i and 1 on j;
t variables are 0 on both i and j.
Symmetric Variables
Dissimilarity based on symmetric variables is invariant: the result should not change when the two states are interchanged. The simple matching coefficient gives
dissimilarity: d(i, j) = (r + s) / (q + r + s + t)
similarity: s(i, j) = (q + t) / (q + r + s + t) = 1 - d(i, j)
Asymmetric Variables
Similarity based on asymmetric variables is not invariant: two 1s are considered more important than two 0s, so the count t is ignored. The Jaccard coefficient gives
d(i, j) = (r + s) / (q + r + s)
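Both coefficients can be sketched by counting q, r, s and t from two 0/1 vectors; the sample vectors encode the Jack/Mary symptom table on the next slide (Y/P as 1, N as 0):

```python
def contingency(i, j):
    """Counts (q, r, s, t) for two binary vectors i and j."""
    q = sum(1 for a, b in zip(i, j) if a == 1 and b == 1)
    r = sum(1 for a, b in zip(i, j) if a == 1 and b == 0)
    s = sum(1 for a, b in zip(i, j) if a == 0 and b == 1)
    t = sum(1 for a, b in zip(i, j) if a == 0 and b == 0)
    return q, r, s, t

def simple_matching_diss(i, j):
    """Symmetric case: the two 0s (t) count as a match."""
    q, r, s, t = contingency(i, j)
    return (r + s) / (q + r + s + t)

def jaccard_diss(i, j):
    """Asymmetric case: the two 0s (t) are ignored."""
    q, r, s, t = contingency(i, j)
    return (r + s) / (q + r + s)

jack = [1, 0, 1, 0, 0, 0]   # Fever, Cough, Test1..Test4
mary = [1, 0, 1, 0, 1, 0]
print(round(jaccard_diss(jack, mary), 2))  # 0.33
```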
Computing Dissimilarities
Symptom  Jack  Mary  q(1,1)  r(1,0)  s(0,1)  t(0,0)
Fever    Y     Y     1       0       0       0
Cough    N     N     0       0       0       1
Test1    P     P     1       0       0       0
Test2    N     N     0       0       0       1
Test3    N     P     0       0       1       0
Test4    N     N     0       0       0       1
Totals: q = 2, r = 0, s = 1, t = 3
Computing Dissimilarities
Jim and Mary have the highest dissimilarity value, so they have a low probability of having the same disease.
Nominal Variables
A nominal variable is a generalisation of a binary variable: it can take more than two states, e.g. marital status (married, single, divorced). Each state can be represented by a number or a letter, and there is no specific ordering.
Computing Dissimilarities
Consider two objects i and j, each described by p nominal variables, and let m be the number of variables on which they match. Then
d(i, j) = (p - m) / p
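The matching-based dissimilarity is a one-liner in practice (the example objects are illustrative):

```python
def nominal_diss(i, j):
    """d(i, j) = (p - m) / p, where m is the number of matching states."""
    p = len(i)
    m = sum(1 for a, b in zip(i, j) if a == b)
    return (p - m) / p

a = ["married", "red", "A"]
b = ["single", "red", "A"]   # matches on 2 of 3 variables
print(round(nominal_diss(a, b), 3))  # 0.333
```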
Binarising Nominal Variables
A nominal variable can be encoded by creating a new binary variable for each state. Example: marital status = {married, single, divorced} becomes three binary variables, married (1 = yes, 0 = no), single (1 = yes, 0 = no) and divorced (1 = yes, 0 = no). For marital status = married: married = 1, single = 0, divorced = 0.
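This encoding (often called one-hot encoding) can be sketched as:

```python
def binarise(value, states):
    """Create one new binary variable per state of a nominal variable."""
    return {state: int(value == state) for state in states}

states = ["married", "single", "divorced"]
print(binarise("married", states))
# {'married': 1, 'single': 0, 'divorced': 0}
```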
Ordinal Variables
A discrete ordinal variable is similar to a nominal variable, except that its states are ordered in a meaningful sequence, e.g. bronze, silver and gold medals, or assistant, associate and full member.
Computing Dissimilarities
Consider n objects defined by a set of ordinal variables. Let f be one of these variables, with M_f states; the states define the ranking r_f in {1, ..., M_f}.
Steps to Calculate Dissimilarities
Assume that the value of f for the i-th object is x_if.
1. Replace each x_if by its corresponding rank r_if in {1, ..., M_f}.
2. Since the number of states differs between variables, it is often necessary to map the ranks onto [0.0, 1.0] using
   z_if = (r_if - 1) / (M_f - 1)
3. Dissimilarity can then be computed using the distance measures of interval-scaled variables.
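The rank-mapping step can be sketched as follows, using the medals example (bronze = 1, silver = 2, gold = 3):

```python
def ordinal_to_interval(rank, m_states):
    """Map a rank r in {1, .., M} onto [0.0, 1.0] via z = (r - 1) / (M - 1)."""
    return (rank - 1) / (m_states - 1)

# bronze, silver, gold
print([ordinal_to_interval(r, 3) for r in (1, 2, 3)])  # [0.0, 0.5, 1.0]
```

After this step the variable can be fed to any interval-scaled distance, such as the Minkowski distance above.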
Ratio-scaled Variables
These are variables on a non-linear scale, such as an exponential one. There are three methods to compute dissimilarities:
1. Treat them as interval-scaled (not always good).
2. Apply a transformation such as y = log(x) and treat the result as interval-scaled.
3. Treat them as ordinal data and take the ranks as interval-scaled.
Variables of Mixed Types
One technique is to bring all variables onto a common scale, the interval [0.0, 1.0]. Suppose that the data set contains p variables of mixed type. The dissimilarity between objects i and j is then
d(i, j) = sum_{f=1..p} delta_ij^(f) * d_ij^(f) / sum_{f=1..p} delta_ij^(f)
where delta_ij^(f) is 0 if x_if or x_jf is missing, and 1 otherwise. The contribution d_ij^(f) of each variable depends on its type:
f binary or nominal: d_ij^(f) = 0 if x_if = x_jf, and 1 otherwise;
f interval-based: d_ij^(f) = |x_if - x_jf| / (max_h x_hf - min_h x_hf);
f ordinal or ratio-scaled: compute the ranks, map them onto [0.0, 1.0], and treat them as interval-based.
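The combined formula can be sketched as follows (the variable types, ranges and example objects are illustrative; ordinal and ratio-scaled values are assumed to have already been converted to ranks on [0, 1]):

```python
def mixed_diss(i, j, types, ranges):
    """Combined dissimilarity over mixed-type variables: the average of the
    per-variable contributions d_ij^(f), skipping missing values (delta = 0)."""
    total, weight = 0.0, 0
    for a, b, t, rng in zip(i, j, types, ranges):
        if a is None or b is None:          # delta_ij^(f) = 0: skip variable
            continue
        if t in ("binary", "nominal"):
            d = 0.0 if a == b else 1.0
        else:                               # interval-based contribution
            d = abs(a - b) / rng            # rng = max - min over all objects
        total += d
        weight += 1
    return total / weight

i = (1, "red", 170)
j = (0, "red", 180)
types = ("binary", "nominal", "interval")
ranges = (1, 1, 50)   # only used for the interval variable
print(round(mixed_diss(i, j, types, ranges), 2))  # 0.4
```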
Clustering Methods
Classification Types
Clustering is an unsupervised method.
Clustering Methods
Partitioning, hierarchical, density-based, grid-based and model-based methods.
Partitioning Methods
Given n objects, k partitions are created, each containing at least one element. An iterative relocation technique is used to improve the partitioning; distance is the usual criterion.
Partitioning Methods (cont.)
They work well for finding spherical-shaped clusters, but they are not efficient on very large databases. In k-means each cluster is represented by the mean value of its objects; in k-medoids each cluster is represented by an object near the centre of the cluster.
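The iterative relocation idea behind k-means can be sketched in a few lines (a minimal illustration only; real implementations add convergence checks and better initialisation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means: assign each point to the nearest mean, recompute means."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)          # initialise with k random points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                     # relocation: nearest centre wins
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2
                                        for a, b in zip(p, centres[c])))
            clusters[idx].append(p)
        # recompute each centre as the mean of its cluster
        centres = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl
                   else centres[c]
                   for c, cl in enumerate(clusters)]
    return centres, clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centres, clusters = kmeans(pts, 2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```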
Hierarchical Methods
These create a hierarchical decomposition of the data set. Agglomerative approaches start with each object forming a separate group and merge objects or groups until all objects belong to one group or a termination condition occurs. Divisive approaches start with all objects in the same cluster; each successive iteration splits a cluster until every object is in a separate cluster or a termination condition occurs.
Hierarchical Clustering (cont.)
Definition of cluster proximity:
Min: distance between the most similar pair of objects (sensitive to noise).
Max: distance between the most dissimilar pair of objects (tends to break large clusters).
Density-based Methods
These methods keep growing a cluster as long as the density in its neighbourhood exceeds some threshold. They are able to find clusters of arbitrary shape.
Grid-based Methods
Grid methods divide the object space into a finite number of cells, forming a grid-like structure. Cells that contain more than a certain number of elements are treated as dense, and dense cells are connected to form clusters. Processing time is fast and independent of the number of objects. STING and CLIQUE are examples.
Model-based Methods
Model-based methods hypothesise a model for each cluster and find the best fit of the data to the given model. Examples include statistical models and SOM networks.
Partition Methods
Given a database of n objects, a partition method organises them into k clusters (k <= n). The methods try to minimise an objective function, such as distance, so that similar objects are "close" to each other.