Clustering: Introduction
Adriano Joaquim de O Cruz ©2002 NCE/UFRJ
adriano@nce.ufrj.br
Introduction
What is cluster analysis?
Cluster analysis is the process of grouping a set of physical or abstract objects into classes of similar objects. The class labels are not known in advance. Classification, by contrast, assigns objects to classes whose labels are already known.
What is cluster analysis? (cont.)
Clustering is a form of learning by observation, i.e. unsupervised learning, whereas neural networks typically learn from labelled examples.
Applications
In business, clustering helps to discover distinct groups of customers. In data mining it is used to gain insight into the distribution of the data and to observe the characteristics of each cluster. It also serves as a pre-processing step for classification and in pattern recognition.
Requirements
Scalability: work with large databases.
Ability to deal with different types of attributes (not only interval-based data).
Discovery of clusters of arbitrary shape, not only spherical ones.
Minimal requirements for domain knowledge.
Ability to deal with noisy data.
Requirements (cont.)
Insensitivity to the order of input records.
Ability to handle samples of high dimensionality.
Constraint-based clustering.
Interpretability and usability: results should be easily interpretable.
Sensitivity to Input Order
Some algorithms are sensitive to the order in which the input data are presented; the leader algorithm is an example. (Figure omitted: the same six points, presented in order 2 1 3 5 4 6, are grouped into an ellipse-shaped cluster; presented in order 1 2 6 4 5 3, into a triangle-shaped one.) A minimal sketch of the leader algorithm follows.
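The following is a minimal Python sketch of the leader algorithm, assuming Euclidean distance and a fixed distance threshold (the function name and the example values are illustrative, not from the slides). It shows how the resulting partition can depend on the presentation order.

```python
from math import dist  # Euclidean distance (Python >= 3.8)

def leader_clustering(points, threshold):
    """Single pass over the data: the first point becomes a leader;
    each later point joins the first leader within `threshold`,
    otherwise it starts a new cluster."""
    leaders, clusters = [], []
    for p in points:
        for i, leader in enumerate(leaders):
            if dist(p, leader) <= threshold:
                clusters[i].append(p)
                break
        else:  # no leader close enough: p leads a new cluster
            leaders.append(p)
            clusters.append([p])
    return clusters

pts = [(0, 0), (1, 0), (2, 0)]
print(leader_clustering(pts, 1.5))                       # [[(0, 0), (1, 0)], [(2, 0)]]
print(leader_clustering([(1, 0), (0, 0), (2, 0)], 1.5))  # [[(1, 0), (0, 0), (2, 0)]]
```

The same three points yield two clusters or one, depending only on the order of presentation.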
Clustering Techniques
Heuristic Clustering Techniques
Incomplete or heuristic clustering relies on geometrical methods or projection techniques. Dimension-reduction techniques (e.g. PCA) are used to obtain a graphical representation in two or three dimensions, and heuristic methods based on visualisation are then used to determine the clusters.
Deterministic Crisp Clustering
Each datum is assigned to exactly one cluster, so each cluster partition defines an ordinary partition of the data set.
Overlapping Crisp Clustering
Each datum is assigned to at least one cluster, and elements may belong to more than one cluster at the same time.
Probabilistic Clustering
For each element a probability distribution over the clusters is determined, specifying the probability with which a datum is assigned to each cluster. If the probabilities are interpreted as degrees of membership, these become fuzzy clustering techniques.
Possibilistic Clustering
Degrees of membership (possibility) indicate to what extent a datum belongs to each cluster. Possibilistic cluster analysis drops the constraint that the memberships of a datum over all clusters must sum to one.
Hierarchical Clustering
Descending (divisive) techniques divide the data into ever more fine-grained classes; ascending (agglomerative) techniques combine small classes into coarser ones.
Objective Function Clustering
An objective function assigns to each cluster partition a value that has to be optimised, making this strictly an optimisation problem.
Data Types
Data Types
Interval-scaled variables are continuous measurements on a linear scale, e.g. height, weight, temperature.
Binary variables have only two states, e.g. smoker, fever, client, owner.
Nominal variables are a generalisation of binary variables with m states, e.g. map colour, marital status.
Data Types (cont.)
Ordinal variables are ordered nominal variables, e.g. Olympic medals, professional ranks.
Ratio-scaled variables have a non-linear scale, e.g. the growth of a bacterial population.
Interval-scaled Variables
Interval-scaled variables are continuous measurements on a linear scale, e.g. height, weight, temperature. Their values depend on the units used, and the choice of measurement unit can affect the analysis, so standardisation should be applied.
Problems
With age in years and height in centimetres, the height differences dominate any distance computation:

Person  Age (yr)  Height (cm)
A       35        190
B       40        190
C       35        160
D       40        160
Standardisation
Standardisation converts the original measurements to unitless values and attempts to give all variables equal weight. It is useful when there is no prior knowledge of the data.
Standardisation Algorithm
Z-scores indicate how far, and in what direction, an item deviates from its distribution's mean, expressed in units of the distribution's standard deviation. The transformed scores have a mean of zero and a standard deviation of one. Z-scores are useful when comparing the relative standings of items from distributions with different means and/or different standard deviations.
Standardisation Algorithm (cont.)
Consider n values of a variable x. Calculate the mean, the standard deviation, and then the z-score of each value:

m_x = (1/n) * sum_i x_i
s_x = sqrt( (1/n) * sum_i (x_i - m_x)^2 )
z_i = (x_i - m_x) / s_x

A sketch of this computation is given below.
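A minimal Python sketch of the z-score computation, applied to the age/height example above (the variable names are illustrative):

```python
import statistics

def z_scores(values):
    """Standardise measurements to zero mean and unit standard
    deviation (population formula, dividing by n)."""
    m = statistics.fmean(values)          # mean
    s = statistics.pstdev(values, mu=m)   # population standard deviation
    return [(v - m) / s for v in values]

ages = [35, 40, 35, 40]
heights = [190, 190, 160, 160]
print(z_scores(ages))     # [-1.0, 1.0, -1.0, 1.0]
print(z_scores(heights))  # [1.0, 1.0, -1.0, -1.0]
```

After standardisation the two variables carry comparable weight in any distance computation.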
(Charts omitted: z-scores example; real heights and ages; z-scores for heights and ages; data charts.)
Similarities
Data Matrices
A data matrix represents n objects by p characteristics, e.g. person = {age, sex, income, ...}. A dissimilarity matrix represents the collection of dissimilarities between all pairs of objects.
Dissimilarities
A dissimilarity measures some form of distance between objects, and clustering algorithms use dissimilarities to cluster the data. How can dissimilarities be measured?
How to calculate dissimilarities?
The most popular methods are based on the distance between pairs of objects. The Minkowski distance is

d(x_i, x_j) = ( sum_{k=1..p} |x_ik - x_jk|^q )^(1/q)

where p is the number of characteristics and q is the distance type: q = 2 gives the Euclidean distance, q = 1 the Manhattan distance. A sketch is shown below.
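A minimal Python sketch of the Minkowski distance as defined above:

```python
def minkowski(x, y, q=2):
    """Minkowski distance between two p-dimensional points.
    q=2 is the Euclidean distance, q=1 the Manhattan distance."""
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

a, b = (0, 0), (3, 4)
print(minkowski(a, b, q=2))  # 5.0 (Euclidean)
print(minkowski(a, b, q=1))  # 7.0 (Manhattan)
```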
Similarities
It is also possible to work with similarities s(x_i, x_j), where

0 <= s(x_i, x_j) <= 1
s(x_i, x_i) = 1
s(x_i, x_j) = s(x_j, x_i)

and to take d(x_i, x_j) = 1 - s(x_i, x_j) as the dissimilarity.
Distances
(Figure omitted: illustration of the distance measures.)
Dissimilarities (cont.)
There are other ways to obtain dissimilarities, in which case we no longer speak of distances. Basically, dissimilarities are nonnegative numbers d(i, j) that are small (close to 0) when i and j are similar.
Pearson
The Pearson product-moment correlation between variables f and g is

R(f, g) = sum_i (x_if - m_f)(x_ig - m_g) / sqrt( sum_i (x_if - m_f)^2 * sum_i (x_ig - m_g)^2 )

where m_f and m_g are the means of f and g. The coefficients lie between -1 and +1.
Pearson (cont.)
A correlation of +1 means a perfect positive linear relationship between the variables; a correlation of -1 means a perfect negative linear relationship; a correlation of 0 means there is no linear relationship between the two variables.
Pearson example
For the example series (chart omitted): r_yz = 0.9861, r_yw = -0.9551, r_yr = 0.2770.
Correlation and Dissimilarities (1)
d(f, g) = (1 - R(f, g)) / 2    (1)

Variables with a high positive correlation (+1) receive a dissimilarity close to 0; variables with a strongly negative correlation are considered very dissimilar.
Correlation and Dissimilarities (2)
d(f, g) = 1 - |R(f, g)|    (2)

With this formula, variables with a high positive (+1) or a strongly negative (-1) correlation both receive a dissimilarity close to 0. A sketch of both conversions follows.
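A minimal Python sketch of formulas (1) and (2), using the standard-library Pearson correlation (Python >= 3.10); the function names are illustrative:

```python
from statistics import correlation  # Pearson r

def diss_signed(f, g):
    """Formula (1): opposite trends count as dissimilar."""
    return (1 - correlation(f, g)) / 2

def diss_absolute(f, g):
    """Formula (2): any strong linear relation counts as similar."""
    return 1 - abs(correlation(f, g))

f = [1, 2, 3, 4, 5]
g = [10, 8, 6, 4, 2]        # perfectly negatively correlated with f
print(diss_signed(f, g))    # 1.0 -> maximally dissimilar under (1)
print(diss_absolute(f, g))  # 0.0 -> maximally similar under (2)
```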
Numerical Example
Eight people described by four variables: weight, height, and month and year of birth (weight presumably in kg and height in cm).

Name   Weight  Height  Month  Year
Ilan   15      95      1      82
Jack   49      156     5      55
Kim    13      95      11     81
Lieve  45      160     7      56
Leon   85      178     6      48
Peter  66      176     6      56
Talia  12      90      12     83
Tina   10      78      1      84
Numerical Example (cont.)
Correlations between the four variables (lower triangle):

        Weight  Height  Month   Year
Weight  1
Height  0.957   1
Month   -0.036  0.021   1
Year    -0.953  -0.985  0.013   1

Dissimilarities with formula (1), d = (1 - R)/2:

        Weight  Height  Month   Year
Weight  0
Height  0.021   0
Month   0.518   0.489   0
Year    0.977   0.992   0.493   0

Dissimilarities with formula (2), d = 1 - |R|:

        Weight  Height  Month   Year
Weight  0
Height  0.043   0
Month   0.964   0.979   0
Year    0.047   0.015   0.987   0
Binary Variables
Binary variables have only two states, which can be symmetric or asymmetric. A binary variable is symmetric if both states are equally valuable, e.g. gender. When the states are not equally important the variable is asymmetric, e.g. disease tests (1 = positive, 0 = negative).
Contingency Tables
Consider objects described by p binary variables. For a pair of objects i and j:
q variables are 1 for both i and j;
r variables are 1 for i and 0 for j;
s variables are 0 for i and 1 for j;
t variables are 0 for both;
so that p = q + r + s + t.
Symmetric Variables
Dissimilarity based on symmetric variables is invariant: the result does not change when the two states of a variable are interchanged. The simple matching coefficients are

d(i, j) = (r + s) / (q + r + s + t)    (dissimilarity)
s(i, j) = (q + t) / (q + r + s + t) = 1 - d(i, j)    (similarity)
Asymmetric Variables
Similarity based on asymmetric variables is not invariant: two matching ones are more important than two matching zeros, so t is dropped. The Jaccard coefficient is

d(i, j) = (r + s) / (q + r + s)
Computing Dissimilarities
Comparing the patients Jack and Mary over six asymmetric binary variables (Y and P coded as 1, N as 0):

Variable  Jack  Mary  q (1,1)  r (1,0)  s (0,1)  t (0,0)
Fever     Y     Y     1        0        0        0
Cough     N     N     0        0        0        1
Test1     P     P     1        0        0        0
Test2     N     N     0        0        0        1
Test3     N     P     0        0        1        0
Test4     N     N     0        0        0        1
Totals:               q = 2    r = 0    s = 1    t = 3

With the Jaccard coefficient, d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 1/3 ≈ 0.33. A sketch of this computation follows.
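A minimal Python sketch of the contingency counts and the Jaccard dissimilarity for the Jack/Mary comparison above:

```python
def jaccard_dissimilarity(x, y):
    """Jaccard dissimilarity for two asymmetric binary vectors
    (0/1 sequences of equal length): d = (r + s) / (q + r + s)."""
    q = sum(a == 1 and b == 1 for a, b in zip(x, y))  # 1-1 matches
    r = sum(a == 1 and b == 0 for a, b in zip(x, y))  # 1-0 mismatches
    s = sum(a == 0 and b == 1 for a, b in zip(x, y))  # 0-1 mismatches
    return (r + s) / (q + r + s)                      # t is ignored

jack = [1, 0, 1, 0, 0, 0]  # Fever, Cough, Test1..Test4 (Y/P -> 1, N -> 0)
mary = [1, 0, 1, 0, 1, 0]
print(jaccard_dissimilarity(jack, mary))  # 0.333...
```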
Computing Dissimilarities (cont.)
Jim and Mary have the highest dissimilarity value among the pairs of patients, so they have a low probability of having the same disease.
Nominal Variables
A nominal variable is a generalisation of the binary variable: it can take more than two states, e.g. marital status (married, single, divorced). Each state can be represented by a number or a letter, and there is no specific ordering.
Computing Dissimilarities for Nominal Variables
Consider two objects i and j, each described by p nominal characteristics, and let m be the number of characteristics on which they match. Then

d(i, j) = (p - m) / p

A sketch of this rule appears below.
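A minimal Python sketch of the matching rule above (the example states are illustrative):

```python
def nominal_dissimilarity(i, j):
    """Simple matching over p nominal characteristics:
    d = (p - m) / p, where m is the number of matches."""
    p = len(i)
    m = sum(a == b for a, b in zip(i, j))
    return (p - m) / p

a = ("married", "blue", "brazil")
b = ("single", "blue", "brazil")
print(nominal_dissimilarity(a, b))  # 0.333... (2 matches out of 3)
```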
Binarising Nominal Variables
A nominal variable can be encoded by creating a new binary variable for each state. Example: marital status = {married, single, divorced} becomes three binary variables, married (1 = yes, 0 = no), single and divorced. A married person is then encoded as married = 1, single = 0, divorced = 0, as in the sketch below.
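A minimal Python sketch of this encoding (often called one-hot encoding):

```python
def binarise(value, states):
    """Encode one nominal value as one binary variable per state."""
    return {state: int(value == state) for state in states}

states = ["married", "single", "divorced"]
print(binarise("married", states))
# {'married': 1, 'single': 0, 'divorced': 0}
```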
Ordinal Variables
A discrete ordinal variable is similar to a nominal variable, except that its states are ordered in a meaningful sequence, e.g. bronze, silver and gold medals, or assistant, associate and full ranks.
Computing Dissimilarities for Ordinal Variables
Consider n objects defined by a set of ordinal variables. Let f be one of these variables, with M_f states; these states define the ranking r_f in {1, …, M_f}.
Steps to Calculate Dissimilarities
Assume the value of f for the i-th object is x_if. Replace each x_if by its corresponding rank r_if in {1, …, M_f}. Since the number of states differs from variable to variable, it is often necessary to map each range onto [0.0, 1.0] using

z_if = (r_if - 1) / (M_f - 1)

Dissimilarity can then be computed using the distance measures for interval-scaled variables, as sketched below.
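A minimal Python sketch of the rank mapping above, using the medal example:

```python
def ordinal_to_interval(rank, m_states):
    """Map a rank in {1, ..., M_f} onto [0.0, 1.0]:
    z = (r - 1) / (M_f - 1)."""
    return (rank - 1) / (m_states - 1)

medals = {"bronze": 1, "silver": 2, "gold": 3}   # ranks, M_f = 3
print(ordinal_to_interval(medals["bronze"], 3))  # 0.0
print(ordinal_to_interval(medals["silver"], 3))  # 0.5
print(ordinal_to_interval(medals["gold"], 3))    # 1.0
```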
Ratio-scaled Variables
Ratio-scaled variables lie on a non-linear scale, such as an exponential one. To compute dissimilarities there are three options:
treat them as interval-scaled (not always good);
apply a transformation such as y = log(x) and treat the result as interval-scaled;
treat them as ordinal data and take the ranks as interval-scaled.
Variables of Mixed Types
One technique is to bring all variables onto a common scale on the interval [0.0, 1.0]. Suppose the data set contains p variables of mixed type. The dissimilarity between objects i and j is

d(i, j) = sum_{f=1..p} delta_ij^f d_ij^f / sum_{f=1..p} delta_ij^f

where the indicator delta_ij^f is 0 if one of the values x_if, x_jf is missing (or if f is an asymmetric binary variable and both values are 0), and 1 otherwise.
Variables of Mixed Types (cont.)
The contribution d_ij^f of each variable depends on its type:
if f is binary or nominal: d_ij^f = 0 if x_if = x_jf, and 1 otherwise;
if f is interval-based: d_ij^f = |x_if - x_jf| / (max_h x_hf - min_h x_hf);
if f is ordinal or ratio-scaled: compute the ranks, map them onto [0.0, 1.0], and treat them as interval-based.
A sketch combining these rules is given after this list.
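A minimal Python sketch of the mixed-type dissimilarity, assuming only nominal and interval variables (ordinal and ratio variables would first be converted to ranks on [0, 1], as described above); all names and example values are illustrative:

```python
def mixed_dissimilarity(x, y, types, ranges):
    """Gower-style dissimilarity for two records of mixed type.
    types[f] is "nominal" or "interval"; ranges[f] is (max - min)
    of variable f over the whole data set (interval variables only)."""
    total, count = 0.0, 0
    for f, (a, b) in enumerate(zip(x, y)):
        if a is None or b is None:   # missing value: skip this variable
            continue
        if types[f] == "nominal":
            total += 0 if a == b else 1
        else:                        # interval-based contribution
            total += abs(a - b) / ranges[f]
        count += 1
    return total / count

# Two people: (marital status, height); height range over the data = 100.
x = ("married", 180)
y = ("single", 160)
print(mixed_dissimilarity(x, y, ["nominal", "interval"], [None, 100]))
# (1 + 20/100) / 2 = 0.6
```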
Clustering Methods
Classification Types
Clustering is an unsupervised method. (Diagram omitted.)
Clustering Methods
The main families of methods are partitioning, hierarchical, density-based, grid-based and model-based.
Partitioning Methods
Given n objects, k partitions are created, each containing at least one element. An iterative relocation technique is used to improve the partitioning; distance is the usual criterion.
Partitioning Methods (cont.)
These methods work well for finding spherical-shaped clusters but are not efficient on very large databases. In k-means each cluster is represented by the mean value of the objects in the cluster; in k-medoids each cluster is represented by an object near the centre of the cluster. A minimal k-means sketch follows.
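A minimal Python sketch of k-means as just described, assuming Euclidean distance and random initial centres (the parameter values are illustrative):

```python
import random
from math import dist

def k_means(points, k, iterations=100, seed=0):
    """Alternate between assigning each point to its nearest centre
    and recomputing each centre as the mean of its members."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)          # random initial centres
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centres[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:                      # keep old centre if empty
                centres[i] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return centres, clusters

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centres, clusters = k_means(pts, k=2)
print(centres)  # roughly (0.33, 0.33) and (9.33, 9.33), in some order
```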
Hierarchical Methods
Hierarchical methods create a hierarchical decomposition of the data set. Agglomerative approaches start with each object forming a separate group and merge objects or groups until all objects belong to one group or a termination condition holds. Divisive approaches start with all objects in the same cluster; each successive iteration splits a cluster until every object is in a separate cluster or a termination condition holds.
Hierarchical Clustering (cont.)
The definition of cluster proximity matters:
Min (single link): distance between the most similar pair; sensitive to noise.
Max (complete link): distance between the most dissimilar pair; tends to break large clusters.
Density-based Methods
These methods grow a cluster as long as the density in its neighbourhood exceeds some threshold, and so are able to find clusters of arbitrary shape.
Grid-based Methods
Grid methods divide the object space into a finite number of cells, forming a grid-like structure. Cells containing more than a certain number of elements are treated as dense, and dense cells are connected to form clusters. Processing time is fast and independent of the number of objects. STING and CLIQUE are examples.
Model-based Methods
Model-based methods hypothesise a model for each cluster and find the best fit of the data to the given model, e.g. statistical models or SOM networks.
Partition Methods
Given a database of n objects, a partition method organises them into k clusters (k <= n). The methods try to minimise an objective function, such as a distance-based one, so that similar objects end up "close" to each other.