1 Intelligent Information Processing Course, Lecture 4: Fuzzy Information Processing Techniques (4), Principles of Fuzzy Clustering. October 17, 2008 (Friday, periods 3 and 4, 理教 110)
2 Fuzzy Clustering What’s clustering? Some concepts Clustering Algorithms K-means method Fuzzy C-means (FCM) clustering method Hierarchical Clustering Algorithms Mixture of Gaussians Homework
3 What’s clustering? Clustering can be considered the most important unsupervised learning problem: it deals with finding a structure in a collection of unlabeled data. Definition of clustering The process of organizing objects into groups whose members are similar in some way. A cluster is a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
4 a graphical example of clustering
5 It is easy to identify the 4 clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance (in this case geometrical distance). This is called distance-based clustering. Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if the cluster defines a concept common to all those objects.
6 Vehicle Example
7 Vehicle Clusters [Scatter plot of weight (kg) vs. top speed (km/h) showing three clusters: sports cars, medium market cars, lorries]
8 Terminology [The same plot annotated with the terms: object (data point), feature, feature space, cluster, feature label]
9 The Goals of Clustering To determine the intrinsic grouping in a set of unlabeled data. How to decide what constitutes a good clustering?
10 The Goals of Clustering ( 2 ) It can be shown that there is no absolute “best” criterion which would be independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering will suit their needs.
11 The Goals of Clustering ( 3 ) For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding “natural clusters” and describing their unknown properties (“natural” data types), in finding useful and suitable groupings (“useful” data classes), or in finding unusual data objects (outlier detection).
12 Rich Applications of Clustering Pattern Recognition Spatial Data Analysis Create thematic maps in GIS by clustering feature spaces Detect spatial clusters or support other spatial data mining tasks Image Processing Economic Science (especially market research) WWW Document classification Cluster Weblog data to discover groups of similar access patterns
13 Examples of Clustering Applications Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs Land use: Identification of areas of similar land use in an earth observation database Insurance: Identifying groups of motor insurance policy holders with a high average claim cost City-planning: Identifying groups of houses according to their house type, value, and geographical location Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
14 What is Cluster Analysis? Cluster: a collection of data objects Similar to one another within the same cluster Dissimilar to the objects in other clusters Cluster analysis Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters Unsupervised learning: no predefined classes Typical applications As a stand-alone tool to get insight into data distribution As a preprocessing step for other algorithms
15 Requirements of a clustering algorithm scalability; dealing with different types of attributes; discovering clusters with arbitrary shape; minimal requirements for domain knowledge to determine input parameters; ability to deal with noise and outliers; insensitivity to the order of input records; ability to handle high dimensionality; interpretability and usability.
16 Quality: What Is Good Clustering? A good clustering method will produce high quality clusters with high intra-class similarity low inter-class similarity The quality of a clustering result depends on both the similarity measure used by the method and its implementation The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
17 Problems current clustering techniques do not address all the requirements adequately (and concurrently); dealing with a large number of dimensions and a large number of data items can be problematic because of time complexity; the effectiveness of the method depends on the definition of “distance” (for distance-based clustering); if an obvious distance measure doesn’t exist we must “define” one, which is not always easy, especially in multi-dimensional spaces; the result of the clustering algorithm (which in many cases can be arbitrary itself) can be interpreted in different ways.
18 Clustering Algorithms Clustering algorithms may be classified as listed below: Exclusive Clustering Overlapping Clustering Hierarchical Clustering Probabilistic Clustering
19 Exclusive Clustering Data are grouped in an exclusive way, so that if a certain datum belongs to a definite cluster then it cannot be included in another cluster. A simple example of this is shown in the figure below, where the separation of points is achieved by a straight line on a two-dimensional plane.
20 [Figure: points on a plane separated into two exclusive clusters by a straight line]
21 Overlapping clustering Overlapping clustering uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership. In this case, each data point is associated with an appropriate membership value.
22 Hierarchical Clustering A hierarchical clustering algorithm is based on the union of the two nearest clusters. The starting condition is realized by treating every datum as a cluster of its own. After a few iterations the desired final clusters are reached.
23 Probabilistic Clustering Probabilistic clustering uses a completely probabilistic approach to clustering the data at hand.
24 Four most used clustering algorithms K-means Fuzzy C-means Hierarchical clustering Mixture of Gaussians
25 Distance Measure An important component of a clustering algorithm is the distance measure between data points. If the components of the data instance vectors are all in the same physical units then it is possible that the simple Euclidean distance metric is sufficient to successfully group similar data instances. However, even in this case the Euclidean distance can sometimes be misleading.
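As a hedged, self-contained illustration (the car figures below are invented for this example, not taken from the lecture's data), the following Python snippet shows how a feature with a large numeric range can dominate the raw Euclidean distance, which is one way the metric can mislead:

```python
import numpy as np

# Cars described by (top speed in km/h, weight in kg).
a = np.array([250.0, 1200.0])   # sports car
b = np.array([160.0, 1210.0])   # medium market car, similar weight to a
c = np.array([245.0, 1500.0])   # another sports car, but 300 kg heavier

def euclid(x, y):
    return float(np.linalg.norm(x - y))

# Raw distances: the 300 kg gap outweighs the 90 km/h gap,
# so the sports car a looks closer to the medium car b.
print(euclid(a, b), euclid(a, c))        # ~90.6 vs ~300.0

# Dividing each feature by a rough range puts them on a common scale,
# and a now pairs with the other sports car c.
scale = np.array([100.0, 1000.0])
print(euclid(a / scale, b / scale), euclid(a / scale, c / scale))  # ~0.90 vs ~0.30
```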
27 K-Means Clustering K-means (MacQueen, 1967) is one of the simplest unsupervised learning algorithms that solve the well-known clustering problem. The procedure follows a simple and easy way to classify a given data set through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to define k centroids, one for each cluster. These centroids should be placed in a cunning way, because different locations cause different results; the better choice is therefore to place them as far away from each other as possible. The next step is to take each point belonging to the data set and associate it with the nearest centroid. When no point is pending, the first step is completed and an initial grouping is done. At this point we need to re-calculate k new centroids as the barycenters of the clusters resulting from the previous step. After we have these k new centroids, a new binding has to be done between the same data set points and the nearest new centroid. A loop has been generated: the k centroids change their location step by step until no more changes are made, in other words until the centroids no longer move. Finally, this algorithm aims at minimizing an objective function, in this case the squared-error function J = \sum_{j=1}^{k} \sum_{i=1}^{n} \| x_i^{(j)} - c_j \|^2, where x_i^{(j)} is the i-th data point assigned to cluster j and c_j is the centroid of cluster j.
29 Partitioning Algorithms: Basic Concept Partitioning method: construct a partition of a database D of n objects into a set of k clusters such that the sum of squared distances is minimized. Given k, find a partition into k clusters that optimizes the chosen partitioning criterion. Global optimum: exhaustively enumerate all partitions. Heuristic methods: k-means and k-medoids algorithms. k-means (MacQueen ’67): each cluster is represented by the center of the cluster. k-medoids (Kaufman & Rousseeuw ’87): each cluster is represented by one of the objects in the cluster.
30 The K-Means Clustering Method Given k, the k-means algorithm is implemented in four steps: Partition objects into k nonempty subsets. Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster). Assign each object to the cluster with the nearest seed point. Go back to Step 2; stop when no new assignments are made.
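A minimal sketch of these four steps in Python (for simplicity it initializes by sampling k data points as seeds instead of forming an explicit initial partition; the function name, parameters, and synthetic data are illustrative assumptions, not part of the lecture):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Plain k-means on an (n, d) array X, following the steps above."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the first seed points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each object goes to the cluster with the nearest seed.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids no longer move (no new assignments).
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Example: two well-separated 2-D blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
```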
31 The K-Means Clustering Method: Example (K = 2) [Sequence of plots: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign until stable]
32 Comments on the K-Means Method Strength: relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n. For comparison: PAM: O(k(n-k)²), CLARA: O(ks² + k(n-k)). Comment: often terminates at a local optimum; the global optimum may be found using techniques such as deterministic annealing and genetic algorithms. Weaknesses: applicable only when a mean is defined (what about categorical data?); need to specify k, the number of clusters, in advance; unable to handle noisy data and outliers; not suitable for discovering clusters with non-convex shapes.
33 Fuzzy C-Means Clustering Fuzzy c-means (FCM) is a method of clustering which allows one piece of data to belong to two or more clusters. This method (developed by Dunn in 1973 and improved by Bezdek in 1981) is frequently used in pattern recognition. It is based on minimization of the following objective function:

J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^m \| x_i - c_j \|^2,

where m is any real number greater than 1, u_{ij} is the degree of membership of x_i in cluster j, x_i is the i-th of the d-dimensional measured data, c_j is the d-dimensional center of cluster j, and \|\cdot\| is any norm expressing the similarity between a measured datum and a center.
34 Fuzzy partitioning is carried out through an iterative optimization of the objective function shown above, with the membership u_{ij} and the cluster centers c_j updated by

u_{ij} = 1 / \sum_{k=1}^{C} ( \|x_i - c_j\| / \|x_i - c_k\| )^{2/(m-1)},    c_j = ( \sum_{i=1}^{N} u_{ij}^m x_i ) / ( \sum_{i=1}^{N} u_{ij}^m ).

The iteration stops when \max_{ij} | u_{ij}^{(k+1)} - u_{ij}^{(k)} | < \varepsilon, where \varepsilon is a termination criterion between 0 and 1 and k is the iteration step.
35 FCM’s Steps 1. Initialize the membership matrix U = [u_ij], U(0). 2. At step k: calculate the center vectors C(k) = [c_j] from U(k). 3. Update U(k) to U(k+1). 4. If ||U(k+1) - U(k)|| < ε then STOP; otherwise return to step 2.
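Since the homework asks for an FCM implementation, here is a minimal sketch of the steps above in Python; the function signature, the synthetic 2-D data, and the defaults m = 2, ε = 0.01 are illustrative assumptions, not the official course solution:

```python
import numpy as np

def fcm(X, c, m=2.0, eps=1e-2, max_iter=100, seed=0):
    """Fuzzy c-means on an (n, d) array X with c clusters and fuzzifier m > 1."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Step 1: initialize U(0) with rows that sum to 1.
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(max_iter):
        # Step 2: cluster centers c_j from the current memberships.
        W = U ** m
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Step 3: update memberships from the distances to the centers:
        # u_ij = 1 / sum_k (||x_i - c_j|| / ||x_i - c_k||)^(2/(m-1)).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        d = np.fmax(d, 1e-12)                       # guard against division by zero
        U_new = 1.0 / ((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1))).sum(axis=2)
        # Step 4: stop when the memberships barely change.
        if np.abs(U_new - U).max() < eps:
            U = U_new
            break
        U = U_new
    return U, centers

# Example on synthetic 2-D data; U.argmax(axis=1) gives crisp labels if needed.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
U, centers = fcm(X, c=2)
```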
36 Remarks As already mentioned, data are bound to each cluster by means of a membership function, which represents the fuzzy behavior of this algorithm. To do that, we simply build an appropriate matrix named U whose entries are numbers between 0 and 1 and represent the degree of membership between data and cluster centers.
37 A 1-D example
38 matrix U Now, instead of using a graphical representation, we introduce a matrix U whose entries are taken from the membership functions. The number of rows and columns depends on how many data and clusters we are considering; more exactly, we have C = 2 columns (C = 2 clusters) and N rows.
39 Other properties In both the crisp and the fuzzy case, the memberships of each datum over the C clusters sum to 1: \sum_{j=1}^{C} u_{ij} = 1 for every i.
40 A 1-D application of the FCM Figures below show the membership value for each datum and for each cluster.
41 In the simulation, we have used a fuzziness coefficient m = 2 and we have also imposed that the algorithm terminate when the change in the memberships falls below a tolerance ε. The picture shows the initial condition, where the fuzzy distribution depends on the particular position of the clusters; no step has been performed yet, so the clusters are not identified very well. Now we can run the algorithm until the stop condition is verified. The figure below shows the final condition, reached at the 8th step with m = 2 and ε = 0.3.
42 Is it possible to do better? Certainly: we could use a higher accuracy, but we would also have to pay with a bigger computational effort. In the figure below we can see a better result obtained with the same initial conditions and ε = 0.01, but 37 steps were needed!
43 Hierarchical Clustering Algorithms Given a set of N items to be clustered, and an N*N distance (or similarity) matrix, the basic process of hierarchical clustering (defined by S.C. Johnson in 1967) is this: 1. Start by assigning each item to its own cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain. 2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster less. 3. Compute distances (similarities) between the new cluster and each of the old clusters. 4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
44 Algorithm Steps 1. Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0. 2. Find the least dissimilar pair of clusters in the current clustering, say pair (r), (s), according to d[(r),(s)] = min d[(i),(j)], where the minimum is over all pairs of clusters in the current clustering. 3. Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form clustering m. Set the level of this clustering to L(m) = d[(r),(s)]. 4. Update the proximity matrix D by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined as d[(k),(r,s)] = min{ d[(k),(r)], d[(k),(s)] }. 5. If all objects are in one cluster, stop. Else, go to step 2.
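A minimal sketch of these steps for the single-link case in Python (it takes a precomputed symmetric distance matrix D; instead of shrinking a proximity matrix as in step 4, it recomputes the single-link distance from the original item distances, which yields the same merges; the function name is an illustrative assumption):

```python
import numpy as np

def single_linkage(D):
    """Agglomerative single-link clustering on an (N, N) symmetric distance
    matrix D. Returns the merges as (cluster_a, cluster_b, level L(m))."""
    clusters = [[i] for i in range(len(D))]      # level L(0): one cluster per item
    merges = []
    while len(clusters) > 1:
        # Find the least dissimilar pair of clusters; for single link the
        # cluster distance is the minimum pairwise item distance.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Merge (r) and (s) and record the level L(m) = d[(r),(s)].
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return merges
```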
45 agglomerative / divisive This kind of hierarchical clustering is called agglomerative because it merges clusters iteratively. There is also a divisive hierarchical clustering which does the reverse by starting with all objects in one cluster and subdividing them into smaller pieces.
46 Example: a hierarchical clustering based on the distances in kilometers between some Italian cities
47 Input distance matrix (in km) between the cities BA, FI, MI, NA, RM, TO [full 6x6 distance table shown on the slide]
48 MI and TO merged into MI/TO [updated distance matrix over BA, FI, MI/TO, NA, RM]
49 NA and RM merged into a new NA/RM cluster [updated distance matrix over BA, FI, MI/TO, NA/RM]
50 Two clusters remain, BA/FI/NA/RM and MI/TO, at distance 295:
              BA/FI/NA/RM   MI/TO
 BA/FI/NA/RM        0         295
 MI/TO            295           0
51 Hierarchical tree
52 Clustering as a Mixture of Gaussians This is a model-based approach, which consists in using certain models for clusters and attempting to optimize the fit between the data and the model. Each cluster can be mathematically represented by a parametric distribution, like a Gaussian (continuous) or a Poisson (discrete). The entire data set is therefore modeled by a mixture of these distributions. An individual distribution used to model a specific cluster is often referred to as a component distribution.
53 A mixture model with high likelihood tends to have the following traits: component distributions have high “peaks” (data in one cluster are tight); the mixture model “covers” the data well (dominant patterns in the data are captured by component distributions). Main advantages of model-based clustering: well-studied statistical inference techniques are available; flexibility in choosing the component distribution; a density estimate is obtained for each cluster; a “soft” classification is available.
54 Mixture of Gaussians
55 The algorithm works in the following way: it chooses a component (a Gaussian) at random according to its mixing probability; it then samples a point from that Gaussian. Suppose we have observed the sample x_1, x_2, ..., x_N. We can then write the likelihood of this sample; what we really want to maximize is the probability of a datum given the centres of the Gaussians, p(x | \mu_1, ..., \mu_K).
56 This probability of a datum given the centres is the basis for writing the likelihood function. Now we should maximize the likelihood function by computing its derivatives with respect to the parameters and setting them to zero, but that would be too difficult. That's why we use a simplified algorithm called EM (Expectation-Maximization).
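A minimal sketch of EM for a one-dimensional mixture of K Gaussians, assuming for brevity a fixed common variance so that only the mixing probabilities and the centres μ_k are re-estimated (the function name, data, and parameter choices are illustrative, not from the lecture):

```python
import numpy as np

def em_gaussian_mixture(x, K, n_iter=50, sigma=1.0, seed=0):
    """EM for a 1-D mixture of K Gaussians with fixed common variance sigma^2.
    Estimates the mixing probabilities pi_k and the centres mu_k."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                       # mixing probabilities
    mu = rng.choice(x, size=K, replace=False)      # initial centres
    for _ in range(n_iter):
        # E-step: responsibility of component k for each datum x_n
        # (the common normalizing constant of the Gaussians cancels out).
        dens = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2)
        resp = pi * dens
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate pi_k and mu_k from the responsibilities.
        Nk = resp.sum(axis=0)
        pi = Nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
    return pi, mu

# Example: data drawn from two Gaussians centred at 0 and 4.
rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 1, 200)])
print(em_gaussian_mixture(x, K=2))
```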
57 References
Tariq Rashid: “Clustering”
Osmar R. Zaïane: “Principles of Knowledge Discovery in Databases, Chapter 8: Data Clustering”
Pier Luca Lanzi: “Ingegneria della Conoscenza e Sistemi Esperti, Lezione 2: Apprendimento non supervisionato”
J. C. Dunn (1973): “A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters”, Journal of Cybernetics, 3
J. C. Bezdek (1981): “Pattern Recognition with Fuzzy Objective Function Algorithms”, Plenum Press, New York
Hans-Joachim Mucha and Hizir Sofyan: “Nonhierarchical Clustering”
A. P. Dempster, N. M. Laird, and D. B. Rubin (1977): “Maximum Likelihood from Incomplete Data via the EM Algorithm”, Journal of the Royal Statistical Society, Series B, vol. 39, no. 1, pp. 1-38
Jia Li: “Data Mining: Clustering by Mixture Models”
58 Homework 1. Why do we need cluster analysis? What is it useful for? 2. List some application areas of clustering algorithms and briefly explain each. 3. Implement the FCM algorithm, use it to cluster a 2-D data set, and report your experimental results.
59 Thank you!