
1 On statistical models of cluster stability. Z. Volkovich (a, b), Z. Barzily (a), L. Morozensky (a). a. Software Engineering Department, ORT Braude College of Engineering, Karmiel, Israel. b. Department of Mathematics and Statistics, University of Maryland, Baltimore County (UMBC), Baltimore, USA.

2 What is Clustering? Clustering deals with the partitioning of a data set into groups of elements which are similar to each other. Group membership is determined by means of a distance-like function that measures the resemblance between two data points.

3 Goal of the paper. In the current paper we present a method for assessing cluster stability. This method, combined with a clustering algorithm, yields an estimate of the data partition, namely, the number of clusters and the attributes of each cluster.

4 Concept of the paper. The basic idea of our method is that if one "properly" clusters two independent samples, then, under the assumption of a consistent clustering algorithm, the clustered samples can be classified as two samples drawn from the same population.

5 The Model. Conclusion: the problem we are dealing with belongs to the domain of hypothesis testing. Since no prior knowledge of the population distribution is available, a distribution-free two-sample test must be applied.

6 Two-sample test. Which two-sample tests can be used for our purpose? There are several possibilities. We consider the two-sample test built on the negative definite kernels approach proposed by A. A. Zinger, A. V. Kakosyan and L. B. Klebanov (1989) and L. Klebanov (2003). This approach is very similar to the one proposed by G. Zech and B. Aslan (2005). Applications of these distances to the characterization of distributions were also discussed by L. Klebanov, T. Kozubowski, S. Rachev and V. Volkovich (2001).

7 Negative Definite Kernels. A real symmetric function N is negative definite if, for any n ≥ 1, any x_1, …, x_n ∈ X, and any real numbers c_1, …, c_n satisfying

$$\sum_{i=1}^{n} c_i = 0,$$

the following holds:

$$\sum_{i,j=1}^{n} c_i c_j N(x_i, x_j) \le 0.$$

The kernel is called strongly negative definite if equality in this relationship is reached only when c_i = 0, i = 1, …, n.

8 Example. Functions of the type φ(x) = ‖x‖^r, 0 < r ≤ 2, produce negative definite kernels N(x, y) = ‖x − y‖^r, which are strongly negative definite if 0 < r < 2. It is important to note that a negative definite kernel N_2 can be obtained from a negative definite kernel N_1 by the transformations N_2 = N_1^α, 0 < α < 1, and N_2 = ln(1 + N_1).
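As a quick illustration of the definition above, the following Python sketch numerically probes the negative-definite condition for N(x, y) = ‖x − y‖^r on a random point set. It is a sanity check on a finite sample, not a proof, and the function names are ours, not the paper's:

```python
import numpy as np

def nd_condition_holds(kernel, points, trials=1000, tol=1e-9):
    """Numerically probe the negative-definite condition on a finite point set:
    for coefficient vectors c with sum(c) = 0, the quadratic form
    sum_{i,j} c_i * c_j * N(x_i, x_j) must be <= 0."""
    n = len(points)
    gram = np.array([[kernel(points[i], points[j]) for j in range(n)]
                     for i in range(n)])
    rng = np.random.default_rng(0)
    for _ in range(trials):
        c = rng.standard_normal(n)
        c -= c.mean()                  # enforce the constraint sum(c) = 0
        if c @ gram @ c > tol:
            return False
    return True

# phi(x) = ||x||^r with 0 < r <= 2 yields the kernel N(x, y) = ||x - y||^r.
pts = np.random.default_rng(1).standard_normal((20, 3))
for r in (0.5, 1.0, 2.0):
    ok = nd_condition_holds(lambda x, y: np.linalg.norm(x - y) ** r, pts)
    print(f"r = {r}: negative-definite condition holds on this sample: {ok}")
```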

9 Negative Definite Kernel test. We restrict ourselves to the hard clustering situation, where the partition is defined by a set of associations c(x) ∈ {1, …, k}. In this case, the underlying distribution of X is

$$F(x) = \sum_{j=1}^{k} p_j F_j(x), \qquad (*)$$

where p_j are the cluster probabilities and F_j are the inner-cluster distributions.

10 Negative Definite Kernel test (2). We consider kernels N(x_1, x_2, c_1, c_2) = N_x(x_1, x_2) χ(c_1 = c_2), where N_x(x_1, x_2) is a negative definite kernel and χ(c_1 = c_2) is the indicator function of the event {c_1 = c_2}. Formally speaking, this kernel is not a negative definite kernel. However, a distance can be constructed as

$$L(\mu, \nu) = \iint N(x_1, x_2, c_1, c_2)\, d\mu(x_1, c_1)\, d\nu(x_2, c_2)$$

and

$$Dis(\mu, \nu) = L(\mu, \mu) + L(\nu, \nu) - 2 L(\mu, \nu).$$
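A plug-in estimate of L and Dis for two clustered samples can be sketched as follows. The pairwise-average normalization 1/(|X||Y|) is our assumption here; the paper's own statistic (slide 14) normalizes by cluster sizes |C_j|:

```python
import numpy as np

def l_hat(X, cx, Y, cy, nx_kernel):
    """Empirical L(mu, nu) for N(x1, x2, c1, c2) = N_x(x1, x2) * chi(c1 = c2):
    average the base kernel over all cross pairs sharing a cluster label."""
    total = sum(nx_kernel(x, y)
                for x, cxi in zip(X, cx)
                for y, cyj in zip(Y, cy)
                if cxi == cyj)
    return total / (len(X) * len(Y))

def dis_hat(X, cx, Y, cy, nx_kernel):
    """Plug-in estimate of Dis(mu, nu) = L(mu, mu) + L(nu, nu) - 2 L(mu, nu)."""
    return (l_hat(X, cx, X, cx, nx_kernel)
            + l_hat(Y, cy, Y, cy, nx_kernel)
            - 2.0 * l_hat(X, cx, Y, cy, nx_kernel))
```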

11 Negative Definite Kernel test (3). Theorem. Let N(x_1, x_2, c_1, c_2) be a negative definite kernel as described above, and let μ and ν be two measures satisfying (*) such that P_μ(c|x) = P_ν(c|x). Then Dis(μ, ν) ≥ 0. If N_x is a strongly negative definite function, then Dis(μ, ν) = 0 if and only if μ = ν.

12 Negative Definite Kernel test (4). Let S_1: x_1, x_2, …, x_n and S_2: y_1, y_2, …, y_n be two samples of independent random vectors having probability laws F and G, respectively. We wish to test the hypothesis

$$H_0: F = G$$

against the alternative

$$H_1: F \ne G,$$

when the distributions F and G are unknown.
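The slides do not state how the null distribution of the statistic is obtained. One standard distribution-free option is a permutation test, sketched here under that assumption (reusing dis_hat from the previous sketch):

```python
import numpy as np

def permutation_pvalue(X, cx, Y, cy, nx_kernel, n_perm=200, seed=0):
    """Distribution-free test of H0: F = G by comparing the observed Dis
    statistic with its distribution under random reassignment of points
    (together with their cluster labels) to the two samples."""
    rng = np.random.default_rng(seed)
    observed = dis_hat(X, cx, Y, cy, nx_kernel)
    pooled = list(zip(list(X) + list(Y), list(cx) + list(cy)))
    n, exceed = len(X), 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        pts = [pooled[i][0] for i in perm]
        labs = [pooled[i][1] for i in perm]
        if dis_hat(pts[:n], labs[:n], pts[n:], labs[n:], nx_kernel) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)
```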

13 Algorithm description. Let us suppose that a hard clustering algorithm Cl, based on the probability model, is available. Input parameters: a sample S and a predefined number of clusters k. Output: a clustered sample S(k) = (S, C_k), consisting of S together with a vector C_k of the cluster labels of S. For two given disjoint samples S_1 and S_2 we consider the clustered sample (S_1 ∪ S_2, C_k) and denote by c the mapping from this clustered sample to C_k.

14 Algorithm description (2). Let us introduce the empirical counterpart of the distance Dis, where |C_j| is the size of cluster number j.
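The transcript omits the formula itself, so the sketch below is only a guess at its shape, built from the two ingredients the slide does give: the kernel restricted to within-cluster pairs, and a normalization involving |C_j|. The 1/|C_j|² weighting is our assumption:

```python
def cluster_normalized_stat(X, cx, Y, cy, nx_kernel, k):
    """Hypothetical per-cluster statistic: for each cluster j, sum the base
    kernel over cross-sample pairs inside C_j and normalize by |C_j|^2,
    where |C_j| is the cluster size over the pooled sample."""
    total = 0.0
    for j in range(k):
        xs = [x for x, c in zip(X, cx) if c == j]
        ys = [y for y, c in zip(Y, cy) if c == j]
        size = len(xs) + len(ys)          # |C_j|
        if not xs or not ys:
            continue
        total += sum(nx_kernel(x, y) for x in xs for y in ys) / size ** 2
    return total
```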

15 Algorithm description (3). The algorithm consists of the following steps:

16 Algorithm description (4). Remarks about the algorithm:
1. Need for standardization (Step 6):
   i. The clustering algorithm may not determine the correct cluster for an outlier. This adds noise to the result.
   ii. The noise level decreases in k, since fewer data elements are assigned to distant centroids.
   iii. Standardization decreases the noise level.
2. Choice of the optimal k as the most concentrated (Step 8):
   i. If k is less than the "true" number of clusters, then at least one cluster is formed by uniting two separate clusters and the statistic is thus less concentrated.
   ii. If k is larger than the "true" number of clusters, then at least one cluster is formed in a location where there is merely a random concentration of data elements in the sample. This, again, decreases the concentration of the statistic, because two clusters are not likely to have the same random concentration.
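The step list itself is not reproduced in this transcript, but the remarks above reference sampling (steps 3 and 4), the statistic (step 5), standardization (step 6), concentration scoring (step 7) and the choice of k (step 8). A rough Python sketch of that loop, under our assumptions (scikit-learn's KMeans standing in for the generic algorithm Cl, kurtosis as the concentration index, and dis_hat from the earlier sketch):

```python
import numpy as np
from scipy.stats import kurtosis
from sklearn.cluster import KMeans

def stability_scan(data, k_range, n_pairs=50, sample_size=200, seed=0):
    """For each candidate k: repeatedly draw two disjoint samples, cluster
    their union, compute the two-sample statistic, standardize the values,
    and score how concentrated their empirical distribution is."""
    rng = np.random.default_rng(seed)
    sqdist = lambda x, y: float(np.sum((x - y) ** 2))   # N_x(x, y) = ||x - y||^2
    scores = {}
    for k in k_range:
        values = []
        for _ in range(n_pairs):
            # Steps 3-4: two disjoint samples, clustered jointly.
            idx = rng.choice(len(data), 2 * sample_size, replace=False)
            s1, s2 = data[idx[:sample_size]], data[idx[sample_size:]]
            labels = KMeans(n_clusters=k, n_init=10).fit_predict(np.vstack([s1, s2]))
            # Step 5: two-sample statistic on the clustered samples.
            values.append(dis_hat(s1, labels[:sample_size],
                                  s2, labels[sample_size:], sqdist))
        values = np.asarray(values)
        values = (values - values.mean()) / (values.std() + 1e-12)  # step 6
        scores[k] = kurtosis(values)   # step 7: one possible concentration index
    return scores

# Step 8: inspect scores and pick the k whose distribution is most concentrated,
# e.g. scores = stability_scan(X, range(2, 8)).
```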

17 Numerical experiments. In order to evaluate the performance of the described methodology, we performed several numerical experiments on synthetic and real datasets. The selected samples (steps 3 and 4 of the algorithm) are clustered by applying the K-Means algorithm. The results obtained are used as inputs for steps 4 and 5 of the algorithm. The quality of the k* partitions is evaluated (step 7 of the algorithm) by three concentration statistics: Friedman's index, the KL-distance, and the kurtosis.
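Of the three indexes, only the kurtosis has a canonical definition we can reproduce without the paper; Friedman's index is not defined in the transcript, and the reference density for the KL-distance is not stated. A sketch of the two we can reasonably compute, with a standard normal as an assumed KL reference:

```python
import numpy as np
from scipy.stats import entropy, kurtosis

# Placeholder for the standardized statistic values produced by the algorithm.
values = np.random.default_rng(2).standard_normal(500)

# Kurtosis of the empirical distribution of the statistic.
print("kurtosis:", kurtosis(values, fisher=False))

# KL-distance estimated from a histogram against an assumed reference density.
hist, edges = np.histogram(values, bins=30, density=True)
centers = (edges[:-1] + edges[1:]) / 2
ref = np.exp(-centers ** 2 / 2) / np.sqrt(2 * np.pi)   # standard normal (assumption)
mask = hist > 0
print("KL estimate:", entropy(hist[mask], ref[mask]))  # scipy normalizes both inputs
```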

18 Numerical experiments (2). We demonstrate the performance of our algorithm by comparing our clustering results to the "true" structure of real datasets. The dataset is chosen from the text collections available at http://www.dcs.gla.ac.uk/idom/ir_resources/test_collections/. The set consists of the following three collections:
DC0 – Medlars Collection (1033 medical abstracts).
DC1 – CISI Collection (1460 information science abstracts).
DC2 – Cranfield Collection (1400 aerodynamics abstracts).

19 Numerical experiments (3). Following the well-known "bag of words" approach, 300 and 600 "best" terms were selected, and the thirty leading principal components were found. For the case where both the number of samples and the sample size equal 1000, with K(x, y) = ‖x − y‖², we obtained:

k                 2        3        4        5        6        7
Friedman index    0.0193   0.0150   0.0299   0.0919   0.0867   0.0647
KL-distance       0.0522   0.0558   0.0767   0.1541   0.1633   0.1711
Kurtosis          3.3275   2.4233   3.0289   4.6352   5.6070   7.2073

20 Numerical experiments (4). We can see that two of the indexes indicate three clusters in the data. Thank you!

