Presentation transcript:

On statistical models of cluster stability
Z. Volkovich a,b, Z. Barzily a, L. Morozensky a
a. Software Engineering Department, ORT Braude College of Engineering, Karmiel, Israel
b. Department of Mathematics and Statistics, The University of Maryland, Baltimore County (UMBC), Baltimore, USA

What is Clustering? Clustering deals with the partitioning of a data set into groups of elements that are similar to each other. Group membership is determined by means of a distance-like function that measures the resemblance between two data points.

Goal of the paper In this paper we present a method for assessing cluster stability. Combined with a clustering algorithm, the method yields an estimate of a data partition, namely the number of clusters and the attributes of each cluster.

Concept of the paper The basic idea of our method is that if one "properly" clusters two independent samples then, under the assumption of a consistent clustering algorithm, the clustered samples can be classified as two samples drawn from the same population.

The Model Conclusion: the problem we are dealing with belongs to the domain of hypothesis testing. Since no prior knowledge of the population distribution is available, a distribution-free two-sample test must be applied.

Two-sample test Which two-sample tests can be used for our purpose? There are several possibilities. We consider the two-sample test built on the negative definite kernel approach proposed by A. A. Zinger, A. V. Kakosyan and L. B. Klebanov (1989) and L. Klebanov (2003). This approach is very similar to the one proposed by G. Zech and B. Aslan (2005). Applications of these distances to the characterization of distributions were also discussed by L. Klebanov, T. Kozubowski, S. Rachev and V. Volkovich (2001).

Negative Definite Kernels A real symmetric function $N$ on $X \times X$ is negative definite if, for any $n \ge 1$, any $x_1, \dots, x_n \in X$ and any real numbers $c_1, \dots, c_n$ such that $\sum_{i=1}^{n} c_i = 0$,

$$\sum_{i=1}^{n} \sum_{j=1}^{n} c_i c_j N(x_i, x_j) \le 0.$$

The kernel is called strongly negative definite if equality in this relationship is reached only if $c_i = 0$, $i = 1, \dots, n$.

Example Functions of the type $\varphi(x) = \|x\|^r$, $0 < r \le 2$, produce negative definite kernels $N(x, y) = \|x - y\|^r$, which are strongly negative definite if $0 < r < 2$. It is important to note that a negative definite kernel $N_2$ can be obtained from a negative definite kernel $N_1$ by the transformations $N_2 = N_1^{\alpha}$, $0 < \alpha < 1$, and $N_2 = \ln(1 + N_1)$.
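As a small illustration (ours, not part of the slides), these kernel families are straightforward to code. The sketch below assumes Euclidean data held in NumPy arrays; Python is used for all code sketches in this transcript.

```python
import numpy as np

def nd_kernel(x, y, r=1.0):
    """N(x, y) = ||x - y||^r: negative definite for 0 < r <= 2,
    strongly negative definite for 0 < r < 2 (see the slide above)."""
    return np.linalg.norm(np.asarray(x) - np.asarray(y)) ** r

def power_transform(N, alpha=0.5):
    """First transformation from the slide: N^alpha, 0 < alpha < 1,
    yields another negative definite kernel."""
    return lambda x, y: N(x, y) ** alpha

def log_transform(N):
    """Second transformation: ln(1 + N) yields another negative definite kernel."""
    return lambda x, y: np.log1p(N(x, y))
```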

Negative Definite Kernel test We restrict ourselves to the hard clustering situation, where the partition is defined by a set of associations $c(x) \in \{1, \dots, k\}$. In this case, the underlying distribution of $X$ is the mixture

$$P(x) = \sum_{j=1}^{k} p_j P_j(x), \qquad (*)$$

where $p_j$ are the cluster probabilities and $P_j$ are the inner-cluster distributions.

Negative Definite Kernel test (2) We consider kernels $N(x_1, x_2, c_1, c_2) = N_x(x_1, x_2)\,\chi(c_1 = c_2)$, where $N_x(x_1, x_2)$ is a negative definite kernel and $\chi(c_1 = c_2)$ is the indicator function of the event $\{c_1 = c_2\}$. Formally speaking, this kernel is not a negative definite kernel. However, a distance can be constructed as

$$L(\mu, \nu) = \int\!\!\int N(x_1, x_2, c_1, c_2)\, d\mu(x_1, c_1)\, d\nu(x_2, c_2)$$

and $\mathrm{Dis}(\mu, \nu) = L(\mu, \mu) + L(\nu, \nu) - 2 L(\mu, \nu)$.
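A minimal sketch (ours, not the authors' code) of the empirical versions of L and Dis, assuming $N_x(x, y) = \|x - y\|$, samples stored as (n, d) NumPy arrays, and hard cluster labels stored as integer arrays:

```python
import numpy as np
from scipy.spatial.distance import cdist

def l_hat(X1, c1, X2, c2):
    """Empirical L(mu, nu): average of N_x over pairs sharing a cluster label."""
    d = cdist(X1, X2)                                     # ||x_i - y_j|| for all pairs
    same = np.asarray(c1)[:, None] == np.asarray(c2)[None, :]
    return (d * same).mean()

def dis_hat(X1, c1, X2, c2):
    """Sample analogue of Dis(mu, nu) = L(mu, mu) + L(nu, nu) - 2 L(mu, nu)."""
    return l_hat(X1, c1, X1, c1) + l_hat(X2, c2, X2, c2) - 2.0 * l_hat(X1, c1, X2, c2)
```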

Negative Definite Kernel test (3) Theorem. Let $N(x_1, x_2, c_1, c_2)$ be a negative definite kernel as described above and let $\mu$ and $\nu$ be two measures satisfying (*) such that $P_\mu(c\,|\,x) = P_\nu(c\,|\,x)$. Then: • $\mathrm{Dis}(\mu, \nu) \ge 0$; • if $N_x$ is a strongly negative definite function, then $\mathrm{Dis}(\mu, \nu) = 0$ if and only if $\mu = \nu$.

Negative Definite Kernel test (4) Let $S_1: x_1, x_2, \dots, x_n$ and $S_2: y_1, y_2, \dots, y_n$ be two samples of independent random vectors having probability laws $F$ and $G$, respectively. We want to test the hypothesis $H_0: F = G$ against the alternative $H_1: F \ne G$ when the distributions $F$ and $G$ are unknown.
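The slides do not spell out how the null distribution of the statistic is calibrated; one standard distribution-free device, offered here purely as an assumption-laden sketch, is a permutation test, since under $H_0: F = G$ the pooled observations are exchangeable:

```python
import numpy as np

def permutation_pvalue(X1, c1, X2, c2, stat, n_perm=500, seed=0):
    """Permutation p-value for a two-sample statistic such as dis_hat above."""
    rng = np.random.default_rng(seed)
    X = np.vstack([X1, X2])
    c = np.concatenate([np.asarray(c1), np.asarray(c2)])
    n = len(X1)
    observed = stat(X1, c1, X2, c2)
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(X))        # reshuffle sample membership
        a, b = idx[:n], idx[n:]
        if stat(X[a], c[a], X[b], c[b]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)         # add-one correction
```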

Algorithm description Suppose that a hard clustering algorithm Cl, based on the probability model, is available. Input parameters: a sample S to be clustered and a predefined number of clusters k. Output: a clustered sample S(k) = (S, Ck) consisting of S together with a vector Ck of its cluster labels. For two given disjoint samples S1 and S2 we consider the clustered sample (S1 ∪ S2, Ck) and denote by c the mapping from this clustered sample to Ck. A minimal illustration of this interface follows.
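In the sketch below, K-Means stands in for the generic hard clustering algorithm Cl (our choice here; the experiments slides later use K-Means as well):

```python
import numpy as np
from sklearn.cluster import KMeans

def Cl(S, k):
    """Hard clustering step: input sample S and number of clusters k,
    output the clustered sample S(k) = (S, C_k)."""
    C_k = KMeans(n_clusters=k, n_init=10).fit_predict(S)
    return S, C_k

# For two disjoint samples S1, S2 the union is clustered jointly, and the
# label vector C_k plays the role of the mapping c:
#   S, C_k = Cl(np.vstack([S1, S2]), k)
```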

Algorithm description (2) Let us introduce the sample analogue of Dis(μ, ν), computed over the two clustered samples and normalized by the cluster sizes, where |Cj| is the size of cluster number j.

Algorithm description (3) The algorithm consists of a sequence of steps, referred to below as steps 3–8: drawing pairs of disjoint samples, clustering them, computing the statistic, standardizing its values, and scoring the concentration of its distribution for each candidate k; see the sketch below.
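The step list itself did not survive transcription, so the following Python sketch is our reconstruction of the flow from the step numbers cited on the next slides (sampling in steps 3–4, clustering in step 5, standardization in step 6, concentration scoring in steps 7–8); details such as the number of pairs and the sample size are illustrative assumptions:

```python
import numpy as np

def stability_profile(data, k_values, n_pairs=50, sample_size=200, seed=0):
    """For each candidate k, build the empirical distribution of the
    standardized two-sample statistic over many pairs of disjoint samples.
    Uses Cl and dis_hat from the sketches above; data is an (n, d) array."""
    rng = np.random.default_rng(seed)
    profile = {}
    for k in k_values:
        values = []
        for _ in range(n_pairs):
            # steps 3-4: draw two disjoint samples from the data
            idx = rng.choice(len(data), size=2 * sample_size, replace=False)
            S1, S2 = data[idx[:sample_size]], data[idx[sample_size:]]
            # step 5: cluster the union, then split the labels back
            _, C_k = Cl(np.vstack([S1, S2]), k)
            values.append(dis_hat(S1, C_k[:sample_size], S2, C_k[sample_size:]))
        v = np.asarray(values)
        profile[k] = (v - v.mean()) / v.std()   # step 6: standardization
    return profile
```

Steps 7–8 then score each profile[k] by a concentration statistic and select the k whose distribution is most concentrated.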

Algorithm description (4) Remarks about the algorithm: Need for standardization (step 6): the clustering algorithm may not determine the correct cluster for an outlier, which adds noise to the result. The noise level decreases in k, since fewer data elements are assigned to distant centroids. Standardization reduces this noise level. Choice of the optimal k as the most concentrated (step 8): if k is smaller than the "true" number of clusters, then at least one cluster is formed by uniting two separate clusters, and the distribution of the statistic is therefore less concentrated. If k is larger than the "true" number of clusters, then at least one cluster is formed in a location where the sample happens to have a random concentration of data elements. This again decreases the concentration of the statistic, because two independent samples are not likely to have the same random concentration.

Numerical experiments In order to evaluate the performance of the described methodology, we carried out several numerical experiments on synthetic and real datasets. The samples selected in steps 3 and 4 of the algorithm are clustered by applying the K-Means algorithm. The results obtained are used as inputs for steps 4 and 5 of the algorithm. The quality of the k* partitions is evaluated (step 7 of the algorithm) by three concentration statistics: the Friedman index, the KL-distance and the kurtosis.
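Of the three concentration statistics named here, only the kurtosis has an off-the-shelf implementation we can show without guessing at the paper's exact definitions; the Friedman index and the KL-distance are therefore omitted from this sketch, and which direction signals "most concentrated" follows the paper's convention:

```python
from scipy.stats import kurtosis

# Score each candidate k by the (Pearson) kurtosis of its standardized
# statistic values; profile comes from stability_profile above.
scores = {k: kurtosis(v, fisher=False) for k, v in profile.items()}
```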

Numerical experiments (2) We demonstrate the performance of our algorithm by comparing our clustering results to the "true" structure of a real dataset. This dataset is chosen from the text collections available at http://www.dcs.gla.ac.uk/idom/ir resources/test collections/. The set consists of the following three collections: • DC0, the Medlars collection (1033 medical abstracts); • DC1, the CISI collection (1460 information science abstracts); • DC2, the Cranfield collection (1400 aerodynamics abstracts).

Numerical experiments (3) Following the well-known "bag of words" approach, 300 and 600 "best" terms were selected, and the thirty leading principal components were found. With the number of samples and the sample size both equal to 1000, and with K(x, y) = ||x - y||2, we obtained:

k                 7        6        5        4        3        2
Friedman index    0.0647   0.0867   0.0919   0.0299   0.0150   0.0193
KL-distance       0.1711   0.1633   0.1541   0.0767   0.0558   0.0522
Kurtosis          7.2073   5.6070   4.6352   3.0289   2.4233   3.3275

Numerical experiments (4) We can see that two of the three indexes indicate three clusters in the data, in agreement with the three text collections that make up the dataset.

Thank you