Clustering and Testing in High-Dimensional Data M. Radavičius, G. Jakimauskas, J. Sušinskas (Institute of Mathematics and Informatics, Vilnius, Lithuania)

The problem
Let X = X_N be a sample of size N supposed to satisfy a d-dimensional Gaussian mixture model, where the dimension d is supposed to be large. Because of the large dimension it is natural to project the sample onto k-dimensional (k = 1, 2,…) linear subspaces using the projection pursuit method (Huber (1985), Friedman (1987)), which gives the best selection of these subspaces. If the distribution of the standardized sample on the complement of a linear subspace H is standard Gaussian, H is called the discriminant subspace. E.g., if we have q Gaussian mixture components with equal covariance matrices, then the dimension of the discriminant subspace is q – 1. Having an estimate of the discriminant subspace, classification can be performed much more easily using the projected sample.
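The q – 1 bound for equal covariance matrices follows from a standard fact about Gaussian posterior log-odds; in the illustrative notation below, μ_j, Σ and p_j denote the component means, common covariance and mixing weights, and π_j(x) denotes the a posteriori probability of the jth component:

```latex
% Posterior log-odds of component j against component 1 are linear in x
% when all components share the covariance matrix \Sigma:
\log \frac{\pi_j(x)}{\pi_1(x)}
  = \log\frac{p_j}{p_1}
    + (\mu_j - \mu_1)^{\top}\Sigma^{-1}x
    - \tfrac{1}{2}\left(\mu_j^{\top}\Sigma^{-1}\mu_j - \mu_1^{\top}\Sigma^{-1}\mu_1\right),
  \qquad j = 2,\dots,q.
```

Hence the a posteriori probabilities depend on x only through the q – 1 linear statistics (μ_j – μ_1)ᵀΣ⁻¹x, so the discriminant subspace has dimension at most q – 1.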

The sequential procedure applied to the standardized sample is the following (k = 1, 2,…, until the hypothesis of the discriminant subspace holds for some k); a schematic sketch of the loop is given below.
1. Find the best k-dimensional linear subspace using the projection pursuit method (Rudzkis and Radavičius (1999)).
2. Fit a Gaussian mixture model to the sample projected onto the k-dimensional linear subspace (Rudzkis and Radavičius (1995)).
3. Test the goodness-of-fit of the estimated d-dimensional model, assuming that the distribution on the complement space is standard Gaussian. If the test fails, increase k and go to step 1.
The problem in step 1 is to find the basis vectors in the high-dimensional space (we do not cover this problem here). The problem in step 3, in the common approach, is comparing a nonparametric density estimate with a parametric one in a high-dimensional space.
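A minimal sketch of this loop, assuming the three steps are supplied as callables; the callables are placeholders of this illustration, not the authors' implementations.

```python
import numpy as np

def estimate_discriminant_subspace(X, projection_pursuit, fit_mixture,
                                   gof_test_passes, k_max=10):
    """X: (N, d) standardized sample.
    projection_pursuit(X, k)            -> (d, k) orthonormal basis of the best subspace
    fit_mixture(X_proj)                 -> Gaussian mixture fitted to the projected sample
    gof_test_passes(X, basis, mixture)  -> True if the d-dimensional model fits
    Returns the smallest k for which the discriminant-subspace hypothesis holds."""
    for k in range(1, k_max + 1):
        basis = projection_pursuit(X, k)         # step 1: best k-dimensional subspace
        mixture = fit_mixture(X @ basis)         # step 2: mixture model on the projection
        if gof_test_passes(X, basis, mixture):   # step 3: goodness-of-fit test
            return k, basis, mixture
        # test failed: increase k and repeat
    raise RuntimeError("no discriminant subspace found for k <= k_max")
```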

We present a simple, data-driven and computationally efficient procedure for testing goodness-of-fit. The procedure is based on the well-known interpretation of goodness-of-fit testing as a classification problem, a special sequential data partition procedure, randomization and resampling, and elements of sequential testing. Monte-Carlo simulations are used to assess the performance of the procedure. The procedure can also be applied to testing the independence of components in high-dimensional data. We present some preliminary computer simulation results.

Introduction
Let X = {X(1), X(2),…, X(N)} be a sample of size N from a d-dimensional Gaussian mixture with q components. Consider the general classification problem of estimating the a posteriori probabilities of the mixture components from the sample.
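In standard notation (the symbols p_j, μ_j, Σ_j, φ and π_j are illustrative here, not necessarily the original ones), the mixture density and the a posteriori probabilities can be written as:

```latex
% q-component Gaussian mixture and the a posteriori probability of component j
f(x) = \sum_{j=1}^{q} p_j\,\varphi(x \mid \mu_j, \Sigma_j), \qquad x \in \mathbb{R}^{d},
\qquad
\pi_j(x) = \frac{p_j\,\varphi(x \mid \mu_j, \Sigma_j)}{f(x)},
```

where φ(· | μ, Σ) is the Gaussian density with mean μ and covariance Σ, and p_1,…, p_q are the mixing probabilities.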

Under these assumptions the a posteriori probabilities are determined by the mixture parameters. Usually the EM algorithm is used to estimate them: starting from initial parameter values, the algorithm alternates between computing the posterior probabilities of the components for each observation (E-step) and updating the mixture parameters from these probabilities (M-step). The EM algorithm converges to some local maximum of the likelihood function
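A minimal sketch of the standard EM iteration for a Gaussian mixture, as a generic illustration of the procedure described above (not the particular estimator of Rudzkis and Radavičius (1995)):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian_mixture(X, q, n_iter=100, seed=0):
    """X: (N, d) sample; q: number of components.
    Returns mixing weights, means, covariances and posterior probabilities."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    weights = np.full(q, 1.0 / q)
    means = X[rng.choice(N, q, replace=False)]           # initialize at random points
    covs = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * q)
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation
        dens = np.column_stack([
            weights[j] * multivariate_normal.pdf(X, means[j], covs[j])
            for j in range(q)
        ])
        post = dens / dens.sum(axis=1, keepdims=True)     # (N, q)
        # M-step: update weights, means and covariances
        Nj = post.sum(axis=0)                             # effective counts
        weights = Nj / N
        means = (post.T @ X) / Nj[:, None]
        for j in range(q):
            Xc = X - means[j]
            covs[j] = (post[:, j, None] * Xc).T @ Xc / Nj[j] + 1e-6 * np.eye(d)
    return weights, means, covs, post
```

The returned array post contains the estimated a posteriori probabilities of the components for every observation.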

which usually is not equal to the global maximum. Suppose that for some linear subspace H the a posteriori probabilities depend on an observation only through its projection onto H,

and the subspace H has the maximal dimension with this property; then H is called the discriminant subspace. We lose no information on the a posteriori probabilities when we project the sample onto the discriminant subspace. An estimate of the discriminant subspace can be obtained using the projection pursuit procedure (see, e.g., J. H. Friedman (1987), S. A. Aivazyan (1996), R. Rudzkis and M. Radavičius (1998)).

Test statistics
Let X = {X(1), X(2),…, X(N)} be a sample of size N of i.i.d. random vectors with a common distribution function F on R^d. Let 𝓕_H and 𝓕_A be two disjoint classes of d-dimensional distributions, and consider the nonparametric hypothesis testing problem H: F ∈ 𝓕_H against the alternative A: F ∈ 𝓕_A. Fix F_H ∈ 𝓕_H and p ∈ (0, 1), and consider the mixture model F^(p) = (1 – p)F_H + pF

of two populations Π_H and Π with d.f. F_H and F, respectively. Fix p and let Y = Y(p) denote a random vector with the mixture distribution F^(p). Let Z = Z(p) be the posterior probability of the population Π given Y, i.e. Z = p f(Y) / ((1 – p) f_H(Y) + p f(Y)). Here f and f_H denote the distribution densities of F and F_H, respectively. Let us introduce the loss function l(F, F_H) = E(Z – p)^2. Note that l(F, F_H) = 0 if and only if F = F_H, so large values of the loss indicate a departure from the hypothesis.

Let P = {P_k, k = 1, 2,…} be a sequence of partitions of R^d, possibly dependent on Y, and let {A_k} be the corresponding sequence of σ-algebras generated by these partitions. A computationally efficient choice of P is the sequential dyadic coordinate-wise partition minimizing the mean square error at each step. Let X^(H) = {X^(H)(1), X^(H)(2),…, X^(H)(M)} be a sample of size M of i.i.d. vectors from Π_H. It is also supposed that X^(H) is independent of X. Set

In view of the definition of the loss function, a natural choice of the test statistic would be a χ²-type statistic based on the partition P_k for some k, which can be treated as a smoothing parameter. Here E_MN stands for the expectation with respect to the empirical distribution of Y. However, since the optimal value of k is unknown, we prefer the following definition of the test statistics T_k, where a_k and b_k are centering and scaling parameters to be specified.

We have selected test statistics T_k defined in terms of the numbers of observations of the sample X (respectively, of the sample X^(H)) falling into the jth area of the kth partition P_k, centered and scaled by the parameters a_k and b_k.
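As an illustration of a χ²-type statistic of this kind, a sketch under the assumption N = M (as in the simulations below); this is not necessarily the authors' exact definition:

```python
import numpy as np

def chi_square_statistic(nu, nu_H):
    """nu, nu_H: arrays of counts of the samples X and X^(H) over the cells
    of the k-th partition (equal sample sizes N = M assumed).
    Returns the classical two-sample chi-square-type statistic."""
    nu = np.asarray(nu, dtype=float)
    nu_H = np.asarray(nu_H, dtype=float)
    total = nu + nu_H
    nonempty = total > 0                       # skip empty cells
    return float(((nu - nu_H)[nonempty] ** 2 / total[nonempty]).sum())

def T_k(nu, nu_H, a_k, b_k):
    """Centered and scaled statistic; a_k and b_k are chosen so that T_k is
    approximately standard Gaussian under the hypothesis H."""
    return (chi_square_statistic(nu, nu_H) - a_k) / b_k
```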

Illustration of the sequential dyadic partitioning procedure
Here we have an example (at some step) of the sequential partitioning procedure with two samples of two-dimensional data. The next partition is selected from all current squares and all divisions along each dimension (in this case d = 2) so as to achieve the minimum mean square error of grouping.
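A minimal sketch of such a sequential dyadic coordinate-wise partition, assuming the split criterion is the within-cell sum of squared deviations of the pooled data; the authors' exact criterion may differ.

```python
import numpy as np

def within_cell_sse(points):
    """Sum of squared deviations of the points from their cell mean."""
    if len(points) == 0:
        return 0.0
    return float(((points - points.mean(axis=0)) ** 2).sum())

def dyadic_partition(Y, n_splits):
    """Y: (N, d) data; returns a list of (lower, upper, points) cells obtained
    by repeatedly halving one cell along one coordinate."""
    lower, upper = Y.min(axis=0), Y.max(axis=0)
    cells = [(lower.copy(), upper.copy(), Y)]
    for _ in range(n_splits):
        best = None
        for idx, (lo, hi, pts) in enumerate(cells):
            for dim in range(Y.shape[1]):
                mid = 0.5 * (lo[dim] + hi[dim])          # dyadic (midpoint) split
                left, right = pts[pts[:, dim] <= mid], pts[pts[:, dim] > mid]
                gain = within_cell_sse(pts) - within_cell_sse(left) - within_cell_sse(right)
                if best is None or gain > best[0]:
                    best = (gain, idx, dim, mid, left, right)
        _, idx, dim, mid, left, right = best
        lo, hi, _ = cells.pop(idx)                        # replace the chosen cell
        hi_left, lo_right = hi.copy(), lo.copy()
        hi_left[dim], lo_right[dim] = mid, mid
        cells.append((lo, hi_left, left))
        cells.append((lo_right, hi, right))
    return cells
```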

Preliminary simulation results
The computer simulations have been performed by the Monte-Carlo method (typically 100 independent simulations). The sample sizes of X and X^(H) were selected equal (typically N = M = 1000). The first problem is to evaluate, by computer simulation, the test statistics T_k in the case when the hypothesis H holds. The centering and scaling parameters of the test statistics were selected in such a way that the distribution of the test statistic is approximately standard Gaussian for each k not very close to 1 and K. The computer simulation results show that, for a very wide range of dimensions, sample sizes and distributions, the behaviour of the test statistics when the hypothesis H holds is very similar.
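One simple way to obtain such centering and scaling parameters is to calibrate them by Monte-Carlo simulation under the hypothesis; the sketch below illustrates this idea with hypothetical callables simulate_under_H and compute_chi2, and is not the authors' calibration procedure.

```python
import numpy as np

def calibrate_centering_scaling(simulate_under_H, compute_chi2, K, n_rep=100, seed=0):
    """simulate_under_H(rng) -> (X, X_H) simulated so that the hypothesis H holds;
    compute_chi2(X, X_H, k)  -> raw chi-square-type statistic on the k-th partition.
    Returns arrays a, b of length K: the means and standard deviations of the raw
    statistics over the replications, used as a_k and b_k."""
    rng = np.random.default_rng(seed)
    chi2 = np.empty((n_rep, K))
    for r in range(n_rep):
        X, X_H = simulate_under_H(rng)
        for k in range(K):
            chi2[r, k] = compute_chi2(X, X_H, k + 1)
    a = chi2.mean(axis=0)            # centering parameters a_k
    b = chi2.std(axis=0, ddof=1)     # scaling parameters b_k
    return a, b
```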

Fig. 1. Behaviour of T_k when the hypothesis holds
Here we have sample size N = 1000, dimension d = 100, and two samples from the d-dimensional standard Gaussian distribution. We plot the maxima and minima over 100 realizations, together with the corresponding maxima and minima obtained after excluding the 5 per cent largest values at each point.

Fig. 2. Behaviour of T_k when the hypothesis does not hold
Here we have sample size N = 1000, dimension d = 10, q = 3, and a Gaussian mixture with means (–4, –3, 0, 0, 0,…), (0, 6, 0, 0, 0,…), (4, –3, 0, 0, 0,…). The sample is projected onto a one-dimensional subspace. This is an extreme case of misfit.

Fig. 3. Behaviour of T_k (control data)
This is a control example for the data in Fig. 2, assuming that we project the data onto the true two-dimensional discriminant subspace.

Fig. 4. Behaviour of T_k when the hypothesis does not hold
Here we have sample size N = 1000, dimension d = 10, q = 3, and a Gaussian mixture with means (–4, –1, 0, 0, 0,…), (0, 2, 0, 0, 0,…), (4, –1, 0, 0, 0,…). The sample is projected onto a one-dimensional subspace.

Fig. 5. Behaviour of T_k (control data)
This is a control example for the data in Fig. 4, assuming that we project the data onto the true two-dimensional discriminant subspace.

Fig. 6. Behaviour of T_k when the hypothesis does not hold
Here we have sample size N = 1000, dimension d = 10, q = 3, and a Gaussian mixture with means (–4, –0.5, 0, 0, 0,…), (0, 1, 0, 0, 0,…), (4, –0.5, 0, 0, 0,…). The sample is projected onto a one-dimensional subspace.

Fig. 7. Behaviour of T_k (control data)
This is a control example for the data in Fig. 6, assuming that we project the data onto the true two-dimensional discriminant subspace.

Fig. 8. Behaviour of T_k when the hypothesis does not hold
Here we have sample size N = 1000, dimension d = 20, and the standard Cauchy distribution. The sample X^(H) is simulated with two independent blocks of components of dimensions d_1 = d/2 and d_2 = d/2.

Fig. 9. Behaviour of T_k (control data)
This is a control example for the data in Fig. 8, assuming that the sample X^(H) is simulated from the same distribution as the sample X.

Fig. 10. Behaviour of T_k when the hypothesis does not hold
Here we have sample size N = 1000, dimension d = 10, and the Student distribution with 3 degrees of freedom. The independent blocks of components have dimensions d_1 = 1 and d_2 = d – 1.

Fig. 11. Behaviour of T_k (control data)
This is a control example for the data in Fig. 10, assuming that the sample X^(H) is simulated from the same distribution as the sample X.

end.