
Model order selection for clustered bio-molecular data
Alberto Bertoni, Giorgio Valentini
DSI - Dipartimento di Scienze dell'Informazione, Università degli Studi di Milano
PMSB 2006, Tuusula (Finland)

Motivations and objectives

Bio-medical motivations:
- Finding "robust" subclasses of pathologies using bio-molecular data.
- Discovering reliable structures in high-dimensional bio-molecular data.

More general motivations:
- Assessing the reliability of clusterings discovered in high-dimensional data.
- Estimating the significance of the discovered clusterings.

Objectives:
- Development of stability-based methods designed to discover structures in high-dimensional bio-molecular data.
- Development of methods to find multiple and hierarchical structures in the data.
- Assessing the significance of the solutions through the application of statistical tests in the context of unsupervised model order selection problems.

Model order selection through stability-based procedures

In this conceptual framework, multiple clusterings are obtained by introducing perturbations (e.g. subsampling, BenHur et al., 2002; noise injection, McShane et al., 2003) into the original data, and a clustering is considered reliable if it is approximately maintained across multiple perturbations.

A general stability-based procedure to estimate the reliability of a given clustering:
1. Randomly perturb the data many times according to a given perturbation procedure.
2. Apply a given clustering algorithm to the perturbed data.
3. Apply a given clustering similarity measure (e.g. the Jaccard similarity) to multiple pairs of k-clusterings obtained according to steps 1 and 2.
4. Use the similarity measures to assess the stability of the given clustering.
5. Repeat steps 1 to 4 for multiple values of k and select the most stable clustering(s) as the most reliable.
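Step 3's pairwise clustering similarity can be made concrete with the Jaccard index computed over co-clustered pairs of points; the following pure-Python sketch (function names are ours) compares two label vectors defined over the same set of points:

```python
from itertools import combinations

def comembership_pairs(labels):
    """Set of index pairs (i, j), i < j, assigned to the same cluster."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}

def jaccard_similarity(labels_a, labels_b):
    """Jaccard index over co-clustered pairs: |A & B| / |A | B|,
    where A and B are the same-cluster pair sets of the two clusterings."""
    a, b = comembership_pairs(labels_a), comembership_pairs(labels_b)
    union = a | b
    if not union:  # both clusterings are all singletons
        return 1.0
    return len(a & b) / len(union)
```

The measure is invariant to cluster relabeling, which is why it works on independently produced clusterings.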

A stability-based method based on random projections (1)

Data perturbation through a randomized mapping μ : R^d → R^d', such that for every pair x, y ∈ D:

(1 − ε) ||x − y||² ≤ ||μ(x) − μ(y)||² ≤ (1 + ε) ||x − y||²

An example of a randomized mapping (Plus-Minus-One randomized map, Achlioptas, 2001): μ(x) = (1/√d') R x, where R is a d' × d random matrix whose entries are chosen uniformly in {−1, +1}.

In (Bertoni and Valentini, 2006) we proposed to choose d' according to the Johnson-Lindenstrauss (JL) lemma (1984): given a data set D with |D| = n examples, there exists an ε-distortion embedding into R^d' with d' = c log n / ε², where c is a suitable constant.

Using randomized maps that obey the JL lemma, we may perturb the data while introducing only bounded distortions, approximately preserving the structure of the original data.
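A minimal sketch of the Plus-Minus-One projection with the JL-based choice of d' (function names and the default constant c are our assumptions; entries of the projection matrix are ±1/√d'):

```python
import math
import random

def pmo_projection_matrix(d, n, eps, c=4, rng=random):
    """Plus-Minus-One random projection matrix (sketch).
    Target dimension d' = ceil(c * log(n) / eps^2) follows the
    Johnson-Lindenstrauss bound; each entry is +1/sqrt(d') or
    -1/sqrt(d') with equal probability."""
    d_prime = math.ceil(c * math.log(n) / eps ** 2)
    scale = 1.0 / math.sqrt(d_prime)
    return [[rng.choice((-scale, scale)) for _ in range(d)]
            for _ in range(d_prime)]

def project(R, x):
    """Map a d-dimensional point x to d' dimensions via R."""
    return [sum(r_i * x_i for r_i, x_i in zip(row, x)) for row in R]
```

Note that d' depends only on the number of examples n and the distortion ε, not on the original dimension d, which is what makes the approach attractive for very high-dimensional data.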

A stability-based method based on random projections (2): the MOSRAM algorithm

MOSRAM (Model Order Selection by Randomized Maps)

Input:
  D: a dataset; k_max: max number of clusters; n: number of pairs of random projections;
  μ: a randomized map; Clust: a clustering algorithm; sim: a clustering similarity measure.
Output:
  M(i, k): list of similarity measures for each k (1 ≤ i ≤ n, 2 ≤ k ≤ k_max)

begin
  for k := 2 to k_max do
    for i := 1 to n do
      proj_a := μ(D)
      proj_b := μ(D)
      C_a := Clust(proj_a, k)
      C_b := Clust(proj_b, k)
      M(i, k) := sim(C_a, C_b)
    endfor
  endfor
end.
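The pseudocode above translates almost line-by-line into Python; Clust, sim and the randomized map are caller-supplied, so the sketch below (names are ours) only fixes the loop structure:

```python
def mosram(D, k_max, n, project, clust, sim):
    """Sketch of MOSRAM following the slide's pseudocode.
    project(D) -> a randomly projected copy of D (fresh random map per call)
    clust(data, k) -> list of cluster labels, one per point of D
    sim(a, b)     -> similarity in [0, 1] between two k-clusterings
    Returns M with M[k][i] = similarity of the i-th pair of
    independently projected clusterings, for k = 2 .. k_max."""
    M = {k: [0.0] * n for k in range(2, k_max + 1)}
    for k in range(2, k_max + 1):
        for i in range(n):
            proj_a, proj_b = project(D), project(D)
            c_a, c_b = clust(proj_a, k), clust(proj_b, k)
            M[k][i] = sim(c_a, c_b)
    return M
```

Since each projection preserves the points (only their coordinates change), the two label vectors refer to the same objects and can be compared directly.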

Using the distribution of the similarities to estimate the stability

S_k (0 ≤ S_k ≤ 1) is the random variable given by the similarity between two k-clusterings obtained by applying a clustering algorithm to pairs of independently perturbed data, where μ is a randomized perturbation procedure.

f_k(s) is the density function of S_k. We have:

E[S_k] = ∫₀¹ s f_k(s) ds

g(k) is a parameter of concentration (BenHur et al., 2002).

We may observe the following facts:
- E[S_k] can be used as a good index of the reliability of the k-clusterings.
- E[S_k] may be estimated through the empirical means ξ_k:

ξ_k = (1/n) Σ_{i=1}^{n} M(i, k)

where M(i, k) is the similarity of the i-th pair of perturbed k-clusterings.

Note that we use the overall distribution of the similarity measures to assess the stability of the k-clusterings.
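The empirical means ξ_k, and the ordering of the k-clusterings by stability used in the next slides, follow directly from the matrix M produced by MOSRAM; a small sketch (function names are ours):

```python
def empirical_means(M):
    """xi_k = (1/n) * sum_i M[k][i], an estimate of E[S_k],
    given M as a dict mapping k to its list of n similarity values."""
    return {k: sum(vals) / len(vals) for k, vals in M.items()}

def sort_by_stability(xi):
    """Values of k sorted by non-increasing empirical mean xi_k."""
    return sorted(xi, key=xi.get, reverse=True)
```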

A χ²-based method to estimate the significance of the discovered clusterings (1)

We may sort the empirical means ξ_k: p is the index permutation such that ξ_{p(1)} ≥ ξ_{p(2)} ≥ … ≥ ξ_{p(k_max − 1)}.

For each k-clustering, we consider two groups of pairwise clustering similarity values separated by a threshold t_o. Thus we may obtain:

P(S_k > t_o) = 1 − F_k(t_o), where F_k is the cumulative distribution function of S_k.

x_k = P(S_k > t_o) · n is the number of times the similarity values are larger than t_o, where n is the number of repeated similarity measurements. Hence x_k may be interpreted as the number of successes from a binomial population with parameter θ_k: setting X_k as a random variable that counts how many times S_k > t_o, we have X_k ~ Binomial(n, θ_k).

The unknown common value θ is estimated through its pooled estimate:

θ̂ = (Σ_k x_k) / (K n), where K is the number of k-clusterings under test.

We can compute the following statistic:

Y = Σ_k (x_k − n θ̂)² / (n θ̂ (1 − θ̂))

which under the null hypothesis of equal proportions is approximately χ²-distributed with K − 1 degrees of freedom.
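The Y statistic for testing equality of the binomial proportions θ_k can be sketched as follows (the function name is ours; x[j] counts how often S_k > t_o among n measurements for the j-th k-clustering):

```python
def proportion_chi2(x, n):
    """Test statistic Y for equality of len(x) binomial proportions,
    each with x[j] successes out of n trials, using the pooled
    estimate theta_hat = sum(x) / (len(x) * n). Under H0, Y is
    approximately chi-square with len(x) - 1 degrees of freedom.
    Assumes 0 < theta_hat < 1 (otherwise the denominator vanishes)."""
    theta = sum(x) / (len(x) * n)
    denom = n * theta * (1.0 - theta)
    return sum((x_j - n * theta) ** 2 for x_j in x) / denom
```

For example, with two groups of n = 10 measurements and counts [8, 2], the pooled estimate is 0.5 and Y = 7.2, which exceeds the 0.01 critical value of the χ² distribution with one degree of freedom (about 6.63), so H0 would be rejected.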

A χ²-based method to estimate the significance of the discovered clusterings (2)

Using the previous Y statistic we can test the following hypotheses:
- H₀: all the θ_k are equal to θ (the considered set of k-clusterings are equally reliable).
- H_a: the θ_k are not all equal (the considered set of k-clusterings are not equally reliable).

If Y > χ²_{α, K−1} we may reject the null hypothesis at significance level α; that is, we may conclude that with probability 1 − α the considered proportions are different, and hence that at least one k-clustering significantly differs from the others.

The test is iterated until no significant difference of the similarities between the k-clusterings is detected. We start by considering all the k-clusterings. If a difference at significance level α is registered according to the statistical test, we exclude the last clustering (according to the sorting of the ξ_k) and repeat the test with the remaining k-clusterings. This process is iterated until no significant difference is detected: the set of the remaining (top-sorted) k-clusterings represents the set of estimated stable numbers of clusters discovered (at significance level α).
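The iterative elimination can be sketched as below; to keep the sketch dependency-free, the p-value computation is abstracted into a caller-supplied chi2_test (e.g. the survival function of the χ² distribution with the appropriate degrees of freedom, such as scipy.stats.chi2.sf applied to the Y statistic). All names here are our assumptions:

```python
def stable_clusterings(counts, n, chi2_test, alpha=0.01):
    """Iterative procedure from the slide (sketch).
    counts: list of (k, x_k) pairs sorted by decreasing xi_k;
    n: number of similarity measurements per k-clustering;
    chi2_test(xs, n) -> p-value for equality of the proportions xs/n.
    Drops the worst-ranked clustering while the test rejects H0."""
    remaining = list(counts)
    while len(remaining) > 1:
        xs = [x for _, x in remaining]
        if chi2_test(xs, n) >= alpha:   # no significant difference left
            break
        remaining.pop()                 # drop the least stable clustering
    return [k for k, _ in remaining]
```

The returned values of k are the candidate stable model orders at significance level α.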

Experiments with high-dimensional synthetic data (I)
- 1000-dimensional synthetic data
- Data distributed according to a multivariate Gaussian distribution
- 2 or 6 clusters of data (as highlighted by the PCA projection onto the first two principal components)

Histograms of the similarity measures obtained by applying PAM clustering to 100 pairs of PMO projections from 1000 to 471-dimensional subspaces (ε = 0.2).

Experiments with high-dimensional synthetic data (II)

2 and 6 clusters are selected at the 0.01 significance level.

[Table: p-value, mean and variance of the similarity measures for each k, sorted according to the means.]
[Figure: empirical cumulative distribution of the similarity measures for different k-clusterings.]

Detection of multiple structures

3, 6 and 12 clusters are selected at the 0.01 significance level.

[Table: p-value, mean and variance of the similarity measures for each k.]
[Figure: empirical cumulative distribution of the similarity measures for different k-clusterings.]

Discovering significant structures in bio-molecular data (Leukemia data, Golub et al., 1999)

2 and 3 clusters are selected at the 0.01 significance level.

[Table: p-value, mean and variance of the similarity measures for each k.]
[Figure: empirical cumulative distribution of the similarity measures for different k-clusterings.]

Comparison with other methods

Numbers of clusters k selected by each method on two gene expression data sets:

Method | Leukemia (Golub et al., 1999) | Lymphoma (Alizadeh et al., 2000)
Class. risk (Lange et al., 2004) | k=3 | k=2
Gap statistic (Tibshirani et al., 2001) | k=10 | k=4
Clest (Dudoit and Fridlyand, 2002) | k=3 | k=2
Figure of Merit (Levine & Domany, 2001) | k=2,8,10 | k=2,9
Model Explorer (BenHur et al., 2002) | k=2 | k=2
MOSRAM | k=2,3 | k=2,(3)*

* Note that the subdivision of the Lymphoma samples into 3 classes (DLBCL, CLL and FL) is performed on a histopathological and morphological basis, and this classification does not seem to correspond to the bio-molecular classification (Alizadeh et al., 2000).

Conclusions
- The proposed stability method based on random projections is well-suited to discovering structures in high-dimensional bio-medical data.
- The reliability of the discovered k-clusterings may be estimated by exploiting the distribution of the pairwise clustering similarities, together with a χ²-based statistical test tailored to unsupervised model order selection.
- The χ²-based test assumes that the underlying random variables are normally distributed. We are developing a new distribution-independent approach based on the Bernstein inequality to assess the significance of the discovered k-clusterings.