PMSB 2006, Tuusula (Finland): A. Bertoni, G. Valentini, DSI - Univ. Milano


1 PMSB 2006, Tuusula (Finland)
Model order selection for clustered bio-molecular data
Alberto Bertoni, Giorgio Valentini
{bertoni,valentini}@dsi.unimi.it
http://homes.dsi.unimi.it/~valenti
DSI - Dipartimento di Scienze dell’Informazione, Università degli Studi di Milano

2 Motivations and objectives
Bio-medical motivations:
- Finding “robust” subclasses of pathologies using bio-molecular data.
- Discovering reliable structures in high-dimensional bio-molecular data.
More general motivations:
- Assessing the reliability of clusterings discovered in high-dimensional data.
- Estimating the significance of the discovered clusterings.
Objectives:
- Developing stability-based methods designed to discover structures in high-dimensional bio-molecular data.
- Developing methods to find multiple and hierarchical structures in the data.
- Assessing the significance of the solutions by applying statistical tests in the context of unsupervised model order selection problems.

3 Model order selection through stability-based procedures
In this conceptual framework, multiple clusterings are obtained by introducing perturbations into the original data (e.g. subsampling, Ben-Hur et al., 2002; noise injection, McShane et al., 2003), and a clustering is considered reliable if it is approximately maintained across multiple perturbations.
A general stability-based procedure to estimate the reliability of a given clustering:
1. Randomly perturb the data many times according to a given perturbation procedure.
2. Apply a given clustering algorithm to the perturbed data.
3. Apply a given clustering similarity measure (e.g. Jaccard similarity) to multiple pairs of k-clusterings obtained according to steps 1 and 2.
4. Use the similarity measures to assess the stability of a given clustering.
5. Repeat steps 1 to 4 for multiple values of k and select the most stable clustering(s) as the most reliable.
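The similarity measure in step 3 can be sketched as follows. This is a minimal illustration, assuming each clustering is given as a list of cluster labels over the same examples and that the Jaccard coefficient is computed on pairs of examples placed in the same cluster; the function name is hypothetical and not taken from the slides.

```python
from itertools import combinations

def jaccard_similarity(labels_a, labels_b):
    """Jaccard coefficient between two clusterings of the same examples.

    Numerator: pairs of examples co-clustered in both clusterings.
    Denominator: pairs of examples co-clustered in at least one of them.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same examples")
    both = either = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        both += same_a and same_b
        either += same_a or same_b
    # Two all-singleton clusterings share no pairs; treat them as identical.
    return both / either if either else 1.0

# Label names do not matter, only which examples are grouped together.
print(jaccard_similarity([0, 0, 1, 1], [1, 1, 0, 0]))  # identical partitions
print(jaccard_similarity([0, 0, 1, 1], [0, 1, 0, 1]))  # no shared pairs
```

Working on pairs of examples makes the measure independent of how the clusters are numbered, which matters because two runs of a clustering algorithm rarely assign the same label to the same group.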

4 A stability-based method based on random projections (1)
Data perturbation through a randomized mapping such that, for every pair of examples in the data, the distance between them is distorted only within a bound [formula not preserved in the transcript].
An example of a randomized mapping (Plus-Minus-One randomized map, Achlioptas, 2001): [formula not preserved in the transcript]
In (Bertoni and Valentini, 2006) we proposed choosing $d'$ according to the Johnson-Lindenstrauss (JL) lemma (1984): given a data set $D$ with $|D| = n$ examples, there exists an $\varepsilon$-distortion embedding into $R^{d'}$ with $d' = c \log n / \varepsilon^2$, where $c$ is a suitable constant.
Using randomized maps that obey the JL lemma, we can perturb the data while introducing only bounded distortions, approximately preserving the structure of the original data.
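A Plus-Minus-One projection and the JL target dimension can be sketched as below. Several points are assumptions rather than content recovered from the slide: the logarithm is taken as natural, the constant c = 4 is only a placeholder, the matrix entries are scaled by 1/sqrt(d'), and the distortion is measured on squared distances. All function names are hypothetical.

```python
import math
import random

def jl_dimension(n, eps, c=4.0):
    """Target dimension d' = c * log(n) / eps**2, rounded up (c is assumed)."""
    return math.ceil(c * math.log(n) / eps ** 2)

def pmo_projection(data, d_new, rng):
    """Project the rows of `data` with a random matrix of +1/-1 entries.

    Each entry is +1 or -1 with probability 1/2, scaled by 1/sqrt(d_new) so
    that squared distances are preserved in expectation.
    """
    d = len(data[0])
    scale = 1.0 / math.sqrt(d_new)
    matrix = [[rng.choice((-1.0, 1.0)) * scale for _ in range(d)]
              for _ in range(d_new)]
    return [[sum(r * x for r, x in zip(row, point)) for row in matrix]
            for point in data]

def max_distortion(data, projected):
    """Largest relative change in squared distance over all pairs of rows."""
    worst = 0.0
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            orig = sum((a - b) ** 2 for a, b in zip(data[i], data[j]))
            if orig == 0.0:
                continue  # coincident points carry no distance information
            new = sum((a - b) ** 2 for a, b in zip(projected[i], projected[j]))
            worst = max(worst, abs(new / orig - 1.0))
    return worst

rng = random.Random(0)
data = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(20)]
d_new = jl_dimension(len(data), eps=0.5)
projected = pmo_projection(data, d_new, rng)
print(d_new, max_distortion(data, projected))
```

Because the constant c here is a guess, the printed distortion is only an empirical check and is not guaranteed to stay below the chosen eps.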

5 A stability-based method based on random projections (2): the MOSRAM algorithm
MOSRAM (Model Order Selection by Randomized Maps)
Input:
- D: a dataset
- k_max: maximum number of clusters
- n: number of pairs of random projections
- μ: a randomized map
- Clust: a clustering algorithm
- sim: a clustering similarity measure
Output:
- M(i,k): list of similarity measures for each k (1 ≤ i ≤ n, 2 ≤ k ≤ k_max)
begin
  for k := 2 to k_max do
    for i := 1 to n do
      proj_a := μ(D)
      proj_b := μ(D)
      C_a := Clust(proj_a, k)
      C_b := Clust(proj_b, k)
      M(i,k) := sim(C_a, C_b)
    endfor
  endfor
end.
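The loop structure of MOSRAM translates almost directly into Python. In the sketch below only the `mosram` function mirrors the pseudocode; the perturbation (Gaussian noise instead of a random projection), the 1-D gap-based clustering, and the Rand index are toy stand-ins chosen to keep the example self-contained, and are not the components used by the authors.

```python
import random

def mosram(data, k_max, n_pairs, random_map, clust, sim):
    """Collect n_pairs similarity values for every k in 2..k_max."""
    similarities = {}
    for k in range(2, k_max + 1):
        values = []
        for _ in range(n_pairs):
            proj_a = random_map(data)  # two independent perturbations of D
            proj_b = random_map(data)
            values.append(sim(clust(proj_a, k), clust(proj_b, k)))
        similarities[k] = values
    return similarities

# --- Toy stand-ins (assumptions, not the components used on the slides) ---

def noisy_copy(values, rng, sigma=0.02):
    """Perturb 1-D data with small Gaussian noise."""
    return [v + rng.gauss(0.0, sigma) for v in values]

def gap_clustering(values, k):
    """Cut the sorted 1-D values at the k-1 largest gaps; one label per value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    gap = lambda pos: values[order[pos]] - values[order[pos - 1]]
    cuts = set(sorted(range(1, len(order)), key=gap, reverse=True)[:k - 1])
    labels, label = [0] * len(values), 0
    for pos, idx in enumerate(order):
        if pos in cuts:
            label += 1
        labels[idx] = label
    return labels

def rand_index(labels_a, labels_b):
    """Fraction of example pairs on which the two clusterings agree."""
    n = len(labels_a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

rng = random.Random(0)
data = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]  # two well-separated groups
result = mosram(data, k_max=4, n_pairs=20,
                random_map=lambda d: noisy_copy(d, rng),
                clust=gap_clustering, sim=rand_index)
for k, values in result.items():
    print(k, sum(values) / len(values))  # mean similarity for each k
```

Passing the map, the clustering algorithm, and the similarity measure as arguments follows the pseudocode's input list, so a random projection and a real clustering algorithm can be swapped in without touching the loop.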

6 Using the distribution of the similarities to estimate the stability
$S_k$ ($0 \le S_k \le 1$) is the random variable given by the similarity between two k-clusterings obtained by applying a clustering algorithm to pairs of independently perturbed data. The intuitive idea is that if $S_k$ is concentrated close to 1, the corresponding clustering is stable with respect to a given controlled perturbation and hence it is reliable.
$f_k(s)$ is the density function of $S_k$. We have: [equation not preserved in the transcript], where $g(k)$ is a parameter of concentration (Ben-Hur et al., 2002).
We may observe the following facts:
- $E[S_k]$ can be used as a good index of the reliability of the k-clusterings.
- $E[S_k]$ may be estimated through the empirical means $\xi_k$: [equation not preserved in the transcript], where $\mu$ is a randomized perturbation procedure.
Note that we use the overall distribution of the similarity measures to assess the stability of the k-clusterings.
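The formulas on this slide did not survive in the transcript. One plausible reading, offered here as an assumption rather than a recovery of the original, takes $g(k)$ to be the integral of the cumulative distribution $F_k$ of $S_k$ over $[0,1]$. Under that reading, integration by parts links $g(k)$ to the expectation, and the empirical mean is the average of the MOSRAM outputs $M(i,k)$:

```latex
% Assumed definitions: F_k is the CDF of S_k, g(k) its integral on [0,1].
\[
F_k(s) = \int_0^s f_k(t)\,dt, \qquad g(k) = \int_0^1 F_k(s)\,ds
\]
% Integration by parts, using F_k(1) = 1 because S_k takes values in [0,1].
\[
E[S_k] = \int_0^1 s\,f_k(s)\,ds
       = \bigl[s\,F_k(s)\bigr]_0^1 - \int_0^1 F_k(s)\,ds
       = 1 - g(k)
\]
% Empirical mean over the n similarity values returned for a given k.
\[
\xi_k = \frac{1}{n}\sum_{i=1}^{n} M(i,k)
\]
```

With these definitions a small $g(k)$ and a mean $\xi_k$ close to 1 express the same thing: the similarities for that k are concentrated near 1.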

7 A $\chi^2$-based method to estimate the significance of the discovered clusterings (1)
We may sort the $\xi_k$: $p$ is the index permutation such that the means are in non-increasing order, $\xi_{p(1)} \ge \xi_{p(2)} \ge \dots$
For each k-clustering, we consider two groups of pairwise clustering similarity values separated by a threshold $t_o$. Thus we obtain $P(S_k > t_o) = 1 - F_k(t_o)$, where $F_k$ is the cumulative distribution function of $S_k$.
$x_k = P(S_k > t_o)\,n$ is the number of times the similarity values are larger than $t_o$, where $n$ is the number of repeated similarity measurements. Hence $x_k$ may be interpreted as the number of successes from a binomial population with parameter $\theta_k$.
Setting $X_k$ as a random variable that counts how many times $S_k > t_o$, we have: [equation not preserved in the transcript]
The unknown $\theta_k$ is estimated through its pooled estimate: [equation not preserved in the transcript]
We can compute the following statistic: [equation not preserved in the transcript]
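The three formulas announced on this slide are missing from the transcript. The block below gives the standard forms for a chi-square test of equal proportions, with the binomial parameter written as $\theta_k$ and $K$ denoting the set of values of k under comparison; it is assumed, not confirmed, that the slide used the same expressions.

```latex
% Binomial model for the number of similarities above the threshold t_o.
\[
P(X_k = x_k) = \binom{n}{x_k}\,\theta_k^{x_k}\,(1-\theta_k)^{n-x_k}
\]
% Pooled estimate of the common proportion under the null hypothesis.
\[
\hat{\theta} = \frac{\sum_{k \in K} x_k}{|K|\,n}
\]
% Test statistic, approximately chi-square with |K|-1 degrees of freedom.
\[
Y = \sum_{k \in K} \frac{(x_k - n\hat{\theta})^2}{n\,\hat{\theta}\,(1-\hat{\theta})}
\]
% Worked example with n = 20 and counts x = (20, 20, 0):
\[
\hat{\theta} = \frac{40}{60} = \frac{2}{3}, \qquad
Y = \frac{2\,(20 - 13.33)^2 + (0 - 13.33)^2}{20 \cdot \tfrac{2}{3} \cdot \tfrac{1}{3}}
  = \frac{266.67}{4.44} = 60
\]
```

In the worked example the third clustering never exceeds the threshold, and the resulting value of $Y$ is far beyond any usual critical value for 2 degrees of freedom.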

8 A $\chi^2$-based method to estimate the significance of the discovered clusterings (2)
Using the previous Y statistic we can test the following hypotheses:
- $H_0$: all the $\theta_k$ are equal to $\theta$ (the considered k-clusterings are equally reliable)
- $H_a$: the $\theta_k$ are not all equal (the considered k-clusterings are not equally reliable)
If [condition not preserved in the transcript] we may reject the null hypothesis at significance level $\alpha$; that is, we may conclude that with probability $1-\alpha$ the considered proportions are different, and hence that at least one k-clustering significantly differs from the others.
The test is iterated until no significant difference between the similarities of the k-clusterings is detected. We start by considering all the k-clusterings. If the test registers a difference at significance level $\alpha$, we exclude the last clustering (according to the sorting of the $\xi_k$) and repeat the test with the remaining k-clusterings. This process is iterated until no significant difference is detected: the remaining (top-sorted) k-clusterings represent the set of estimated stable numbers of clusters (at significance level $\alpha$).
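The iterated test can be sketched in a few lines of Python. The sketch assumes the standard chi-square statistic for equal proportions with a pooled estimate, takes the threshold and significance level as parameters, and computes the chi-square tail with a small helper based on the incomplete-gamma recurrence; all names are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Upper tail P(X >= x) of a chi-square variable with integer df >= 1."""
    y = x / 2.0
    a = 0.5 if df % 2 else 1.0
    q = math.erfc(math.sqrt(y)) if df % 2 else math.exp(-y)
    while a < df / 2.0:  # recurrence Q(a+1, y) = Q(a, y) + y^a e^-y / Gamma(a+1)
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def homogeneity_p_value(successes, n):
    """p-value of the chi-square test that all proportions successes[i]/n are equal."""
    pooled = sum(successes) / (len(successes) * n)
    if pooled in (0.0, 1.0):
        return 1.0  # all counts identical: no evidence against equal proportions
    y = sum((x - n * pooled) ** 2 for x in successes) / (n * pooled * (1.0 - pooled))
    return chi2_sf(y, len(successes) - 1)

def select_stable_orders(similarities, t0, alpha):
    """similarities maps k to its n similarity values; returns the retained k."""
    mean = lambda values: sum(values) / len(values)
    ranked = sorted(similarities, key=lambda k: mean(similarities[k]), reverse=True)
    n = len(similarities[ranked[0]])
    counts = [sum(s > t0 for s in similarities[k]) for k in ranked]
    # Drop the worst-ranked clustering while the proportions differ significantly.
    while len(ranked) > 1 and homogeneity_p_value(counts, n) < alpha:
        ranked.pop()
        counts.pop()
    return ranked

sims = {2: [1.0] * 20, 3: [0.95] * 20, 4: [0.5] * 20}
print(select_stable_orders(sims, t0=0.9, alpha=0.01))
```

In this toy input the clusterings for k = 2 and k = 3 always exceed the threshold while k = 4 never does, so the first test rejects equal proportions and only k = 4 is dropped.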

9 Experiments with high dimensional synthetic data (I)
- 1000-dimensional synthetic data
- Data distributed according to a multivariate Gaussian distribution
- 2 or 6 clusters of data (as highlighted by the PCA projection onto the two principal components)
[Figure: histograms of the similarity measures obtained by applying PAM clustering to 100 pairs of PMO projections from 1000-dimensional to 471-dimensional subspaces ($\varepsilon = 0.2$)]

10 Experiments with high dimensional synthetic data (II)
2 and 6 clusters are selected at 0.01 significance level.
Sorting according to the means (mean and variance refer to the similarity):
 k    p-value   mean     variance
 2    ----      1.0000   0.0000
 6    1.0000    1.0000   0.0000
 7    0.0000    0.9217   0.0016
 8    0.0000    0.8711   0.0033
 9    0.0000    0.8132   0.0042
 5    0.0000    0.8090   0.0104
 3    0.0000    0.8072   0.0157
 10   0.0000    0.7715   0.0056
 4    0.0000    0.7642   0.0158
[Figure: empirical cumulative distribution of the similarity measures for different k-clusterings]

11 Detection of multiple structures
3, 6 and 12 clusters are selected at 0.01 significance level.
 k    p-value      mean     variance
 3    --------     1.0000   0.0000e+00
 6    1.0000e+00   0.9979   1.6185e-05
 12   1.0000e+00   0.9907   8.0657e-05
 13   6.9792e-03   0.9809   2.8658e-04
 14   2.2928e-06   0.9754   3.3594e-04
 15   0.0000e+00   0.9580   6.8150e-04
 7    0.0000e+00   0.9435   2.3055e-03
 8    0.0000e+00   0.8954   4.6829e-03
 5    0.0000e+00   0.8947   1.5433e-02
 11   0.0000e+00   0.8897   3.2340e-03
 9    0.0000e+00   0.8706   6.9421e-03
 10   0.0000e+00   0.8691   5.0763e-03
 4    0.0000e+00   0.8609   9.3463e-03
 2    0.0000e+00   0.8532   2.3234e-02
[Figure: empirical cumulative distribution of the similarity measures for different k-clusterings]

12 Discovering significant structures in bio-molecular data (Leukemia data, Golub et al. 1999)
2 and 3 clusters are selected at 0.01 significance level.
Mean and variance refer to the similarity:
 k    p-value      mean     variance
 2    ---------    0.8285   0.0077
 3    7.3280e-01   0.8060   0.0124
 4    2.3279e-06   0.6589   0.0060
 5    9.5199e-11   0.6012   0.0073
 6    6.3282e-15   0.5424   0.0057
 7    0.0000e+00   0.5160   0.0062
 8    0.0000e+00   0.4865   0.0050
 9    0.0000e+00   0.4819   0.0060
 10   0.0000e+00   0.4744   0.0049
[Figure: empirical cumulative distribution of the similarity measures for different k-clusterings]

13 Comparison with other methods
Methods (column headings, in the order given on the slide):
- Class. risk (Lange et al., 2004)
- Gap statistic (Tibshirani et al., 2001)
- Clest (Dudoit and Fridlyand, 2002)
- Figure of Merit (Levine & Domany, 2001)
- Model Explorer (Ben-Hur et al., 2002)
- MOSRAM
- “True” number k
Data sets (values in the order given on the slide):
- Leukemia (Golub et al., 1999): k=3; k=10; k=3; k=2,8,10; k=2; k=2,3
- Lymphoma (Alizadeh et al., 2000): k=2; k=4; k=2; k=2,9; k=2; k=2,(3)*
* Note that the subdivision of Lymphoma samples into 3 classes (DLBCL, CLL and FL) is performed on a histopathological and morphological basis, and this classification does not seem to correspond to the bio-molecular classification (Alizadeh et al., 2000).

14 Conclusions
- The proposed stability method based on random projections is well-suited to discover structures in high-dimensional bio-medical data.
- The reliability of the discovered k-clusterings may be estimated by exploiting the distribution of the pairwise clustering similarities, together with a $\chi^2$-based statistical test tailored to unsupervised model order selection.
- The $\chi^2$-based test assumes that the random variables are normally distributed. We are developing a new distribution-independent approach based on the Bernstein inequality to assess the significance of the discovered k-clusterings.

