1 Effective Feature Selection Framework for Cluster Analysis of Microarray Data. Gouchol Pok, Computer Science Dept., Yanbian University, China; Keun Ho Ryu, DB/Bioinformatics Lab, Chungbuk Nat'l University, Korea.
2 Outline: Background, Motivation, Proposed Method, Experiments, Conclusion.
3 Feature Selection. Definition: the process of selecting a subset of relevant features for building robust learning models. Objectives: alleviating the effect of the curse of dimensionality; enhancing generalization capability; speeding up the learning process; improving model interpretability. (From Wikipedia: http://en.wikipedia.org/wiki/Feature_selection)
4 Issues in Feature Selection. How to compute the degree to which a feature is relevant to the class (discrimination); how to decide whether a selected feature is redundant with other features (strongly correlated); how to select features so that class-discriminating power is not diminished. In short: remove irrelevancy, remove redundancy, and maintain class-discriminating power.
5 Selection Modes. Univariate method: considers one feature at a time, ranked by a score; typical measures are correlation, information measures, the K-S statistic, etc. Multivariate method: considers subsets of features together, e.g. Bayesian and PCA-based selection; in principle more powerful than the univariate method, but not always in practice (Guyon2008).
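To make the univariate mode concrete, here is a minimal sketch (not from the slides; the function and variable names are hypothetical, and Pearson correlation is used as the score, one of the measures listed above) that ranks genes by their absolute correlation with the class labels:

```python
import numpy as np

def rank_genes_univariate(X, y):
    """Rank genes by absolute Pearson correlation with class labels.

    X: (n_genes, n_samples) expression matrix
    y: (n_samples,) binary class labels (0/1)
    Returns gene indices sorted from most to least relevant.
    """
    Xc = X - X.mean(axis=1, keepdims=True)   # center each gene (row)
    yc = y - y.mean()                        # center the label vector
    num = Xc @ yc                            # per-gene covariance with labels
    den = np.sqrt((Xc ** 2).sum(axis=1) * (yc ** 2).sum()) + 1e-12
    scores = np.abs(num / den)               # |Pearson r| per gene
    return np.argsort(scores)[::-1]          # best-scoring genes first
```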
6 Hard Case for the Univariate Method (Guyon2008*). *Adapted from Guyon's tutorial at the IPAM summer school.
7 Proposed Method: Motivation. A method that fits 2-D microarray data in its typical form: thousands of genes (rows) and hundreds of samples (columns). A multivariate approach in which feature relevancy and redundancy are addressed simultaneously.
8 System Flow. [Figure: data matrix with genes as rows and samples as columns.]
9 System Flow (cont.)
10 Methods: Step 1. Perform a column-based difference operation: D_i(N, M) = C(N, M) − C_i(N, 1), i = 1, 2, …, M. The difference operator may depend on the application, e.g. Euclidean or Manhattan distance. D_i(N, M) contains class-specific information w.r.t. each gene.
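A minimal sketch of Step 1 (hypothetical names; the absolute, Manhattan-style difference is assumed as the operator, since the slides leave the choice open):

```python
import numpy as np

def column_difference(C, i):
    """Step 1: form D_i(N, M) by differencing every column of C
    against reference sample i, i.e. D_i = |C - C_i|.

    C: (N, M) array of N genes x M samples
    i: index of the reference sample (column)
    """
    # C[:, [i]] keeps the column 2-D so it broadcasts across all M columns
    return np.abs(C - C[:, [i]])
```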
11 Methods: Step 2. Apply thresholds to find "emerging pattern"-like features that contrast the two classes. Suppose samples 1, 2, …, j ∈ C1 and j+1, j+2, …, M ∈ C2. Sort the values in each column of D_i(N, M); apply a 25% threshold to the same-class differences and a 75% threshold to the different-class differences.
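One plausible reading of Step 2, sketched below (the direction of the two thresholds is an assumption: small differences to same-class samples and large differences to different-class samples are taken as class-specific evidence):

```python
import numpy as np

def binarize_differences(D, same_class, q_same=25, q_diff=75):
    """Step 2: binarize a difference matrix D_i column by column.

    D: (N, M) difference matrix from Step 1
    same_class: (M,) bool array, True where a column's sample shares
        the reference sample's class
    An entry is set to 1 when a same-class difference falls below its
    column's 25th percentile, or a different-class difference rises
    above its column's 75th percentile.
    """
    lo = np.percentile(D, q_same, axis=0)  # per-column 25% thresholds
    hi = np.percentile(D, q_diff, axis=0)  # per-column 75% thresholds
    return np.where(same_class, D <= lo, D >= hi).astype(int)
```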
12 Methods: Step 3. Extract class-specific features by within-class summation of the binary values (count the 1's per gene). A combined sketch of Steps 3 and 4 follows Step 4.
13 Methods: Step 4. Gene selection: apply a different threshold value for each class. After this step, the row-wise (gene) reduction is complete.
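A combined sketch of Steps 3 and 4 (hypothetical names; per-class count thresholds are passed in, since the slides do not specify how they are chosen):

```python
import numpy as np

def select_genes(B, labels, thresholds):
    """Steps 3-4: within-class counts, then class-specific gene selection.

    B: (N, M) binary matrix from Step 2
    labels: (M,) array of class labels, one per sample
    thresholds: dict mapping class label -> minimum within-class count
    Returns indices of genes retained for at least one class.
    """
    labels = np.asarray(labels)
    keep = np.zeros(B.shape[0], dtype=bool)
    for c, t in thresholds.items():
        counts = B[:, labels == c].sum(axis=1)  # Step 3: count 1's per gene
        keep |= counts >= t                     # Step 4: per-class threshold
    return np.flatnonzero(keep)
```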
14 Methods: Step 5. Column-wise reduction by clustering: samples are classified using the NMF method.
15 Nonnegative Matrix Factorization (NMF). Matrix factorization: A ≈ VH, where A is an n × m matrix of n genes and m samples; V (n × k) has k columns called basis vectors; H (k × m) describes how strongly each building block is present in the measurement vectors.
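A minimal sketch of the factorization (using scikit-learn's NMF is an assumption; the slides do not name an implementation):

```python
from sklearn.decomposition import NMF

def factorize(A, k, seed=0):
    """Factor the nonnegative matrix A (n genes x m samples) as A ~ V @ H.

    V: (n, k) basis vectors ("metagenes")
    H: (k, m) strength of each metagene in each sample
    """
    model = NMF(n_components=k, init="nndsvd", random_state=seed, max_iter=500)
    V = model.fit_transform(A)  # (n, k) basis matrix
    H = model.components_       # (k, m) encoding matrix
    return V, H
```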
16 NMF: Parts-Based Clustering (Brunet2004). Brunet et al. introduced the metagene concept.
17 Experiments: Datasets. Leukemia Data: 5000 genes, 38 samples in two classes (19 samples of ALL-B type, 8 of ALL-T type, and 11 of AML type). Medulloblastoma Data: 5893 genes, 34 samples in two classes (25 classic type and 9 desmoplastic medulloblastoma type). Central Nervous System Tumors Data: 7129 genes, 34 samples in four classes (10 classic medulloblastomas, 10 malignant gliomas, 10 rhabdoids, and 4 normals).
18 Classification. Given a target sample, its class is predicted by the highest value in the corresponding k-dimensional column vector of H.
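This decision rule is a per-column argmax over H; a one-line sketch:

```python
import numpy as np

def predict_classes(H):
    """Assign each sample (column of H) to the metagene with the
    largest coefficient, per the rule on this slide."""
    return np.argmax(H, axis=0)  # (m,) cluster/class index per sample
```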
19 Results: Leukemia Data (ALL-T vs. ALL-B vs. AML).
20 Results: Medulloblastoma Data (Classic vs. Desmoplastic).
21 Results: Central Nervous System Tumors Data (4 classes).
22 Conclusions & Future Work. Our approach captures groups of features, but in contrast to holistic methods such as PCA and ICA, the intrinsic structure of the data distribution is preserved in the reduced space. Still, PCA and ICA can serve as aids for examining the structure of the data distribution and can provide useful information for further processing by other methods. Our ongoing research addresses how to combine PCA and ICA with the proposed method.
23 References
Wikipedia, http://en.wikipedia.org/wiki/Feature_selection
J.-P. Brunet, P. Tamayo, T. Golub, and J. P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. PNAS, 101(12):4164-4169, 2004.
L. Yu and H. Liu. Feature selection for high-dimensional data: a fast correlation-based filter solution. In Proc. 12th Int. Conf. on Machine Learning (ICML-03), pages 856-863, 2003.
J. Biesiada and W. Duch. Feature selection for high-dimensional data: a Kolmogorov-Smirnov correlation-based filter solution. In Proc. CORES'05, Advances in Soft Computing, Springer, pages 95-104, 2005.
D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788-791, 1999.
24 Questions?