1 Effective Feature Selection Framework for Cluster Analysis of Microarray Data. Gouchol Pok, Computer Science Dept., Yanbian University, China; Keun Ho Ryu, DB/Bioinformatics Lab, Chungbuk Nat'l University, Korea.
2 Outline: Background, Motivation, Proposed Method, Experiments, Conclusion.
3 Feature Selection. Definition: the process of selecting a subset of relevant features for building robust learning models. Objectives: alleviating the effect of the curse of dimensionality; enhancing generalization capability; speeding up the learning process; improving model interpretability. (From Wikipedia: http://en.wikipedia.org/wiki/Feature_selection)
4 Issues in Feature Selection. How to compute the degree to which a feature is relevant to the class (discrimination); how to decide whether a selected feature is redundant with other features (strongly correlated); how to select features so that class-discriminating power is not diminished. In short: remove irrelevancy, remove redundancy, and maintain class-discriminating power.
5 Selection Modes. Univariate method: considers one feature at a time, ranked by a score; typical measures are correlation, information measures, the K-S statistic, etc. Multivariate method: considers subsets of features together, e.g. Bayesian and PCA-based selection; in principle more powerful than the univariate method, but not always in practice (Guyon2008).
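To make the univariate mode concrete, here is a minimal sketch (not from the slides; the function and variable names are hypothetical, and Pearson correlation is used as the score, one of the measures listed above) that ranks genes by their absolute correlation with the class labels:

```python
import numpy as np

def rank_genes_univariate(X, y):
    """Rank genes by absolute Pearson correlation with class labels.

    X: (n_genes, n_samples) expression matrix
    y: (n_samples,) binary class labels (0/1)
    Returns gene indices sorted from most to least relevant.
    """
    Xc = X - X.mean(axis=1, keepdims=True)   # center each gene (row)
    yc = y - y.mean()                        # center the label vector
    num = Xc @ yc                            # per-gene covariance with labels
    den = np.sqrt((Xc ** 2).sum(axis=1) * (yc ** 2).sum()) + 1e-12
    scores = np.abs(num / den)               # |Pearson r| per gene
    return np.argsort(scores)[::-1]          # best-scoring genes first
```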
6 Hard Case for the Univariate Method (Guyon2008*). *Adapted from Guyon's tutorial at the IPAM summer school.
7 Proposed Method: Motivation. A method that fits 2-D microarray data in its typical form: thousands of genes (rows) and hundreds of samples (columns). A multivariate approach in which feature relevancy and redundancy are addressed simultaneously.
8 System Flow. [Figure: data matrix with genes as rows and samples as columns.]
9 System Flow (cont.)
10 Methods: Step 1. Perform a column-based difference operation: D_i(N, M) = C(N, M) − C_i(N, 1), i = 1, 2, …, M. The difference operator may depend on the application, e.g. Euclidean or Manhattan distance. D_i(N, M) contains class-specific information w.r.t. each gene.
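A minimal sketch of Step 1 (hypothetical names; the absolute, Manhattan-style difference is assumed as the operator, since the slides leave the choice open):

```python
import numpy as np

def column_difference(C, i):
    """Step 1: form D_i(N, M) by differencing every column of C
    against reference sample i, i.e. D_i = |C - C_i|.

    C: (N, M) array of N genes x M samples
    i: index of the reference sample (column)
    """
    # C[:, [i]] keeps the column 2-D so it broadcasts across all M columns
    return np.abs(C - C[:, [i]])
```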
11 Methods: Step 2. Apply thresholds to find "emerging pattern"-like features that contrast the two classes. Suppose samples 1, 2, …, j ∈ C1 and j+1, j+2, …, M ∈ C2. Sort the values in each column of D_i(N, M); apply a 25% threshold to the same-class differences and a 75% threshold to the different-class differences.
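One plausible reading of Step 2, sketched below (the direction of the two thresholds is an assumption: small differences to same-class samples and large differences to different-class samples are taken as class-specific evidence):

```python
import numpy as np

def binarize_differences(D, same_class, q_same=25, q_diff=75):
    """Step 2: binarize a difference matrix D_i column by column.

    D: (N, M) difference matrix from Step 1
    same_class: (M,) bool array, True where a column's sample shares
        the reference sample's class
    An entry is set to 1 when a same-class difference falls below its
    column's 25th percentile, or a different-class difference rises
    above its column's 75th percentile.
    """
    lo = np.percentile(D, q_same, axis=0)  # per-column 25% thresholds
    hi = np.percentile(D, q_diff, axis=0)  # per-column 75% thresholds
    return np.where(same_class, D <= lo, D >= hi).astype(int)
```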
12 Methods: Step 3. Extract class-specific features by within-class summation of the binary values (count the 1's per gene). A combined sketch of Steps 3 and 4 follows Step 4.
13 Methods: Step 4. Gene selection: apply a different threshold value for each class. After this step, the row-wise (gene) reduction is complete.
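A combined sketch of Steps 3 and 4 (hypothetical names; per-class count thresholds are passed in, since the slides do not specify how they are chosen):

```python
import numpy as np

def select_genes(B, labels, thresholds):
    """Steps 3-4: within-class counts, then class-specific gene selection.

    B: (N, M) binary matrix from Step 2
    labels: (M,) array of class labels, one per sample
    thresholds: dict mapping class label -> minimum within-class count
    Returns indices of genes retained for at least one class.
    """
    labels = np.asarray(labels)
    keep = np.zeros(B.shape[0], dtype=bool)
    for c, t in thresholds.items():
        counts = B[:, labels == c].sum(axis=1)  # Step 3: count 1's per gene
        keep |= counts >= t                     # Step 4: per-class threshold
    return np.flatnonzero(keep)
```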
14 Methods: Step 5. Column-wise reduction by clustering: samples are classified using the NMF method.
15 Nonnegative Matrix Factorization (NMF). Matrix factorization: A ≈ VH, where A is an n × m matrix of n genes and m samples; V (n × k) has k columns called basis vectors; H (k × m) describes how strongly each building block is present in the measurement vectors.
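A minimal sketch of the factorization (using scikit-learn's NMF is an assumption; the slides do not name an implementation):

```python
from sklearn.decomposition import NMF

def factorize(A, k, seed=0):
    """Factor the nonnegative matrix A (n genes x m samples) as A ~ V @ H.

    V: (n, k) basis vectors ("metagenes")
    H: (k, m) strength of each metagene in each sample
    """
    model = NMF(n_components=k, init="nndsvd", random_state=seed, max_iter=500)
    V = model.fit_transform(A)  # (n, k) basis matrix
    H = model.components_       # (k, m) encoding matrix
    return V, H
```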
16 NMF: Parts-Based Clustering (Brunet2004). Brunet et al. introduced the metagene concept.
17 Experiments: Datasets. Leukemia Data: 5000 genes, 38 samples in two classes (19 samples of ALL-B type, 8 of ALL-T type, and 11 of AML type). Medulloblastoma Data: 5893 genes, 34 samples in two classes (25 classic type and 9 desmoplastic medulloblastoma type). Central Nervous System Tumors Data: 7129 genes, 34 samples in four classes (10 classic medulloblastomas, 10 malignant gliomas, 10 rhabdoids, and 4 normals).
18 Classification. Given a target sample, its class is predicted by the highest value in the corresponding k-dimensional column vector of H.
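This decision rule is a per-column argmax over H; a one-line sketch:

```python
import numpy as np

def predict_classes(H):
    """Assign each sample (column of H) to the metagene with the
    largest coefficient, per the rule on this slide."""
    return np.argmax(H, axis=0)  # (m,) cluster/class index per sample
```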
19 Results: Leukemia Data (ALL-T vs. ALL-B vs. AML).
20 Results: Medulloblastoma Data (Classic vs. Desmoplastic).
21 Results: Central Nervous System Tumors Data (4 classes).
22 Conclusions & Future Work. Our approach captures groups of features, but in contrast to holistic methods such as PCA and ICA, the intrinsic structure of the data distribution is preserved in the reduced space. Still, PCA and ICA can serve as aids for examining the structure of the data distribution and can provide useful information for further processing by other methods. Our ongoing research addresses how to combine PCA and ICA with the proposed method.
23 References
Wikipedia, http://en.wikipedia.org/wiki/Feature_selection
J.-P. Brunet, P. Tamayo, T. Golub, and J. P. Mesirov. Metagenes and molecular pattern discovery using matrix factorization. PNAS, 101(12):4164-4169, 2004.
L. Yu and H. Liu. Feature selection for high-dimensional data: a fast correlation-based filter solution. In Proc. 12th Int. Conf. on Machine Learning (ICML-03), pages 856-863, 2003.
J. Biesiada and W. Duch. Feature selection for high-dimensional data: a Kolmogorov-Smirnov correlation-based filter solution. In Proc. CORES'05, Advances in Soft Computing, Springer, pages 95-104, 2005.
D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788-791, 1999.
24 Questions?