An Evaluation of Gene Selection Methods for Multi-class Microarray Data Classification by Carlotta Domeniconi and Hong Chai

Outline
- Introduction to microarray data
- Problem description
- Related work
- Our methods
- Experimental analysis
- Results
- Conclusions and future work

Microarray
- Measures gene expression levels across different conditions, time points, or tissue samples.
- Gene expression levels inform cell activity and disease status.
- Microarray data can distinguish between tumor types, define new subtypes, predict prognostic outcome, identify candidate drugs, assess drug toxicity, etc.

Microarray Data A matrix of measurements: rows are gene expression levels; columns are samples/conditions.

Example – Lymphoma Dataset

Microarray data analysis
- Clustering: applied to genes, to identify genes with similar functions or that participate in similar biological processes; or to samples, to find potential tumor subclasses.
- Classification: builds a model to predict diseased samples. Diagnostic value.

Classification Problem
- Large number of genes (features): a dataset may contain up to 20,000 features.
- Small number of experiments (samples): at most a few hundred, usually fewer than 100.
- The need to identify "marker genes" to classify tissue types (e.g., diagnose cancer) makes this a feature selection problem.

Our Focus
- Binary classification and feature selection methods have been studied extensively; the multi-class case has received little attention.
- In practice, many microarray datasets have more than two categories of samples.
- We focus on multi-class gene ranking and selection.

Related Work
Some criteria used in feature ranking:
- Correlation coefficient
- Information gain
- Chi-squared
- SVM-RFE

Notation
- C classes
- m observations (samples or patients)
- n feature measurements (gene expressions)
- class labels y = 1, ..., C

Correlation Coefficient
Two-class problem: y ∈ {-1, +1}. Ranking criterion defined in Golub:

$w_j = \frac{\mu_j^{+} - \mu_j^{-}}{\sigma_j^{+} + \sigma_j^{-}}$

where $\mu_j^{\pm}$ is the mean and $\sigma_j^{\pm}$ the standard deviation along dimension j in the + and − classes. A large $|w_j|$ indicates a discriminant feature.

Fischer’s score Fisher’s criterion score in Pavlidis:

Assumption of the above methods
- Features are analyzed in isolation; correlations between features are not considered.
- Assumption: features are independent of each other.
- Implication: redundant genes may be selected into the top subset.

Information Gain
A measure of the effectiveness of a feature in classifying the training data: the expected reduction in entropy caused by partitioning the data according to this feature.

$\mathrm{Gain}(S, A) = E(S) - \sum_{v \in V(A)} \frac{|S_v|}{|S|}\, E(S_v)$

where V(A) is the set of all possible values of feature A, and $S_v$ is the subset of S for which feature A has value v.

E(S) is the entropy of the entire set S:

$E(S) = -\sum_{i=1}^{C} \frac{|C_i|}{|S|} \log_2 \frac{|C_i|}{|S|}$

where $|C_i|$ is the number of training samples in class $C_i$, and $|S|$ is the cardinality of the entire set S.
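A short sketch of information gain as defined above, assuming the feature has already been discretized (the slides discretize continuous expression values for Chi-squared; the same preprocessing applies here):

```python
import numpy as np

def entropy(labels):
    """E(S) = -sum_i p_i log2 p_i over class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(feature_values, labels):
    """Expected entropy reduction from partitioning S by a discrete feature A."""
    gain = entropy(labels)
    m = len(labels)
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]       # S_v: samples where A = v
        gain -= (len(subset) / m) * entropy(subset)
    return gain
```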

Chi-squared
- Measures features individually.
- Continuous-valued features are discretized into intervals.
- Form a matrix A, where $A_{ij}$ is the number of samples of class $C_i$ within the j-th interval. Let $R_j$ be the number of samples in the j-th interval.

The expected frequency of $A_{ij}$ is

$E_{ij} = \frac{R_j \, |C_i|}{|S|}$

The Chi-squared statistic of a feature is defined as

$\chi^2 = \sum_{i=1}^{C} \sum_{j=1}^{I} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}$

where I is the number of intervals. The larger the statistic, the more informative the feature is.
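A sketch of the per-feature Chi-squared statistic, mirroring the formulas above (scipy.stats.chi2_contingency computes the same quantity; the explicit version is shown for clarity, and names are illustrative):

```python
import numpy as np

def chi2_score(intervals, labels):
    """Chi-squared statistic for one discretized feature.

    intervals: interval index of each sample (after discretization);
    labels: class label of each sample.
    """
    classes, class_idx = np.unique(labels, return_inverse=True)
    bins, bin_idx = np.unique(intervals, return_inverse=True)
    A = np.zeros((len(classes), len(bins)))           # A[i, j]: class i count in interval j
    np.add.at(A, (class_idx, bin_idx), 1)
    N = A.sum()
    E = np.outer(A.sum(axis=1), A.sum(axis=0)) / N    # expected frequencies E_ij
    return ((A - E) ** 2 / E).sum()
```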

SVM-RFE
Recursive Feature Elimination using SVM. A linear SVM on the full feature set classifies by

$\mathrm{sign}(w \cdot x + b)$

where w is a vector of weights (one per feature), x is an input instance, and b a threshold. If $w_i = 0$, feature $X_i$ does not influence classification and can be eliminated from the set of features.

SVM-RFE
1. Train a linear SVM on the current feature set to obtain w.
2. Sort features in descending order of weight; eliminate a fixed percentage of the lowest-ranked features.
3. Build a new linear SVM using the reduced feature set; repeat the process.
4. Choose the best-performing feature subset.
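A compact SVM-RFE loop, sketched with scikit-learn's LinearSVC (an implementation assumption; the slides do not name a library, and drop_frac and n_keep are illustrative parameters). sklearn.feature_selection.RFE offers a packaged equivalent.

```python
import numpy as np
from sklearn.svm import LinearSVC  # assumes scikit-learn is available

def svm_rfe(X, y, drop_frac=0.1, n_keep=100):
    """Recursive Feature Elimination with a linear SVM (sketch).

    Repeatedly fit, rank features by |w|, drop the lowest-weighted fraction.
    Returns the indices of the surviving features.
    """
    active = np.arange(X.shape[1])
    while len(active) > n_keep:
        clf = LinearSVC(C=1.0, max_iter=10000).fit(X[:, active], y)
        w = np.abs(clf.coef_).sum(axis=0)        # aggregate |w| over one-vs-rest classifiers
        n_drop = max(1, int(drop_frac * len(active)))
        active = active[np.argsort(w)[n_drop:]]  # keep the higher-weighted features
    return active
```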

Other criteria
- The Brown-Forsythe, Cochran, and Welch test statistics used in Chen et al. (extensions of the t-statistic for the two-class classification problem).
- PCA (disadvantage: new dimensions are formed, and none of the original features can be discarded, so it cannot identify marker genes).

Our Ranking Methods
- BScatter
- MinMax
- bSum
- bMax
- bMin
- Combined

Notation
For each class i and each feature j, define the mean value of feature j for class $C_i$:

$\mu_{j,i} = \frac{1}{m_i} \sum_{x \in C_i} x_j$

where $m_i$ is the number of samples in class $C_i$. Define the total mean along feature j:

$\mu_j = \frac{1}{m} \sum_{x} x_j$

Define the between-class scatter along feature j:

$B_j = \sum_{i=1}^{C} (\mu_{j,i} - \mu_j)^2$

Function 1: BScatter
Fisher discriminant analysis for multiple classes under the feature-independence assumption. It credits the largest score to the feature that maximizes the ratio of the between-class scatter to the within-class scatter, e.g.

$\mathrm{BScatter}(j) = \frac{B_j}{\sum_{i=1}^{C} \sigma_{j,i}^2}$

where $\sigma_{j,i}$ is the standard deviation of class i along feature j.

Function 2: MinMax
Favors features along which the farthest mean-class difference is large and the within-class variance is small, e.g.

$\mathrm{MinMax}(j) = \frac{\max_i \mu_{j,i} - \min_i \mu_{j,i}}{\sum_{i=1}^{C} \sigma_{j,i}}$

Function 3: bSum
For each feature j, we sort the C values $\mu_{j,i}$ in non-decreasing order: $\mu_{j,1} \le \mu_{j,2} \le \dots \le \mu_{j,C}$. Define $b_{j,l} = \mu_{j,l+1} - \mu_{j,l}$ for $l = 1, \dots, C-1$. bSum rewards the features with large distances between adjacent mean class values:

$\mathrm{bSum}(j) = \sum_{l=1}^{C-1} b_{j,l}$

Function 4: bMax
Rewards features j with a large between-neighbor-class mean difference:

$\mathrm{bMax}(j) = \max_{1 \le l \le C-1} b_{j,l}$

Function 5: bMin
Favors the features whose smallest between-neighbor-class mean difference is large:

$\mathrm{bMin}(j) = \min_{1 \le l \le C-1} b_{j,l}$

Function 6: Comb Considers a score function which combines MinMax and bMin
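The slide formulas for these scores were lost with the images, so the sketch below implements the reconstructions given above; treat the exact normalizations as assumptions rather than the paper's verbatim definitions.

```python
import numpy as np

def class_means(X, y):
    """mu[i, j]: mean of feature j within class i."""
    return np.array([X[y == c].mean(axis=0) for c in np.unique(y)])

def bscatter(X, y):
    """Between-class scatter over summed within-class variance (reconstructed form)."""
    mu = class_means(X, y)
    between = ((mu - X.mean(axis=0)) ** 2).sum(axis=0)
    within = np.array([X[y == c].var(axis=0) for c in np.unique(y)]).sum(axis=0)
    return between / within

def minmax(X, y):
    """Largest gap between class means over summed class standard deviations."""
    mu = class_means(X, y)
    sd = np.array([X[y == c].std(axis=0) for c in np.unique(y)]).sum(axis=0)
    return (mu.max(axis=0) - mu.min(axis=0)) / sd

def b_gaps(X, y):
    """Gaps b_{j,l} between adjacent sorted class means, per feature."""
    return np.diff(np.sort(class_means(X, y), axis=0), axis=0)

def bsum(X, y): return b_gaps(X, y).sum(axis=0)
def bmax(X, y): return b_gaps(X, y).max(axis=0)
def bmin(X, y): return b_gaps(X, y).min(axis=0)
# Comb combines MinMax and bMin; the exact combination rule was on the
# slide image and is not recoverable here, so it is omitted.
```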

Datasets

Dataset  | Samples | Genes | Classes | Comment
MLL      |         |       | 3       | Available at …oarray/Supplement
Lymphoma | 89      |       | 6       | 46 DLBCL, 11 CLL, 9 FL (malignant); 11 ABB, 6 RAT, 6 TCL (normal). Available at …
NCI      |         |       | 8       | Available at …

Experiment Design
- Gene expression values scaled to [-1, 1].
- Compared 9 feature selection methods: the 6 proposed scores, Chi-squared, Information Gain, and SVM-RFE.
- Subsets of top-ranked genes used to train an SVM classifier (3 kernel functions: linear, 2nd-degree polynomial, Gaussian; soft-margin parameter in [1, 100]; Gaussian kernel width in [0.001, 2]).
- Leave-one-out cross validation, due to the small sample size (a protocol sketch follows below).
- One-vs-one multi-class classification implemented with LIBSVM.
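A minimal sketch of the evaluation protocol, assuming scikit-learn (the paper used LIBSVM directly; sklearn's SVC wraps libsvm and uses one-vs-one internally for multi-class). One deliberate difference: genes are re-ranked inside each leave-one-out fold, which avoids the selection bias the authors flag in their limitations, whereas the paper selected features over the whole training set.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC  # libsvm-based; one-vs-one for multi-class

def loo_accuracy(X, y, top_genes, score_fn, C=1.0, kernel="linear"):
    """Leave-one-out accuracy: rank genes on the training fold, train an SVM,
    test on the held-out sample.

    score_fn(X, y) -> one relevance score per gene (any ranking above);
    larger scores are assumed to mean more relevant.
    """
    correct = 0
    for train, test in LeaveOneOut().split(X):
        scaler = MinMaxScaler(feature_range=(-1, 1)).fit(X[train])  # scale to [-1, 1] as in the slides
        Xtr, Xte = scaler.transform(X[train]), scaler.transform(X[test])
        keep = np.argsort(score_fn(Xtr, y[train]))[-top_genes:]     # top-ranked genes only
        clf = SVC(C=C, kernel=kernel).fit(Xtr[:, keep], y[train])
        correct += clf.predict(Xte[:, keep])[0] == y[test][0]
    return correct / len(y)
```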

Results – MLL Dataset

Results – Lymphoma Dataset

Conclusions
- SVM classification benefits from gene selection.
- Gene ranking with correlation scores gives higher accuracy than SVM-RFE in low dimensions on most data sets; the best-performing correlation score varies from problem to problem.
- Although SVM-RFE shows excellent performance in general, there is no clear winner: the performance of feature selection methods appears problem-dependent.

Conclusions
- For a given classification model, different gene selection methods reach their best performance at different feature set sizes.
- Very high accuracy was achieved on all the data sets studied here; in many cases perfect accuracy (based on leave-one-out error) was achieved.
- The NCI60 dataset [17] shows lower accuracy values. It has the largest number of classes (eight) and smaller sample sizes per class. SVM-RFE handles this case well, achieving 96.72% accuracy with 100 selected genes and a linear kernel. The gap in accuracy between SVM-RFE and the other gene ranking methods is highest for this dataset (ca. 11.5%).

Limitations & Future Work
- Selecting features over the whole training set induces a bias in the results; future experiments will assess and correct this bias.
- Take the correlation between pairs of selected features into account: modify the ranking methods so that pairwise correlations stay below a given threshold.
- Evaluate the top-ranked genes from this study against marker genes identified in other studies.