Published by Mary Black. Modified over 9 years ago.
1
An Evaluation of Gene Selection Methods for Multi-class Microarray Data Classification by Carlotta Domeniconi and Hong Chai
2
Outline: Introduction to microarray data; Problem description; Related work; Our methods; Experimental analysis; Results; Conclusion and future work
3
Microarray: Measures gene expression levels across different conditions, times, or tissue samples. Gene expression levels inform cell activity and disease status. Microarray data distinguish between tumor types, define new subtypes, predict prognostic outcome, identify possible drugs, assess drug toxicity, etc.
4
Microarray Data A matrix of measurements: rows are gene expression levels; columns are samples/conditions.
5
Example – Lymphoma Dataset
6
Microarray data analysis: Clustering is applied to genes, to identify genes that have similar functions or participate in similar biological processes, or to samples, to find potential tumor subclasses. Classification builds a model to predict diseased samples, which has diagnostic value.
7
Classification Problem: Large number of genes (features): a dataset may contain up to 20,000 features. Small number of experiments (samples): at most hundreds, but usually fewer than 100. Hence the need to identify “marker genes” to classify tissue types, e.g. to diagnose cancer, through feature selection.
8
Our Focus: Binary classification and feature selection methods have been extensively studied; the multi-class case has received little attention. In practice, many microarray datasets have more than two categories of samples. We focus on multi-class gene ranking and selection.
9
Related Work: Some criteria used in feature ranking: correlation coefficient; information gain; chi-squared; SVM-RFE.
10
Notation: Given C classes; m observations (samples or patients); n feature measurements (gene expressions); class labels $y \in \{1, \ldots, C\}$.
11
Correlation Coefficient: Two-class problem: $y \in \{-1, +1\}$. Ranking criterion defined in Golub: $w_j = \frac{\mu_j(+) - \mu_j(-)}{\sigma_j(+) + \sigma_j(-)}$, where $\mu_j$ is the mean and $\sigma_j$ the standard deviation along dimension j in the + and − classes. A large $|w_j|$ indicates a discriminant feature.
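A minimal sketch of ranking genes by this two-class criterion, assuming the commonly cited signal-to-noise form $w_j = (\mu_j(+) - \mu_j(-)) / (\sigma_j(+) + \sigma_j(-))$ and sample standard deviations. The function names and the toy data are illustrative, not taken from the slides.

```python
from statistics import mean, stdev

def golub_score(values, labels):
    """Signal-to-noise score of one gene; labels are in {-1, +1}."""
    pos = [v for v, y in zip(values, labels) if y == +1]
    neg = [v for v, y in zip(values, labels) if y == -1]
    # Sample standard deviation, so each class needs at least two samples.
    return (mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))

def rank_genes(X, labels):
    """Return gene indices by decreasing |w_j|; X is a samples x genes list."""
    n_genes = len(X[0])
    scores = [golub_score([row[j] for row in X], labels) for j in range(n_genes)]
    return sorted(range(n_genes), key=lambda j: abs(scores[j]), reverse=True)

# Toy data (made up): gene 0 separates the classes, gene 1 does not.
X = [[1.0, 5.0], [1.2, 4.0], [3.0, 5.5], [3.2, 4.5]]
y = [-1, -1, +1, +1]
ranking = rank_genes(X, y)
```

Each gene is scored on its own here, which is exactly the independence assumption that leads to redundant genes in the top subset.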
12
Fisher’s score: Fisher’s criterion score in Pavlidis: $F(j) = \frac{(\mu_j(+) - \mu_j(-))^2}{\sigma_j(+)^2 + \sigma_j(-)^2}$
13
Assumption of the above methods: Features are analyzed in isolation, without considering correlations. Assumption: features are independent of each other. Implication: redundant genes are selected into a top subset.
14
Information Gain: A measure of the effectiveness of a feature in classifying the training data: the expected reduction in entropy caused by partitioning the data according to this feature, $Gain(S, A) = E(S) - \sum_{v \in V(A)} \frac{|S_v|}{|S|} E(S_v)$, where $V(A)$ is the set of all possible values of feature A, and $S_v$ is the subset of S for which feature A has value v.
15
Information Gain: $E(S)$ is the entropy of the entire set S: $E(S) = -\sum_{i=1}^{C} \frac{|C_i|}{|S|} \log_2 \frac{|C_i|}{|S|}$, where $|C_i|$ is the number of training data in class $C_i$, and $|S|$ is the cardinality of the entire set S.
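The two definitions above can be sketched as follows, assuming the standard textbook forms of entropy and information gain and a feature that has already been discretized. The function names and toy labels are illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """E(S): entropy of the class distribution in a list of labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of S minus the size-weighted entropy of each subset S_v."""
    total = len(labels)
    subsets = {}
    for v, y in zip(feature_values, labels):
        subsets.setdefault(v, []).append(y)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

# Toy data (made up): one feature splits the classes perfectly, one not at all.
labels = ["A", "A", "B", "B"]
gain_perfect = information_gain(["low", "low", "high", "high"], labels)
gain_useless = information_gain(["low", "high", "low", "high"], labels)
```

With two balanced classes the entropy of S is 1 bit, so the perfectly separating feature recovers the full bit and the uninformative one recovers nothing.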
16
Chi-squared: Measures features individually. Continuous-valued features are discretized into intervals. Form a matrix A, where $A_{ij}$ is the number of samples of class $C_i$ within the j-th interval. Let $C_{Ij}$ be the number of samples in the j-th interval.
17
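Chi-squared: The expected frequency of $A_{ij}$ is $E_{ij} = C_{Ij} \, |C_i| / m$, where $|C_i|$ is the number of samples in class $C_i$ and m is the total number of samples. The chi-squared statistic of a feature is defined as $\chi^2 = \sum_{i=1}^{C} \sum_{j=1}^{I} (A_{ij} - E_{ij})^2 / E_{ij}$, where I is the number of intervals. The larger the statistic, the more informative the feature.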
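A minimal sketch of the chi-squared score for one discretized feature, assuming the usual contingency-table form in which the expected count of a cell is (class total × interval total) / number of samples. How the intervals are chosen is not specified here, so the sketch takes interval indices as input; all names are illustrative.

```python
def chi_squared(intervals, labels):
    """intervals[k] is the interval index of sample k, labels[k] its class."""
    m = len(labels)
    classes = sorted(set(labels))
    bins = sorted(set(intervals))
    # A[c][b]: number of samples of class c that fall in interval b.
    A = {c: {b: 0 for b in bins} for c in classes}
    for b, c in zip(intervals, labels):
        A[c][b] += 1
    class_totals = {c: sum(A[c].values()) for c in classes}
    bin_totals = {b: sum(A[c][b] for c in classes) for b in bins}
    stat = 0.0
    for c in classes:
        for b in bins:
            expected = class_totals[c] * bin_totals[b] / m
            stat += (A[c][b] - expected) ** 2 / expected
    return stat

# Toy data (made up): two classes, two intervals.
y = [0, 0, 1, 1]
chi_informative = chi_squared([0, 0, 1, 1], y)
chi_uninformative = chi_squared([0, 1, 0, 1], y)
```

Only classes and intervals that actually occur are counted, so every expected frequency is positive and the division is safe.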
18
SVM-RFE: Recursive Feature Elimination using SVM. A linear SVM trained on the full feature set classifies with $\mathrm{sign}(w \cdot x + b)$, where w is a vector of weights (one per feature), x is an input instance, and b is a threshold. If $w_i = 0$, feature $X_i$ does not influence classification and can be eliminated from the set of features.
19
SVM-RFE: 1. After getting w for the full feature set, sort the features in descending order of weight magnitude. 2. A percentage of the lowest-ranked features is eliminated. 3. A new linear SVM is built using the new set of features, and the process is repeated. 4. The best feature subset is chosen.
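The elimination loop can be sketched as below. This is an illustration under stated assumptions: the weight-fitting step is passed in as a function, and the demo uses a simple difference of class means as a stand-in for a trained linear SVM, so the sketch runs without an SVM library. The sketch also stops at a fixed subset size (`n_keep`) instead of evaluating every intermediate subset and choosing the best one. All names and the `drop_fraction` default are hypothetical.

```python
def recursive_feature_elimination(X, y, fit_weights, n_keep, drop_fraction=0.5):
    """Return indices of surviving features.

    fit_weights(X_sub, y) must return one weight per column of X_sub;
    in SVM-RFE it would train a linear SVM and return its weight vector.
    """
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        X_sub = [[row[j] for j in active] for row in X]
        w = fit_weights(X_sub, y)
        # Rank surviving features by weight magnitude, largest first.
        order = sorted(range(len(active)), key=lambda k: abs(w[k]), reverse=True)
        n_drop = max(1, int(len(active) * drop_fraction))
        n_next = max(n_keep, len(active) - n_drop)
        active = sorted(active[k] for k in order[:n_next])
    return active

def mean_difference_weights(X_sub, y):
    """Stand-in for a linear SVM (assumption): per-feature class-mean difference."""
    pos = [row for row, label in zip(X_sub, y) if label == +1]
    neg = [row for row, label in zip(X_sub, y) if label == -1]
    return [
        sum(r[k] for r in pos) / len(pos) - sum(r[k] for r in neg) / len(neg)
        for k in range(len(X_sub[0]))
    ]

# Toy data (made up): features 0 and 3 differ most between the classes.
X = [[0.0, 1.0, 5.0, 2.0], [0.1, 1.1, 4.0, 2.1],
     [3.0, 1.0, 5.0, 2.6], [3.1, 1.2, 4.2, 2.4]]
y = [-1, -1, +1, +1]
selected = recursive_feature_elimination(X, y, mean_difference_weights, n_keep=2)
```

Refitting after every elimination round is what distinguishes this procedure from a one-shot ranking: the weights of the remaining features can change once others are removed.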
20
Other criteria: The Brown-Forsythe, Cochran, and Welch test statistics used in Chen et al. (extensions of the t-statistic used in the two-class classification problem). PCA (disadvantage: new dimensions are formed, so none of the original features can be discarded and marker genes cannot be identified).
21
Our Ranking Methods: BScatter; MinMax; bSum; bMax; bMin; Combined
22
Notation: For each class i and each feature j, we define $\mu_{j,i}$, the mean value of feature j for class $C_i$. We also define the total mean along feature j.
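A small sketch of these two quantities. The class mean follows directly from its description; for the total mean the slide's formula is not available in this transcript, so the sketch assumes it is the mean over all m samples. Function names and data are illustrative.

```python
from collections import defaultdict

def class_means(X, y):
    """Return {class: [mean of feature j over that class's samples]}."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    n = len(X[0])
    return {
        c: [sum(r[j] for r in rows) / len(rows) for j in range(n)]
        for c, rows in groups.items()
    }

def total_means(X):
    """Assumption: the total mean of feature j is its mean over all samples."""
    return [sum(row[j] for row in X) / len(X) for j in range(len(X[0]))]

# Toy data (made up): 4 samples, 2 features, 3 classes.
X = [[1.0, 10.0], [3.0, 10.0], [5.0, 20.0], [7.0, 40.0]]
y = [1, 1, 2, 3]
mu = class_means(X, y)
mu_total = total_means(X)
```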
23
Notation Define between-class scatter along feature j
24
Function 1: BScatter. Fisher discriminant analysis for multiple classes under the feature independence assumption. It credits the largest score to the feature that maximizes the ratio of the between-class scatter to the within-class scatter, where $\sigma_{j,i}$ is the standard deviation of class i along feature j.
25
Function 2: MinMax. Favors features along which the largest difference between class means is large and the within-class variance is small.
26
Function 3: bSum. For each feature j, we sort the C values $\mu_{j,i}$ in non-decreasing order: $\mu_{j,1} \le \mu_{j,2} \le \ldots \le \mu_{j,C}$. Define $b_{j,l} = \mu_{j,l+1} - \mu_{j,l}$. bSum rewards the features with large distances between adjacent mean class values.
27
Function 4: bMax. Rewards features j with a large between-neighbor-class mean difference.
28
Function 5: bMin. Favors the features whose smallest between-neighbor-class mean difference is large.
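The building block shared by bSum, bMax, and bMin is the list of gaps between neighboring class means once they are sorted. The sketch below computes only those gaps and their largest and smallest values; the exact score formulas (including any within-class normalization) are not recoverable from this transcript, so none is assumed. The function name and numbers are illustrative.

```python
def adjacent_gaps(class_mean_values):
    """Sort the C class means of one feature and return the C-1 adjacent gaps."""
    ordered = sorted(class_mean_values)
    return [hi - lo for lo, hi in zip(ordered, ordered[1:])]

# Toy example (made up): class means of a single feature for three classes.
gaps = adjacent_gaps([5.0, 2.0, 7.0])
largest_gap = max(gaps)    # the quantity a bMax-style score looks at
smallest_gap = min(gaps)   # the quantity a bMin-style score looks at
```

A feature with a large smallest gap keeps every pair of neighboring classes apart, which is the intuition behind favoring it in the multi-class setting.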
29
Function 6: Comb Considers a score function which combines MinMax and bMin
30
Datasets
Dataset | Samples | Genes | Classes | Comment
MLL | 72 | 12582 | 3 | Available at http://research.nhgri.nih.gov/microarray/Supplement
Lymphoma | 88 | 4026 | 6 | Samples per class: 46 in DLBCL, 11 in CLL, 9 in FL (malignant classes), 11 in ABB, 6 in RAT, and 6 in TCL (normal samples). Available at http://llmpp.nih.gov/lymphoma
Yeast | 80 | 5775 | 3 |
NCI60 | 61 | 1155 | 8 | Available at http://rana.lbl.gov/
31
Experiment Design: Gene expression values scaled to [-1, 1]. Compared 9 feature selection methods (the 6 proposed scores, chi-squared, information gain, and SVM-RFE). Subsets of top-ranked genes were used to train an SVM classifier (3 kernel functions: linear, degree-2 polynomial, Gaussian; soft-margin parameter in [1, 100]; Gaussian kernel parameter in [0.001, 2]). Leave-one-out cross-validation, due to the small sample size. One-vs-one multi-class classification implemented with LIBSVM.
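A minimal sketch of two steps of this design: scaling to [-1, 1] and leave-one-out evaluation. It assumes the scaling is a per-gene linear rescaling (the slide does not say), and it uses a nearest-centroid classifier purely as a stand-in so the example runs without LIBSVM. All names and data are illustrative.

```python
def scale_to_unit_range(X):
    """Linearly rescale every gene (column) to [-1, 1]; constant genes map to 0."""
    n = len(X[0])
    lows = [min(row[j] for row in X) for j in range(n)]
    highs = [max(row[j] for row in X) for j in range(n)]
    return [
        [2.0 * (row[j] - lows[j]) / (highs[j] - lows[j]) - 1.0
         if highs[j] > lows[j] else 0.0
         for j in range(n)]
        for row in X
    ]

def nearest_centroid(X_train, y_train, x):
    """Stand-in classifier (assumption): predict the class with the closest mean."""
    centroids = {}
    for c in set(y_train):
        rows = [r for r, label in zip(X_train, y_train) if label == c]
        centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

def leave_one_out_accuracy(X, y, train_and_predict):
    """Hold out each sample once, train on the rest, and count correct predictions."""
    hits = 0
    for k in range(len(X)):
        X_train, y_train = X[:k] + X[k + 1:], y[:k] + y[k + 1:]
        hits += train_and_predict(X_train, y_train, X[k]) == y[k]
    return hits / len(X)

# Toy data (made up): two well-separated classes.
X = [[0.0, 10.0], [1.0, 12.0], [9.0, 30.0], [10.0, 28.0]]
y = [0, 0, 1, 1]
X_scaled = scale_to_unit_range(X)
accuracy = leave_one_out_accuracy(X_scaled, y, nearest_centroid)
```

Leave-one-out uses every sample for testing exactly once, which is why it suits datasets with fewer than 100 samples.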
32
Result – MLL Dataset
33
Result – Lymphoma Dataset
34
Conclusions: SVM classification benefits from gene selection. Gene ranking with correlation coefficients gives higher accuracy than SVM-RFE in low dimensions in most data sets. The best-performing correlation score varies from problem to problem. Although SVM-RFE shows excellent performance in general, there is no clear winner: the performance of feature selection methods seems to be problem-dependent.
35
Conclusions: For a given classification model, different gene selection methods reach their best performance at different feature set sizes. Very high accuracy was achieved on all the data sets studied here; in many cases perfect accuracy (based on leave-one-out error) was achieved. The NCI60 dataset [17] shows lower accuracy values. This dataset has the largest number of classes (eight) and smaller sample sizes per class. SVM-RFE handles this case well, achieving 96.72% accuracy with 100 selected genes and a linear kernel. The gap in accuracy between SVM-RFE and the other gene ranking methods is highest for this dataset (ca. 11.5%).
36
Limitations & Future Work: Selecting features over the whole training set induces a bias in the results. We will study how to assess and correct this bias in future experiments. We will take into consideration the correlation between any pair of selected features: the ranking method will be modified so that correlations stay below a certain threshold. We will evaluate the top-ranked genes from our research against marker genes identified in other studies.
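One common way to avoid the selection bias mentioned above is to repeat the gene ranking inside every cross-validation fold, so the held-out sample never influences which genes are chosen. The sketch below illustrates that idea only; it is not the correction the authors propose. The ranking function (spread of class means) and the nearest-centroid classifier are hypothetical stand-ins, and the data is made up.

```python
def class_centroids(X, y):
    """Per-class mean vector."""
    centroids = {}
    for c in set(y):
        rows = [r for r, label in zip(X, y) if label == c]
        centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def rank_by_mean_spread(X, y):
    """Hypothetical ranking: genes ordered by the spread of their class means."""
    cents = list(class_centroids(X, y).values())
    spread = [max(c[j] for c in cents) - min(c[j] for c in cents)
              for j in range(len(X[0]))]
    return sorted(range(len(spread)), key=lambda j: spread[j], reverse=True)

def nearest_centroid(X_train, y_train, x):
    """Stand-in classifier: predict the class with the closest mean."""
    cents = class_centroids(X_train, y_train)
    return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, cents[c])))

def loo_with_inner_selection(X, y, rank_features, n_top, train_and_predict):
    """Leave-one-out accuracy with gene selection redone on each training fold."""
    hits = 0
    for k in range(len(X)):
        X_train, y_train = X[:k] + X[k + 1:], y[:k] + y[k + 1:]
        # Rank on the training fold only; the held-out sample is not seen here.
        top = rank_features(X_train, y_train)[:n_top]
        reduced_train = [[row[j] for j in top] for row in X_train]
        reduced_test = [X[k][j] for j in top]
        hits += train_and_predict(reduced_train, y_train, reduced_test) == y[k]
    return hits / len(X)

X = [[0.0, 0.3], [0.2, -0.4], [1.0, 0.4], [0.8, -0.3]]
y = [0, 0, 1, 1]
accuracy = loo_with_inner_selection(X, y, rank_by_mean_spread, 1, nearest_centroid)
```

The cost is that the ranking is recomputed once per fold, and different folds may select different genes, so a single final gene list has to be produced separately.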