Correlation-Aware Feature Selection
Annalisa Barla, Cesare Furlanello, Giuseppe Jurman, Stefano Merler, Silvano Paoli
Berlin, 8/10/2005
Overview
- On Feature Selection
- Correlation-Aware Ranking
- Synthetic Example
Feature Selection
Step-wise variable selection: n* < N effective variables modeling the classification function.
[Diagram: N features, N steps, from Step 1 to Step N; one feature vs. N features.]
Feature Selection
Step-wise selection of the features.
[Diagram: at each step the features are split into ranked features and discarded features.]
Ranking
- Classifier-independent filters (ignoring labelling). Prefiltering is risky: you might discard features that turn out to be important.
- Ranking induced by a classifier.
Support Vector Machines
Classification function: f(x) = sign(<w, x> + b), where the weight vector w = Σ_i α_i y_i x_i is the normal to the Optimal Separating Hyperplane.
The classification/ranking machine
The RFE idea: given N features (genes),
1. Train an SVM.
2. Compute a cost function J from the weight coefficients of the SVM.
3. Rank features in terms of their contribution to J.
4. Discard the feature contributing least to J.
5. Reapply the procedure on the remaining N-1 features.
This is called Recursive Feature Elimination (RFE): features are ranked according to their contribution to the classification, given the training data. Time and data consuming, and at risk of selection bias (Guyon et al. 2002).
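A minimal sketch of this one-feature-at-a-time loop with a linear SVM, assuming the e1071 R package; the function name svm_rfe and the argument n_keep are illustrative, not the actual implementation.

library(e1071)

svm_rfe <- function(X, y, n_keep = 1) {
  surviving <- seq_len(ncol(X))
  discarded <- integer(0)
  while (length(surviving) > n_keep) {
    model <- svm(X[, surviving, drop = FALSE], as.factor(y),
                 kernel = "linear", scale = FALSE)
    w <- t(model$coefs) %*% model$SV        # hyperplane weights on the surviving features
    J <- as.vector(w^2)                     # contribution of each feature to the cost
    worst <- which.min(J)                   # feature contributing least to J
    discarded <- c(surviving[worst], discarded)
    surviving <- surviving[-worst]
  }
  c(surviving, discarded)                   # column indices, ranked best to worst
}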
RFE-based Methods
Discarding chunks of features at a time:
- Parametric: Sqrt(N)-RFE, Bisection-RFE
- Non-parametric: E-RFE (adapting to the weight distribution), thresholding weights at a value w*
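A hedged sketch of the chunk-elimination step shared by these variants: every feature whose weight falls below w* is dropped in a single step. In E-RFE the threshold is derived adaptively from the weight distribution; the simple quantile rule below is only a stand-in for that choice.

erfe_step <- function(weights, q = 0.5) {
  w_star <- quantile(weights, probs = q)    # stand-in for the adaptive threshold w*
  list(kept      = which(weights >= w_star),
       discarded = which(weights <  w_star),
       w_star    = w_star)
}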
Variable Elimination
Given a family F = {x_1, x_2, ..., x_H} of correlated genes such that, for a given threshold w*:
- w(x_1) ~ w(x_2) ~ ... ~ ε < w*: each single weight is negligible
- BUT w(x_1) + w(x_2) + ... >> w*: the cumulative weight of the family is not
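A small illustration of why this matters, again assuming e1071: when an informative feature is duplicated, the linear SVM spreads the weight equally over the copies, so each copy looks individually negligible even though the family as a whole carries the signal. The setup below (10 copies, 20 noise features) is illustrative.

library(e1071)
set.seed(1)
n <- 100; k <- 10
signal <- c(rnorm(n/2, mean = 1), rnorm(n/2, mean = -1))
y <- factor(rep(c(1, 2), each = n/2))
X <- cbind(matrix(rep(signal, k), ncol = k),          # k identical copies of the signal
           matrix(runif(n * 20, -4, 4), ncol = 20))   # 20 noise features
model <- svm(X, y, kernel = "linear", scale = FALSE)
w <- abs(as.vector(t(model$coefs) %*% model$SV))
w[1:k]          # per-copy weight: the same reduced share for every copy
sum(w[1:k])     # cumulative weight of the correlated family
max(w[-(1:k)])  # compare with the largest noise weight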
Correlated Genes (1)
Correlated Genes (2)
Synthetic Data
Binary problem, 100 (50+50) samples of 1000 genes:
- genes 1-50: randomly extracted from N(1,1) for class 1 and N(-1,1) for class 2
- genes 51-100: one gene extracted from N(1,1) / N(-1,1) as above, repeated 50 times
- genes 101-1000: extracted from Unif(-4,4)
The significant features are genes 1-100: 50 independent informative genes plus 50 identical copies of one informative gene.
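A sketch of how such a dataset can be generated in R (variable names and seed are illustrative):

set.seed(42)
n_per_class <- 50
y <- factor(rep(c(1, 2), each = n_per_class))

# Genes 1-50: independent informative genes, N(1,1) in class 1 and N(-1,1) in class 2.
informative <- rbind(matrix(rnorm(n_per_class * 50, mean =  1), nrow = n_per_class),
                     matrix(rnorm(n_per_class * 50, mean = -1), nrow = n_per_class))

# Genes 51-100: one informative gene repeated 50 times (a perfectly correlated family).
repeated_gene <- c(rnorm(n_per_class, mean = 1), rnorm(n_per_class, mean = -1))
repeated <- matrix(rep(repeated_gene, 50), ncol = 50)

# Genes 101-1000: uninformative noise from Unif(-4, 4).
noise <- matrix(runif(2 * n_per_class * 900, min = -4, max = 4), ncol = 900)

X <- cbind(informative, repeated, noise)    # 100 samples x 1000 genes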
Our Algorithm (step j)
Methodology
Implemented within the BioDCV system (50 replicates); realized through R and C code interaction.
Synthetic Data
Gene 100 is consistently ranked 2nd.
[Plot: ranking of gene 100 across the elimination steps.]
Work in Progress
- Preservation of highly correlated genes with low initial weights on microarray datasets
- Robust correlation measures
- Different techniques to detect the F_l families (clustering, gene functions)
Synthetic Data
[Table: for each step, the number of surviving features among genes 1-50, genes 51-100, and the remaining genes, together with the SAVED features.]
Synthetic Data
Features discarded at step 9 by the E-RFE procedure.
Correlation correction: saves feature 100.
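A hedged sketch of such a correlation correction, applied to the chunk of features about to be discarded: families of mutually correlated features whose cumulative weight exceeds w* are moved back to the kept set. The family detection (single-linkage clustering on 1 - |correlation|), the cutoff rho_min and the saving rule are illustrative assumptions, not the exact BioDCV rule.

correlation_correction <- function(X, weights, discarded, w_star, rho_min = 0.9) {
  if (length(discarded) < 2) return(integer(0))
  rho <- abs(cor(X[, discarded, drop = FALSE]))
  # Group the candidate features into correlated families.
  fam <- cutree(hclust(as.dist(1 - rho), method = "single"), h = 1 - rho_min)
  saved <- integer(0)
  for (f in unique(fam)) {
    members <- discarded[fam == f]
    if (length(members) > 1 && sum(weights[members]) > w_star)
      saved <- c(saved, members)   # cumulative weight is not negligible: keep the family
  }
  saved
}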
Infrastructure
- MPACluster: available for batch jobs
- Connecting with IFOM: 2005
- Running at IFOM: 2005/2006
- Production on GRID resources (spring 2005)

Challenges for predictive profiling (Algorithms II)
1. Gene list fusion: suite of algebraic/statistical methods
2. Prediction over multi-platform gene expression datasets (sarcoma, breast cancer): large-scale semi-supervised analysis
3. New SVM kernels for prediction on spectrometry data within complete validation
Prefiltering is risky: you might discard features that turn out to be important. Nevertheless, wrapper methods are quite costly. Moreover, in gene expression data you also have to deal with particular situations, like clones or highly correlated features, that may be a pitfall for several selection methods.
A classic alternative is to map the data into linear combinations of features, and then select:
- Principal Component Analysis
- Metagenes (a simplified model for pathways: but biological suggestions require caution)
But we are no longer working with the original features.
[Image: eigen-craters for unexploded bomb risk maps]
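A minimal example of the "combine, then select" alternative, using PCA in base R (names and the number of retained components are illustrative):

# Map the genes onto principal components and keep the leading ones;
# the retained directions are linear combinations, not original genes.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
n_components <- 10
Z <- pca$x[, seq_len(n_components)]   # samples projected on the first components
# Z can be fed to a classifier, but interpretability in terms of single genes is lost.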
A few issues in feature selection, with a particular interest in classification of genomic data.
WHY?
- To ease the computational burden: discard the (apparently) less significant features and train in a simplified space, alleviating the curse of dimensionality.
- To enhance information: highlight (and rank) the most important features and improve the knowledge of the underlying process.
HOW?
- As a pre-processing step: employ a statistical filter (t-test, S2N).
- As a learning step: link the feature ranking to the classification task (wrapper methods, ...).
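A small example of the pre-processing route, assuming an expression matrix X (samples x genes) and binary labels y: one common form of the signal-to-noise (S2N) score per gene, with illustrative names.

# S2N filter: |mu1 - mu2| / (sd1 + sd2) for each gene, computed without any classifier.
s2n_scores <- function(X, y) {
  y <- factor(y)
  g1 <- X[y == levels(y)[1], , drop = FALSE]
  g2 <- X[y == levels(y)[2], , drop = FALSE]
  abs(colMeans(g1) - colMeans(g2)) / (apply(g1, 2, sd) + apply(g2, 2, sd))
}
# Rank genes by decreasing score: order(s2n_scores(X, y), decreasing = TRUE)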
Feature Selection within Complete Validation Experimental Setups
Complete validation is needed to decouple model tuning from (ensemble) model accuracy estimation: otherwise selection bias effects arise.
Accumulating relative importance from Random Forest models for the identification of sensory drivers (with P. Granitto, IASMA).
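A hedged sketch of what complete validation means operationally: the feature ranking is recomputed inside every resampling replicate, so no information from the held-out samples leaks into the selection. It reuses the illustrative svm_rfe helper sketched earlier; the split scheme and panel size are assumptions.

# Selecting features on the whole dataset first and cross-validating only the
# classifier afterwards would give optimistically biased (selection bias) estimates.
library(e1071)
n_replicates <- 50
errors <- numeric(n_replicates)
for (r in seq_len(n_replicates)) {
  train_idx <- sample(seq_len(nrow(X)), size = round(0.75 * nrow(X)))
  ranking <- svm_rfe(X[train_idx, ], y[train_idx])   # selection on training data only
  top <- ranking[1:20]                               # illustrative panel size
  model <- svm(X[train_idx, top], y[train_idx], kernel = "linear")
  pred <- predict(model, X[-train_idx, top])
  errors[r] <- mean(pred != y[-train_idx])
}
mean(errors)   # replicated estimate of the prediction error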