Slide 1: New EDA approaches to feature selection for classification (of biological sequences)
Yvan Saeys, Donostia 2004
Slide 2: Outline
- Feature selection in the data mining process
- Need for dimensionality reduction techniques in biology
- Feature selection techniques
- EDA-based wrapper approaches
  - Constrained EDA approach
  - EDA-ranking, EDA-weighting
- Application to biological sequence classification
- Why am I here?
Slide 3: Feature selection in the data mining process
pre-processing -> feature extraction -> feature selection -> model induction/classification -> post-processing
Slide 4: Need for dimensionality reduction techniques in biology
- Many biological processes are far from completely understood
- In order not to miss relevant information:
  - take as many features as possible into account
  - use dimension reduction techniques to identify the relevant feature subspaces
- Additional difficulty: many feature dependencies
Slide 5: Dimension reduction techniques
- Feature selection
- Feature ranking
- Feature weighting
- Projection: everything that ends in “component analysis”
- Compression ...
Projection and compression transform the original features; feature selection techniques select a subset of the original features.
Slide 6: Benefits of feature selection
- Attain good, or even better, classification performance using a small subset of features
- Provide more cost-effective classifiers:
  - fewer features to take into account -> faster classifiers
  - fewer features to store -> smaller datasets
- Gain more insight into the processes that generated the data
Slide 7: Feature selection: another layer of complexity
- Bias-variance tradeoff of a classifier
- Model selection: find the best classifier with the best parameters for the best feature subset
- For every feature subset: model selection
- An extra dimension in the search process
Slide 8: Feature selection strategies
- Filter approach
- Wrapper approach
- Embedded approach
[Diagram: in the filter approach, FS precedes the classification model; in the wrapper approach, an FS search method is wrapped around the classification model; in the embedded approach, FS uses the classification model's parameters. Also labeled: feature selection based on signal processing techniques.]
Slide 9: Filter approach
- Independent of the classification model
- Uses only a dataset of annotated examples
- A relevance measure is calculated for each feature (see the sketch below), e.g.:
  - feature-class entropy
  - Kullback-Leibler divergence (cross-entropy)
  - information gain, gain ratio
- Normalize relevance scores -> weights
- Fast, but discards feature dependencies
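To make the filter idea concrete, here is a minimal sketch (mine, not from the talk) that scores discrete features by information gain and normalizes the scores into weights; it assumes the data sits in NumPy arrays of discrete values:

```python
import numpy as np

def entropy(y):
    """Shannon entropy of a discrete label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(x, y):
    """H(class) - H(class | feature) for one discrete feature column x."""
    h_y_given_x = 0.0
    for v in np.unique(x):
        mask = (x == v)
        h_y_given_x += mask.mean() * entropy(y[mask])
    return entropy(y) - h_y_given_x

def filter_weights(X, y):
    """Score every feature independently, then normalize scores to weights."""
    scores = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    return scores / scores.sum()
```

Because each feature is scored on its own, the method is fast but, as the slide notes, blind to feature dependencies.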
Slide 10: Wrapper approach
- Specific to a classification algorithm
- The search for a good feature subset is guided by a search algorithm (e.g. greedy forward or backward search)
- The search uses the evaluation of the classifier as a guide to find good feature subsets
- Examples: sequential forward or backward search, simulated annealing, stochastic iterative sampling (e.g. GA, EDA)
- Computationally intensive, but able to take feature dependencies into account
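A minimal sketch of the wrapper idea, using greedy forward search around a scikit-learn-style classifier (the cross-validation setup and stopping rule are my own illustration, not the talk's code):

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, clf, max_features):
    """Greedy forward search: repeatedly add the feature whose inclusion
    gives the best cross-validated accuracy, stopping when nothing helps."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        scores = {j: cross_val_score(clf, X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] <= best_score:
            break  # no candidate feature improves the subset any further
        best_score = scores[j_best]
        selected.append(j_best)
        remaining.remove(j_best)
    return selected, best_score
```

Every candidate subset requires training and evaluating the classifier, which is exactly why wrappers are computationally intensive.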
Slide 11: Embedded approach
- Specific to a classification algorithm
- Model parameters are used directly to discard features
- Examples:
  - reduced error pruning in decision trees
  - feature elimination using the weight vector of a linear discriminant function (sketched below)
- Usually needs only a few additional calculations
- Able to take feature dependencies into account
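A sketch of the second example: repeatedly dropping the feature with the smallest weight magnitude in a linear model. This is essentially recursive feature elimination; the choice of a linear SVM and the retraining loop are my illustration, not necessarily the variant used in the talk:

```python
import numpy as np
from sklearn.svm import LinearSVC

def weight_based_elimination(X, y, n_keep):
    """Train a linear model, drop the feature whose weight has the
    smallest absolute value, and repeat until n_keep features remain."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        clf = LinearSVC(dual=False).fit(X[:, active], y)
        w = np.abs(clf.coef_).sum(axis=0)  # aggregate weights over classes
        del active[int(np.argmin(w))]
    return active
```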
Slide 12: EDA-based wrapper approaches
Slide 13: EDA-based wrapper approaches
Observations for (biological) datasets with many features:
- Many feature subsets result in the same classification performance
- Many features are irrelevant
- The search process spends most of its time in subsets containing approximately half of the features (the number of subsets of size k peaks at k = n/2, so an unconstrained distribution over subsets concentrates there)
Slide 14: EDA-based wrapper approaches
- Only a small fraction of the features is relevant
- Evaluating a classification model is faster when only a small number of features is present
Constrained Estimation of Distribution Algorithm (CDA):
- Determine an upper bound U for the maximally allowed number of features in each individual (sample)
- Apply a filter to the generated (sampled) individuals: allow at most U features in the subset
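A minimal sketch of this constraint step on top of a univariate bitstring EDA (the repair strategy, randomly switching off surplus bits, is my assumption; the slide only says that sampled individuals are filtered to at most U features):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_constrained(p, n_individuals, U):
    """Sample bitstring individuals from per-feature marginals p,
    then enforce the CDA constraint of at most U selected features."""
    pop = (rng.random((n_individuals, len(p))) < p).astype(int)
    for ind in pop:
        on = np.flatnonzero(ind)
        if len(on) > U:
            # repair: randomly switch off surplus features
            off = rng.choice(on, size=len(on) - U, replace=False)
            ind[off] = 0
    return pop
```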
Slide 15: EDA-based wrapper approaches: CDA
Advantages:
- Huge reduction of the search space. Example, with 400 features (see the quick check after this slide):
  - full search space: 2^400 feature subsets
  - with U = 100: about 3.3E96 feature subsets, a reduction of 23 orders of magnitude
- Faster evaluation of the classification model
- Scalable to datasets containing a very large number of features
- Scalable to more complex classification models (e.g. an SVM with a higher-order polynomial kernel)
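These counts are easy to verify with Python's arbitrary-precision integers (a quick check I added, not part of the talk):

```python
from math import comb

full = 2 ** 400                                      # all subsets of 400 features
constrained = sum(comb(400, k) for k in range(101))  # subsets with at most U = 100 features

print(f"{float(full):.2e}")                # ~2.58e+120
print(f"{float(constrained):.2e}")         # ~3.3e+96, the figure on the slide
print(f"{float(full / constrained):.1e}")  # ~8e+23: roughly 23 orders of magnitude
```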
Slide 16: CDA: example

Classifier  Method  #Features  #Evaluations  Avg #Features  Time (balanced)  Time (unbalanced)
NBM         SBE        150        68875         294.40         0 h 34 m          1 h 58 m
NBM         SBE         80        76960         275.98         0 h 36 m          2 h 09 m
NBM         SBE         40        79380         269.48         0 h 37 m          2 h 11 m
NBM         CDA        150        67100         150            0 h 20 m          0 h 46 m
NBM         CDA         80        67100          80            0 h 09 m          0 h 21 m
NBM         CDA         40        67100          40            0 h 05 m          0 h 11 m
LSVM        SBE        150        68875         294.40         2 h 15 m          2 h 38 m
LSVM        SBE         80        76960         275.98         2 h 19 m          2 h 52 m
LSVM        SBE         40        79380         269.48         2 h 20 m          2 h 54 m
LSVM        CDA        150        67100         150            0 h 38 m          0 h 59 m
LSVM        CDA         80        67100          80            0 h 17 m          0 h 27 m
LSVM        CDA         40        67100          40            0 h 14 m          0 h 19 m
PSVM        SBE        150        13875         296.26         9 h 11 m         62 h 02 m
PSVM        SBE         80        15520         277.68         9 h 42 m         63 h 24 m
PSVM        SBE         40        16020         271.03         9 h 48 m         63 h 40 m
PSVM        CDA        150        13510         150            4 h 54 m         16 h 48 m
PSVM        CDA         80        13510          80            2 h 48 m          9 h 38 m
PSVM        CDA         40        13510          40            1 h 52 m          6 h 16 m

(NBM = naive Bayes method, LSVM = linear SVM, PSVM = polynomial-kernel SVM; SBE = sequential backward elimination; times are for the balanced and unbalanced datasets.)
Slide 17: EDA-based feature ranking
- Traditional approach to FS: use only the best individual found during the search -> the "optimal" feature subset
- Many questions remain unanswered:
  - the single best subset provides only a static view of the whole elimination process
  - how many features can still be eliminated before classification performance drops drastically?
  - which features can still be eliminated?
  - can we get a more dynamical analysis?
Slide 18: Feature ranking
Slide 19: EDA-based feature ranking/weighting
- Don't use only the single best individual
- Use the whole distribution to assess feature weights
- Use the weights to rank the features
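In a univariate (UMDA/PBIL-style) EDA the distribution is just the per-feature marginal probabilities, so a natural weight for feature i is the estimated P(f_i = 1) in the final population. A minimal sketch under that univariate assumption, which the talk does not spell out:

```python
import numpy as np

def eda_feature_weights(population):
    """Weights = per-feature marginal frequencies in the population,
    an array of bitstring individuals, shape (n_individuals, n_features)."""
    return np.asarray(population).mean(axis=0)

def rank_features(population):
    """Rank features from highest to lowest weight."""
    w = eda_feature_weights(population)
    return np.argsort(-w), w
```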
Slide 20: EDA-based feature weighting
- Can be used for feature weighting, feature ranking, and feature selection
- Problem: how "convergent" should the final population be?
  - not enough convergence: no good feature subsets have been found yet (stopped too early)
  - too much convergence: in the limit, all individuals are the same
- Solution: convergent enough, but not too convergent
Slide 21: How to quantify "enough but not too convergent"?
Define the scaled Hamming distance between two individuals A and B (bitstrings of length N) as

  HDS(A,B) = HD(A,B) / N

Convergence of a distribution: the average scaled Hamming distance between all pairs of individuals.
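A direct sketch of this measure (quadratic in the population size, which is fine for typical EDA populations; the NumPy vectorization is my own choice):

```python
import numpy as np

def scaled_hamming(a, b):
    """HDS(A,B) = HD(A,B) / N for two bitstrings of length N."""
    return np.mean(a != b)

def convergence(population):
    """Average scaled Hamming distance over all pairs of individuals.
    ~0.5 for a random initial population, 0 when fully converged."""
    pop = np.asarray(population)
    m = len(pop)
    dists = [scaled_hamming(pop[i], pop[j])
             for i in range(m) for j in range(i + 1, m)]
    return float(np.mean(dists))
```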
Slide 22: Application to gene prediction
[Diagram: gene structure and expression. A DNA region with enhancer, core promoter region, transcription start site, exons and introns, start codon and stop codon; transcription yields pre-mRNA, splicing (removal of introns) yields mRNA with a poly-A tail, and translation yields the protein.]
Slide 23: Splice site prediction
[Diagram: a gene with exons Ex1-Ex4 and introns I1-I3 goes through transcription, pre-mRNA splicing, and translation into a protein. Each intron starts with GT (the donor site) and ends with AG (the acceptor site).]
Slide 24: Splice site prediction: features
- Position-dependent features, e.g. an A at position 1, a C at position 17, ...
- Position-independent features, e.g. the subsequence "TCG" occurs, "GAG" occurs, ...
Example context: atcgatcagtatcgat GT ctgagctatgag
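A minimal sketch of extracting both feature types from a local context around a candidate site (the k-mer vocabulary and the dictionary encoding are my own illustration):

```python
from itertools import product

def position_dependent(seq):
    """One binary feature per (position, nucleotide) pair."""
    return {(i, nt): int(seq[i] == nt)
            for i in range(len(seq)) for nt in "ACGT"}

def position_independent(seq, k=3):
    """One binary feature per k-mer: does it occur anywhere in the context?"""
    kmers = {"".join(p) for p in product("ACGT", repeat=k)}
    return {kmer: int(kmer in seq) for kmer in kmers}

features = {**position_dependent("ATCGATCAGTATCGAT"),
            **position_independent("ATCGATCAGTATCGATGTCTGAGCTATGAG")}
```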
Slide 25: Acceptor prediction
- Dataset: 3,000 positives, 18,000 negatives
- Local context of 100 nucleotides ([50, 50]: 50 on each side of the site)
- 100 4-valued features -> 400 binary features (see the encoding sketch below)
- Classifiers: naive Bayes method, C4.5, linear SVM
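The 4-valued to binary expansion is simply a one-hot encoding of each nucleotide position; a minimal sketch:

```python
import numpy as np

def one_hot(context):
    """Encode a nucleotide context as 4 indicator bits per position,
    e.g. a 100-nucleotide context becomes 400 binary features."""
    alphabet = "ACGT"
    bits = np.zeros(4 * len(context), dtype=int)
    for i, nt in enumerate(context.upper()):
        bits[4 * i + alphabet.index(nt)] = 1
    return bits
```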
Slide 26: A trial on acceptor prediction
- 400 binary features (position-dependent nucleotides)
- Initial distribution: P(f_i) = 0.5, so C(D_0) ~ 0.5 (each pair of individuals has, on average, half of the features in common)
- Fully converged: C(D) = 0 (all individuals are the same)
Slide 27: Evolution of convergence
Slide 28: Evaluation of convergence rate
Slide 29: Evaluation of convergence rate
Slide 30: EDA-based feature ranking
- Best results are obtained with a "semi-converged" population
- We are no longer looking for the best subset, but for the best distribution
- Advantages: fewer iterations are needed, and we get a dynamical view of the feature selection process
Slide 31: EDA-based feature weighting
- Color coding feature weights to visualize new patterns: a color-coded mapping of the interval [0, 1] (cold -> middle -> hot)
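A simple way to reproduce such a cold-to-hot visualization (matplotlib's "coolwarm" colormap is my choice; the talk does not name one):

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_weights(weights, title="Feature weights"):
    """Show per-feature weights in [0, 1] as a cold-to-hot color strip."""
    fig, ax = plt.subplots(figsize=(10, 1.5))
    ax.imshow(np.asarray(weights)[np.newaxis, :], cmap="coolwarm",
              vmin=0.0, vmax=1.0, aspect="auto")
    ax.set_yticks([])
    ax.set_xlabel("feature index")
    ax.set_title(title)
    plt.show()
```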
Slide 32: Donor prediction: 400 features
[Weight map annotations: local context; a T-rich region?; a G-rich region?; 3-base periodicity]
Slide 33: Donor prediction: 528 features
Slide 34: Donor prediction: 2096 features
Slide 35: Acceptor prediction: 400 features
[Weight map annotations: local context; a T-rich region (the poly-pyrimidine stretch); 3-base periodicity]
Slide 36: Acceptor prediction: 528 features
Slide 37: Acceptor prediction: 2096 features
[Weight map annotations: AG-scanning; TG]
Slide 38: Comparison with NBM
Slide 39: Related & future work
- Embedded feature selection in SVMs with C-retraining
- Feature selection tree: a combination of filter feature selection and a decision tree
- Combining Bayesian decision trees and feature selection
- Combinatorial pattern matching in biological sequences
- A Feature Selection Toolkit for large-scale applications (FeaST)
Slide 40: Why am I here?
- Establish collaboration between our research groups:
  - getting to know each other
  - thinking about future collaborations
  - defining collaborative research projects
- Exchange thoughts and learn more about:
  - EDA methods
  - probabilistic graphical models for classification
  - biological problems
- Some 'test cases' during these months: apply some of 'your' techniques to 'our' data ...
Slide 41: Thank you!!