New EDA-approaches to feature selection for classification (of biological sequences) Yvan Saeys

Outline
- Feature selection in the data mining process
- Need for dimensionality reduction techniques in biology
- Feature selection techniques
- EDA-based wrapper approaches
  - Constrained EDA approach
  - EDA-ranking, EDA-weighting
- Application to biological sequence classification
- Why am I here?

Feature selection in the data mining process
pre-processing -> feature extraction -> feature selection -> model induction/classification -> post-processing

Need for dimensionality reduction techniques in biology
- Many biological processes are far from being completely understood
- In order not to miss relevant information:
  - take as many features as possible into account
  - use dimension reduction techniques to identify the relevant feature subspaces
- Additional difficulty: many feature dependencies

Dimension reduction techniques
- Feature selection
- Projection
- Feature ranking
- Feature weighting
- Compression ...
Projection: everything that ends in "component analysis".
Projection and compression transform the original features; feature selection techniques select a subset of the original features.

Benefits of feature selection
- Attain good or even better classification performance using a small subset of features
- Provide more cost-effective classifiers:
  - fewer features to take into account -> faster classifiers
  - fewer features to store -> smaller datasets
- Gain more insight into the processes that generated the data

Feature selection: another layer of complexity
- Bias-variance tradeoff of a classifier
- Model selection: find the best classifier with the best parameters for the best feature subset
- For every feature subset: model selection
- An extra dimension in the search process

Feature selection strategies
- Filter approach
- Wrapper approach
- Embedded approach
- Feature selection based on signal processing techniques
(Diagram: the filter applies FS before the classification model, the wrapper wraps an FS search method around the classification model, and the embedded approach performs FS inside the classification model's parameters.)

Filter approach
- Independent of the classification model
- Uses only the dataset of annotated examples
- A relevance measure is calculated for each feature, e.g.:
  - feature-class entropy
  - Kullback-Leibler divergence (cross-entropy)
  - information gain, gain ratio
- Normalize relevance scores -> weights
- Fast, but discards feature dependencies
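A minimal sketch of such a filter (an illustration, not code from the talk): score every feature independently with information gain and normalize the scores to weights. The helper names `entropy`, `information_gain` and `filter_weights` are my own; X is assumed to be a matrix of discrete features and y the class labels.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a label vector."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """Information gain of one discrete feature with respect to the class."""
    base = entropy(labels)
    cond = 0.0
    for v in np.unique(feature):
        mask = feature == v
        cond += mask.mean() * entropy(labels[mask])
    return base - cond

def filter_weights(X, y):
    """Score every feature independently and normalize the scores to weights."""
    scores = np.array([information_gain(X[:, j], y) for j in range(X.shape[1])])
    return scores / scores.sum() if scores.sum() > 0 else scores

# Example usage: X is an (n_samples, n_features) array of discrete features.
# ranking = np.argsort(filter_weights(X, y))[::-1]
```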

Wrapper approach
- Specific to a classification algorithm
- The search for a good feature subset is guided by a search algorithm (e.g. greedy forward or backward search)
- The algorithm uses the evaluation of the classifier as a guide to find good feature subsets
- Examples: sequential forward or backward search, simulated annealing, stochastic iterative sampling (e.g. GA, EDA)
- Computationally intensive, but able to take feature dependencies into account
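As an illustration of the wrapper idea (again not taken from the talk), a sequential forward search wrapped around a Naive Bayes classifier, using scikit-learn cross-validation as the evaluation function; `sequential_forward_selection` and the stopping rule are assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

def sequential_forward_selection(X, y, max_features=10, cv=5):
    """Greedy wrapper: repeatedly add the feature whose inclusion gives the
    best cross-validated accuracy of the wrapped classifier."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        candidates = []
        for j in remaining:
            cols = selected + [j]
            score = cross_val_score(BernoulliNB(), X[:, cols], y, cv=cv).mean()
            candidates.append((score, j))
        score, j = max(candidates)
        if score <= best_score:        # no candidate improves the subset: stop
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected, best_score
```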

Embedded approach
- Specific to a classification algorithm
- Model parameters are directly used to discard features
- Examples:
  - reduced error pruning in decision trees
  - feature elimination using the weight vector of a linear discriminant function
- Usually needs only a few additional calculations
- Able to take feature dependencies into account
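A sketch of the second example, feature elimination driven by the weight vector of a linear discriminant (here a linear SVM from scikit-learn, which is an assumption; the recursive schedule with `step` and `n_keep` is also illustrative).

```python
import numpy as np
from sklearn.svm import LinearSVC

def eliminate_by_weight(X, y, n_keep=50, step=10):
    """Embedded-style elimination: fit a linear classifier and repeatedly drop
    the features with the smallest absolute weights."""
    active = np.arange(X.shape[1])
    while active.size > n_keep:
        clf = LinearSVC(dual=False).fit(X[:, active], y)
        importance = np.abs(clf.coef_).sum(axis=0)        # |w_j| per remaining feature
        n_drop = min(step, active.size - n_keep)
        active = active[np.argsort(importance)[n_drop:]]  # keep the highest-weight features
    return active
```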

EDA-based wrapper approaches

EDA-based wrapper approaches
Observations for (biological) datasets with many features:
- Many feature subsets result in the same classification performance
- Many features are irrelevant
- The search process spends most of its time in subsets containing approximately half of the features

EDA-based wrapper approaches
- Only a small fraction of the features is relevant
- Evaluation of a classification model is faster when only a small number of features is present
- Constrained Estimation of Distribution Algorithm (CDA), sketched below:
  - determine an upper bound U for the maximally allowed number of features in every individual (sample)
  - apply a filter to the generated (sampled) individuals: allow at most U features in the subset
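The slides do not specify which EDA model is used; the sketch below assumes a simple univariate (UMDA-style) model and shows where the constraint step fits. `constrained_eda`, `evaluate` and the parameter defaults are illustrative, not the talk's implementation.

```python
import numpy as np

def constrained_eda(evaluate, n_features, U, pop_size=100, n_select=50,
                    n_generations=30, rng=None):
    """Sketch of a constrained, UMDA-style EDA for feature subset selection.

    evaluate(mask) must return the (cross-validated) score of the classifier
    trained on the features where mask == 1. The constraint step enforces the
    upper bound U on the number of features in every sampled individual.
    """
    rng = np.random.default_rng(rng)
    p = np.full(n_features, 0.5)                           # initial distribution P(f_i) = 0.5
    for _ in range(n_generations):
        pop = (rng.random((pop_size, n_features)) < p).astype(int)
        # Constraint: allow at most U features per individual by switching off
        # randomly chosen surplus features.
        for ind in pop:
            on = np.flatnonzero(ind)
            if on.size > U:
                ind[rng.choice(on, size=on.size - U, replace=False)] = 0
        scores = np.array([evaluate(ind) for ind in pop])
        best = pop[np.argsort(scores)[::-1][:n_select]]    # truncation selection
        p = best.mean(axis=0)                              # re-estimate univariate marginals
    return p                                               # final distribution over features
```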

EDA-based wrapper approaches
CDA advantages:
- Huge reduction of the search space. Example with 400 features:
  - full search space: 2^400 feature subsets
  - U = 100: about 3.3 x 10^96 feature subsets
  - a reduction by 23 orders of magnitude
- Faster evaluation of a classification model
- Scalable to datasets containing a very large number of features
- Scalable to more complex classification models (e.g. an SVM with a higher-order polynomial kernel)
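These counts are easy to verify (a quick check, not part of the original slides): all subsets of 400 features versus the subsets containing at most 100 of them.

```python
import math

n, U = 400, 100
full = 2 ** n                                              # all subsets of 400 features
constrained = sum(math.comb(n, k) for k in range(U + 1))   # subsets with at most U features

print(f"full space:        {full:.2e}")                # ~ 2.6e120
print(f"constrained space: {constrained:.2e}")         # ~ 3.3e96
print(f"reduction factor:  {full / constrained:.1e}")  # ~ 8e23, i.e. 23 orders of magnitude
```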

CDA: example

Classifier  Search  # Features  # Evaluations  Average # features  Balanced time  Unbalanced time
NBM         SBE     150         68875          294.40              0 h 34 m       1 h 58 m
NBM         SBE      80         76960          275.98              0 h 36 m       2 h 09 m
NBM         SBE      40         79380          269.48              0 h 37 m       2 h 11 m
NBM         CDA     150         67100          150                 0 h 20 m       0 h 46 m
NBM         CDA      80         67100           80                 0 h 09 m       0 h 21 m
NBM         CDA      40         67100           40                 0 h 05 m       0 h 11 m
LSVM        SBE     150         68875          294.40              2 h 15 m       2 h 38 m
LSVM        SBE      80         76960          275.98              2 h 19 m       2 h 52 m
LSVM        SBE      40         79380          269.48              2 h 20 m       2 h 54 m
LSVM        CDA     150         67100          150                 0 h 38 m       0 h 59 m
LSVM        CDA      80         67100           80                 0 h 17 m       0 h 27 m
LSVM        CDA      40         67100           40                 0 h 14 m       0 h 19 m
PSVM        SBE     150         13875          296.26              9 h 11 m       62 h 02 m
PSVM        SBE      80         15520          277.68              9 h 42 m       63 h 24 m
PSVM        SBE      40         16020          271.03              9 h 48 m       63 h 40 m
PSVM        CDA     150         13510          150                 4 h 54 m       16 h 48 m
PSVM        CDA      80         13510           80                 2 h 48 m        9 h 38 m
PSVM        CDA      40         13510           40                 1 h 52 m        6 h 16 m

EDA-based feature ranking
- Traditional approach to FS: only use the best individual found during the search -> "optimal" feature subset
- Many questions remain unanswered:
  - the single best subset provides only a static view of the whole elimination process
  - how many features can still be eliminated before classification performance drops drastically?
  - which features can still be eliminated?
  - can we get a more dynamic analysis?

Feature ranking

EDA-based feature ranking/weighting
- Don't use the single best individual
- Use the whole distribution to assess feature weights (see the sketch below)
- Use the weights to rank the features
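Continuing the EDA sketch from above (again an illustration, not the talk's code): the final univariate distribution can be read directly as feature weights, and a ranking or a selection follows from it. `rank_and_select` is a hypothetical helper.

```python
import numpy as np

def rank_and_select(p, n_select=None, threshold=None):
    """Turn the EDA marginals p[i] (estimated frequency of feature i in good
    subsets) into a ranking and, optionally, a selected subset."""
    ranking = np.argsort(p)[::-1]                    # highest-weighted features first
    if n_select is not None:
        return ranking, ranking[:n_select]
    if threshold is not None:
        return ranking, np.flatnonzero(p >= threshold)
    return ranking, None

# Example usage:
# p = constrained_eda(evaluate, n_features=400, U=100)
# ranking, subset = rank_and_select(p, n_select=50)
```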

EDA-based feature weighting
- Can be used for:
  - feature weighting
  - feature ranking
  - feature selection
- Problem: how "convergent" should the final population be?
  - not enough convergence: no good feature subsets found yet (stopped too early)
  - too much convergence: in the limit, all individuals are the same
- Solution: convergent enough, but not too convergent

How to quantify "enough but not too convergent"?
Define the scaled Hamming distance between two individuals A and B (bit strings over N features) as

    HDS(A, B) = HD(A, B) / N

Convergence of a distribution: the average scaled Hamming distance between all pairs of individuals.
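A direct translation of this measure into code (illustrative; the function names are my own):

```python
import numpy as np
from itertools import combinations

def scaled_hamming(a, b):
    """HDS(A, B) = HD(A, B) / N for two binary individuals of length N."""
    return np.sum(a != b) / a.size

def convergence(population):
    """Average scaled Hamming distance over all pairs of individuals:
    about 0.5 for a random initial population with P(f_i) = 0.5,
    and 0 when all individuals are identical."""
    pairs = combinations(range(len(population)), 2)
    return float(np.mean([scaled_hamming(population[i], population[j])
                          for i, j in pairs]))
```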

Application to gene prediction
(Diagram of eukaryotic gene structure and expression: promoter region with enhancer and core promoter, transcription start site, 5' end, exons and introns, start codon, stop codon, poly-A tail; DNA is transcribed into pre-mRNA, splicing removes the introns to yield mRNA, which is translated into protein.)

Splice site prediction
(Diagram: a pre-mRNA with exons Ex1-Ex4 and introns I1-I3; splicing removes the introns before translation into protein. Each intron starts with GT, the donor site, and ends with AG, the acceptor site.)

Splice site prediction: features
- Position-dependent features, e.g. an A at position 1, a C at position 17, ...
- Position-independent features, e.g. the subsequence "TCG" occurs, "GAG" occurs, ...
Example local context around a candidate donor site: atcgatcagtatcgat GT ctgagctatgag
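A sketch of how such an encoding could look (the encoding details and the helper `encode` are assumptions, not the talk's code):

```python
from itertools import product

ALPHABET = "ACGT"

def encode(sequence, k=3):
    """Encode a fixed-length local context into binary features:
    position-dependent (nucleotide X at position i) and
    position-independent (k-mer X occurs somewhere in the sequence)."""
    seq = sequence.upper()
    # Position-dependent: one binary feature per (position, nucleotide) pair.
    pos_dep = [int(seq[i] == nt) for i in range(len(seq)) for nt in ALPHABET]
    # Position-independent: one binary feature per possible k-mer.
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    pos_indep = [int(km in seq) for km in kmers]
    return pos_dep + pos_indep

# A local context of 100 nucleotides gives 100 * 4 = 400 position-dependent
# binary features, matching the 400-feature setting on the following slides.
```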

Acceptor prediction
- Dataset:
  - local context of 100 nucleotides [50, 50]
  - 3000 positives, 18000 negatives
  - 100 4-valued features -> 400 binary features
- Classifiers:
  - Naive Bayes method (NBM)
  - C4.5
  - linear SVM

A trial on acceptor prediction
- 400 binary features (position-dependent nucleotides)
- Initial distribution: P(f_i) = 0.5
- C(D_0) ~ 0.5 (each pair of individuals has on average half of the features in common)
- C(D) = 0 when fully converged (all individuals are the same)

Evolution of convergence

Evaluation of convergence rate

Evaluation of convergence rate

EDA-based feature ranking
- Best results are obtained with a "semi-converged" population
- Not looking for the best subset anymore, but for the best distribution
- Advantages:
  - needs fewer iterations
  - gives a dynamic view of the feature selection process

EDA-based feature weighting
Color-coding feature weights to visualize new patterns: a color-coded mapping of the interval [0, 1] from "cold" over "middle" to "hot".
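The exact palette is not given on the slides; one simple cold-to-hot mapping (purely illustrative) could look like this:

```python
def weight_to_rgb(w):
    """Map a feature weight in [0, 1] to a cold (blue) - middle (white) - hot (red)
    RGB triple. The actual palette used on the slides is not specified."""
    w = min(max(w, 0.0), 1.0)
    if w < 0.5:                        # cold half: blue -> white
        t = w / 0.5
        return (t, t, 1.0)
    t = (w - 0.5) / 0.5                # hot half: white -> red
    return (1.0, 1.0 - t, 1.0 - t)
```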

Donor prediction: 400 features
(Color-coded weight map; visible patterns: a T-rich region?, the local context around the splice site, a G-rich region?, and a 3-base periodicity.)

Donor prediction: 528 features

Donor prediction: 2096 features

Acceptor prediction: 400 features
(Color-coded weight map; visible patterns: the local context around the splice site, a T-rich region (poly-pyrimidine stretch), and a 3-base periodicity.)

Acceptor prediction: 528 features

Acceptor prediction: 2096 features
(Color-coded weight map; visible patterns: AG-scanning and TG.)

Comparison with NBM

Related & future work
- Embedded feature selection in SVMs with C-retraining
- Feature selection tree: a combination of filter feature selection and a decision tree
- Combining Bayesian decision trees and feature selection
- Combinatorial pattern matching in biological sequences
- Feature Selection Toolkit for large-scale applications (FeaST)

Why am I here?
- Establish collaboration between our research groups:
  - getting to know each other
  - think about future collaborations
  - define collaborative research projects
- Exchange thoughts / learn more about:
  - EDA methods
  - probabilistic graphical models for classification
  - biological problems
- Some "test cases" during these months: apply some of "your" techniques to "our" data ...

Thank you!!
Yvan Saeys, Donostia 2004