Lecture 2: Introduction to Feature Selection Isabelle Guyon


Notations and Examples

Feature Selection: thousands to millions of low-level features; select the most relevant ones to build better, faster, and easier-to-understand learning machines. [Figure: data matrix X with m samples and n features, reduced to n' selected features.]

Leukemia Diagnosis. Golub et al., Science, Vol. 286, 15 Oct. 1999. [Figure: gene expression matrix with m samples and n' selected genes, with class labels y_i = ±1, i=1:m.]

RFE SVM, Guyon-Weston, US patent 7,117,188. Application to prostate cancer, Elisseeff-Weston, 2001. [Figure: prostate cancer genes found by RFE, including U29589, HOXC8, and RACH1, separating the BPH, G3, and G4 tissue classes.]

RFE SVM for cancer diagnosis: differentiation of 14 tumor types. Ramaswamy et al., PNAS, 2001.

QSAR: Drug Screening. Binding to thrombin (DuPont Pharmaceuticals): compounds were tested for their ability to bind to a target site on thrombin, a key receptor in blood clotting; 192 are "active" (bind well), the rest "inactive". The training set (1909 compounds) is more depleted in active compounds. Each compound is described by 139,351 binary features encoding three-dimensional properties of the molecule. Weston et al, Bioinformatics, 2002. [Figure: results as a function of the number of features selected.]

Text Filtering. Bekkerman et al, JMLR, 2003.
Corpora: Reuters (news wire, 114 semantic categories); 20 Newsgroups (newsgroup articles, 20 categories); WebKB (8282 web pages, 7 categories). Bag-of-words representation with very large feature sets.
Top 3 words of some categories:
Alt.atheism: atheism, atheists, morality
Comp.graphics: image, jpeg, graphics
Sci.space: space, nasa, orbit
Soc.religion.christian: god, church, sin
Talk.politics.mideast: israel, armenian, turkish
Talk.religion.misc: jesus, god, jehovah

Face Recognition: male/female classification. 1450 images (1000 train, 450 test), 5100 features (60x85-pixel images). Comparison of the Relief and Simba feature-weighting algorithms. Navot-Bachrach-Tishby, ICML 2004.

Nomenclature:
Univariate method: considers one variable (feature) at a time.
Multivariate method: considers subsets of variables (features) together.
Filter method: ranks features or feature subsets independently of the predictor (classifier).
Wrapper method: uses a classifier to assess features or feature subsets.

Univariate Filter Methods

Individual Feature Irrelevance. P(X_i, Y) = P(X_i) P(Y), i.e. P(X_i | Y) = P(X_i), i.e. P(X_i | Y=1) = P(X_i | Y=-1). [Figure: overlapping class-conditional densities of x_i for Y=1 and Y=-1.]
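One hedged way to probe this irrelevance condition on data (not part of the original slides) is a per-feature two-sample test comparing the class-conditional distributions; the Kolmogorov-Smirnov test below is an illustrative choice, and X, y are assumed to be a numeric feature matrix and labels in {-1, +1}.

```python
# Minimal sketch: test H0 "P(X_i | Y=1) = P(X_i | Y=-1)" for each feature i.
# A large p-value is consistent with the feature being individually irrelevant.
import numpy as np
from scipy.stats import ks_2samp

def irrelevance_pvalues(X, y):
    return np.array([
        ks_2samp(X[y == 1, i], X[y == -1, i]).pvalue
        for i in range(X.shape[1])
    ])
```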

Individual Feature Relevance. [Figure: class-conditional densities of x_i with means μ-, μ+ and widths σ-, σ+; ROC curve plotting sensitivity against 1 - specificity; the area under the ROC curve (AUC) measures the relevance of the feature.]
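As a small illustration of AUC-based ranking (my own sketch, not code from the lecture): score each feature by how far its single-feature AUC departs from 0.5, the value expected for an irrelevant feature.

```python
# Rank features by the AUC obtained when each feature alone is used as a score.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_ranking(X, y):
    aucs = np.array([roc_auc_score(y, X[:, j]) for j in range(X.shape[1])])
    relevance = np.abs(aucs - 0.5)        # 0.5 = chance level = irrelevant
    return np.argsort(-relevance), aucs   # most relevant features first
```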

S2N -- ++ -- ++ Golub et al, Science Vol 286:15 Oct m {-y i } {y i } S2N = |  + -  -|  + +  - S2N  R  x  y after “standardization” x  (x-  x )/  x

Univariate Dependence. Independence: P(X, Y) = P(X) P(Y). Measure of dependence, mutual information: MI(X, Y) = ∫ P(X,Y) log [ P(X,Y) / (P(X)P(Y)) ] dX dY = KL( P(X,Y) || P(X)P(Y) ).
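For a continuous feature and discrete labels, the integral above is usually estimated after discretization; the sketch below bins the feature and uses scikit-learn's mutual_info_score, which returns MI in nats (the quantile binning scheme is an arbitrary choice of mine).

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def discretized_mi(x, y, bins=10):
    # Bin x at its empirical quantiles, then compute MI(binned x, y) in nats.
    edges = np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1])
    return mutual_info_score(y, np.digitize(x, edges))
```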

Correlation and MI. [Figure: scatter plots of Y vs. X with marginal densities P(X) and P(Y); one distribution has MI = 1.65 nat, another has R = 0.02 yet MI = 1.03 nat, showing that mutual information captures dependence that the correlation coefficient misses.]

Gaussian Distribution. For jointly Gaussian X and Y with correlation coefficient R: MI(X, Y) = -(1/2) log(1 - R^2). [Figure: scatter plots of correlated Gaussian data with marginal densities P(X) and P(Y).]
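A quick numerical sanity check of this identity (purely illustrative): draw correlated Gaussian samples, plug the empirical correlation into -(1/2) log(1 - R^2), and compare with a generic k-NN mutual information estimator.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
xy = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=5000)
x, y = xy[:, 0], xy[:, 1]

R = np.corrcoef(x, y)[0, 1]
mi_gaussian = -0.5 * np.log(1.0 - R**2)                        # analytic formula (nats)
mi_estimated = mutual_info_regression(x.reshape(-1, 1), y)[0]  # generic estimate
print(mi_gaussian, mi_estimated)  # the two values should be close
```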

Other criteria (chap. 3).

T-test. Normally distributed classes, equal variance σ² unknown, estimated from the data as σ²_within. Null hypothesis H0: μ+ = μ-. T statistic: if H0 is true, t = (μ+ - μ-) / ( σ_within sqrt(1/m+ + 1/m-) ) follows a Student distribution with m+ + m- - 2 degrees of freedom. [Figure: class-conditional densities P(X_i | Y=1) and P(X_i | Y=-1) with means μ+, μ- and widths σ+, σ-.]
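The per-feature t-statistic can be computed with scipy under the same equal-variance assumption; X and y below are hypothetical names for the data matrix and ±1 labels.

```python
import numpy as np
from scipy.stats import ttest_ind

def ttest_ranking(X, y):
    # Equal-variance two-sample Student t-test applied column-wise.
    t, pval = ttest_ind(X[y == 1], X[y == -1], axis=0, equal_var=True)
    return np.argsort(pval), t, pval  # smallest p-values (most significant) first
```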

Statistical tests (chap. 2). H0: X and Y are independent. Relevance index ↔ test statistic. P-value ↔ false positive rate, FPR = n_fp / n_irr. Multiple testing problem: use the Bonferroni correction, pval ← n · pval. False discovery rate: FDR = n_fp / n_sc ≤ FPR · n / n_sc. Probe method: FPR ≈ n_sp / n_p. [Figure: null distribution of the relevance index r; the p-value is the tail area beyond the threshold r0.]
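A hedged sketch of these corrections, including the random-probe estimate of the FPR; the probe count, the t-test used as relevance index, and the selection threshold are all illustrative assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind

def probe_fpr(X, y, n_probes=1000, alpha=0.05, seed=0):
    # Append random probe features that are irrelevant by construction,
    # then estimate FPR as (number of selected probes) / (number of probes).
    rng = np.random.default_rng(seed)
    Xa = np.hstack([X, rng.normal(size=(X.shape[0], n_probes))])
    _, pval = ttest_ind(Xa[y == 1], Xa[y == -1], axis=0, equal_var=True)

    bonferroni = np.minimum(pval * Xa.shape[1], 1.0)  # Bonferroni: pval <- n * pval
    selected = pval < alpha
    fpr_estimate = selected[X.shape[1]:].mean()       # n_sp / n_p
    return fpr_estimate, bonferroni[:X.shape[1]]
```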

Multivariate Methods

Univariate selection may fail Guyon-Elisseeff, JMLR 2004; Springer 2006

Filters vs. Wrappers. Main goal: rank subsets of useful features. Filter: all features → filter → feature subset → predictor. Wrapper: all features → multiple feature subsets, each assessed with the predictor → chosen feature subset → predictor. Danger of over-fitting with intensive search!

Search Strategies (chap. 4):
– Forward selection or backward elimination (a minimal forward-selection sketch follows below).
– Beam search: keep the k best paths at each step.
– GSFS: generalized sequential forward selection. When (n-k) features are left, try all subsets of g features, i.e. C(n-k, g) trainings. More trainings at each step, but fewer steps.
– PTA(l,r): plus l, take away r. At each step, run SFS l times, then SBS r times.
– Floating search (SFFS and SBFS): one step of SFS (resp. SBS), then SBS (resp. SFS) as long as we find better subsets than those of the same size obtained so far. At any time, if a better subset of the same size was already found, switch abruptly.
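Below is a minimal sketch of plain sequential forward selection, the simplest of these strategies, wrapped around a cross-validated linear SVM; the scoring choice and names are my assumptions, not part of the original slides.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def forward_selection(X, y, n_select, cv=5):
    # Greedy SFS: at each step, add the feature whose inclusion yields the
    # best cross-validated score of the wrapped predictor.
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select and remaining:
        score_of = lambda j: cross_val_score(
            LinearSVC(), X[:, selected + [j]], y, cv=cv).mean()
        best_j = max(remaining, key=score_of)
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

The other strategies on the slide differ mainly in how add and remove steps (and group sizes) are combined around this same evaluation loop.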

Multivariate FS is complex: N features, 2^N possible feature subsets! (Kohavi-John, 1997)

Embedded methods. Recursive Feature Elimination (RFE) SVM. Guyon-Weston, US patent 7,117,188. [Flowchart: start with all features; train the SVM; eliminate the least useful feature(s); train again; check for performance degradation. No degradation: continue eliminating. Degradation: stop.]

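As an illustrative re-implementation of the RFE loop above (not the patented code; scikit-learn also provides sklearn.feature_selection.RFE), a linear SVM is retrained and the features with the smallest squared weights are dropped until the desired number remains:

```python
import numpy as np
from sklearn.svm import SVC

def rfe_linear_svm(X, y, n_keep=10, step=1):
    features = np.arange(X.shape[1])
    while len(features) > n_keep:
        clf = SVC(kernel="linear", C=1.0).fit(X[:, features], y)
        order = np.argsort(clf.coef_.ravel() ** 2)      # least useful features first
        n_drop = min(step, len(features) - n_keep)
        features = np.delete(features, order[:n_drop])  # eliminate and retrain
    return features
```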

Feature subset assessment. Split the data (M samples, N variables/features) into 3 sets: training (m1 samples), validation (m2), and test (m3).
1) For each feature subset, train the predictor on the training data.
2) Select the feature subset which performs best on the validation data. – Repeat and average if you want to reduce variance (cross-validation).
3) Test on the test data.
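A hedged sketch of this three-way protocol; the split proportions, the linear SVM, and the candidate subsets are placeholders of mine.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def assess_subsets(X, y, candidate_subsets, seed=0):
    # 1) Split into training (m1), validation (m2) and test (m3) sets.
    X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=seed)
    X_va, X_te, y_va, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=seed)

    # 2) Train one predictor per candidate subset; keep the best on validation.
    def val_score(subset):
        return SVC(kernel="linear").fit(X_tr[:, subset], y_tr).score(X_va[:, subset], y_va)
    best = max(candidate_subsets, key=val_score)

    # 3) Report performance of the chosen subset on the untouched test set.
    test_score = SVC(kernel="linear").fit(X_tr[:, best], y_tr).score(X_te[:, best], y_te)
    return best, test_score
```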

Complexity of Feature Selection. With high probability: Generalization_error ≤ Validation_error + ε(C/m2), where m2 is the number of validation examples, N the total number of features, and n the feature subset size. Try to keep C of the order of m2. [Figure: error as a function of the subset size n.]

Examples of FS algorithms (keep C = O(m2); keep C = O(m1)):
– Univariate, linear: T-test, AUC, feature ranking.
– Univariate, non-linear: mutual information feature ranking.
– Multivariate, linear: RFE with linear SVM or LDA.
– Multivariate, non-linear: nearest neighbors, neural nets, trees, SVM.

In practice…
– No method is universally better: there is a wide variety of types of variables, data distributions, learning machines, and objectives.
– Match the method complexity to the ratio M/N: univariate feature selection may work better than multivariate feature selection; non-linear classifiers are not always better.
– Feature selection is not always necessary to achieve good performance.
Lessons from the NIPS 2003 and WCCI 2006 challenges.

Feature Extraction: Foundations and Applications. I. Guyon et al., Eds., Springer, 2006. Book of the NIPS 2003 challenge.