Feature selection methods from correlation to causality Isabelle Guyon NIPS 2008 workshop on kernel learning.


Acknowledgements and references
1) Feature Extraction, Foundations and Applications. I. Guyon, S. Gunn, et al. Springer.
2) Causal feature selection. I. Guyon, C. Aliferis, A. Elisseeff. To appear in “Computational Methods of Feature Selection”, Huan Liu and Hiroshi Motoda, Eds., Chapman and Hall/CRC Press.

Constantin Aliferis, Alexander Statnikov, André Elisseeff, Jean-Philippe Pellet, Gregory F. Cooper, Peter Spirtes

Introduction

Feature Selection. Thousands to millions of low-level features: select the most relevant ones to build better, faster, and easier-to-understand learning machines. [Figure: data matrix X with dimensions m and n, reduced to n’ selected features.]

Applications: bioinformatics, quality control, machine vision, customer knowledge, OCR, HWR, market analysis, text categorization, system diagnosis. [Figure: applications placed by number of variables/features and number of examples.]

Nomenclature Univariate method: considers one variable (feature) at a time. Multivariate method: considers subsets of variables (features) together. Filter method: ranks features or feature subsets independently of the predictor (classifier). Wrapper method: uses a classifier to assess features or feature subsets.

Univariate Filter Methods

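Univariate feature ranking. Normally distributed classes, equal variance $\sigma^2$ unknown; estimated from data as $\sigma^2_{within}$. Null hypothesis $H_0$: $\mu_+ = \mu_-$. T statistic: if $H_0$ is true, $t = (\mu_+ - \mu_-) / (\sigma_{within} \sqrt{1/m_+ + 1/m_-}) \sim \mathrm{Student}(m_+ + m_- - 2 \text{ d.f.})$. [Figure: class-conditional densities $P(X_i|Y=-1)$ and $P(X_i|Y=1)$ along $x_i$, with means $\mu_-$, $\mu_+$ and standard deviations $\sigma_-$, $\sigma_+$.]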
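A minimal sketch of ranking features with the t statistic above, assuming labels coded as +1/-1 and a data matrix given as a list of rows. The function names `t_statistic` and `rank_features` and the toy data are illustrative, not from the slides.

```python
import math

def t_statistic(pos, neg):
    """Two-sample t statistic with pooled (within-class) variance."""
    m_pos, m_neg = len(pos), len(neg)
    mu_pos = sum(pos) / m_pos
    mu_neg = sum(neg) / m_neg
    # Pooled within-class variance, m_pos + m_neg - 2 degrees of freedom.
    ss = sum((x - mu_pos) ** 2 for x in pos) + sum((x - mu_neg) ** 2 for x in neg)
    sigma_within = math.sqrt(ss / (m_pos + m_neg - 2))
    return (mu_pos - mu_neg) / (sigma_within * math.sqrt(1 / m_pos + 1 / m_neg))

def rank_features(X, y):
    """Return (feature indices sorted by decreasing |t|, list of |t| scores)."""
    n = len(X[0])
    scores = []
    for i in range(n):
        pos = [row[i] for row, label in zip(X, y) if label == 1]
        neg = [row[i] for row, label in zip(X, y) if label == -1]
        scores.append(abs(t_statistic(pos, neg)))
    return sorted(range(n), key=lambda i: -scores[i]), scores

# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
X = [[1.0, 0.3], [1.2, 0.1], [0.9, 0.4], [0.1, 0.2], [0.0, 0.35], [0.2, 0.15]]
y = [1, 1, 1, -1, -1, -1]
order, scores = rank_features(X, y)
```

Under the null hypothesis the statistic can then be converted to a p-value with a Student distribution table, which this sketch leaves out.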

Statistical tests (chap. 2) (Guyon, Dreyfus, 2006). $H_0$: X and Y are independent. Relevance index $\rightarrow$ test statistic. P-value $\rightarrow$ false positive rate $FPR = n_{fp} / n_{irr}$. Multiple testing problem: use Bonferroni correction $pval \rightarrow n \cdot pval$. False discovery rate: $FDR = n_{fp} / n_{sc} \le FPR \cdot n / n_{sc}$. Probe method: $FPR \approx n_{sp} / n_p$. [Figure: null distribution of the relevance index r, with pval as the tail area beyond the observed value $r_0$.]
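The three corrections on this slide are simple arithmetic and can be sketched as follows. The reading of the symbols is an assumption: n is the total number of features tested, n_sc the number of selected candidate features, n_sp the number of selected probes, and n_p the total number of probes (random fake features added to the data). Function names and numbers are illustrative.

```python
def bonferroni(pvals):
    """Bonferroni correction: pval -> n * pval, capped at 1."""
    n = len(pvals)
    return [min(1.0, n * p) for p in pvals]

def fdr_bound(fpr, n, n_sc):
    """Upper bound on the false discovery rate: FDR <= FPR * n / n_sc."""
    return min(1.0, fpr * n / n_sc)

def probe_fpr(probe_scores, threshold):
    """Probe estimate FPR ~ n_sp / n_p.

    n_sp counts the probes whose relevance index reaches the selection
    threshold; n_p is the total number of probes.
    """
    n_sp = sum(1 for s in probe_scores if s >= threshold)
    return n_sp / len(probe_scores)

corrected = bonferroni([0.001, 0.02, 0.5])          # three tests
bound = fdr_bound(fpr=0.01, n=1000, n_sc=50)        # 1000 features, 50 selected
estimate = probe_fpr([0.1, 0.5, 0.9, 0.2], threshold=0.4)
```

With these hypothetical numbers, selecting 50 of 1000 features at a 1% false positive rate still allows up to 20% of the selection to be false discoveries.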

Univariate Dependence. Independence: $P(X, Y) = P(X) P(Y)$. Measure of dependence: $MI(X, Y) = \int P(X,Y) \log \frac{P(X,Y)}{P(X)P(Y)} \, dX \, dY = KL( P(X,Y) \,\|\, P(X)P(Y) )$.
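For discrete variables the integral becomes a sum, and the probabilities can be replaced by empirical frequencies. The sketch below is a plug-in estimate under that assumption (natural logarithm, so the result is in nats); the function name `mutual_information` is illustrative.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of MI(X, Y) for two discrete samples of equal length."""
    m = len(xs)
    joint = Counter(zip(xs, ys))   # counts for P(X, Y)
    px = Counter(xs)               # counts for P(X)
    py = Counter(ys)               # counts for P(Y)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / m
        mi += p_xy * math.log(p_xy / ((px[x] / m) * (py[y] / m)))
    return mi

dependent = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])     # Y copies X
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])   # P(X,Y) = P(X)P(Y)
```

When Y is a copy of a balanced binary X, the estimate equals log 2; when the empirical joint factorizes, it is zero.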

Other criteria (chap. 3) (Wlodzislaw Duch, 2006). A choice of feature selection ranking methods depending on the nature of:
–the variables and the target (binary, categorical, continuous)
–the problem (dependencies between variables, linear/non-linear relationships between variables and target)
–the available data (number of examples and number of variables, noise in data)
–the available tabulated statistics.

Multivariate Methods

Univariate selection may fail Guyon-Elisseeff, JMLR 2004; Springer 2006
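A standard illustration of this failure, given here as an assumed example and not necessarily the one pictured on the slide, is an XOR-like target: each feature taken alone is uncorrelated with Y, yet the two features together determine Y exactly. The helper `pearson` is illustrative.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    m = len(a)
    mean_a, mean_b = sum(a) / m, sum(b) / m
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

X1 = [-1, -1, 1, 1]
X2 = [-1, 1, -1, 1]
Y = [x1 * x2 for x1, x2 in zip(X1, X2)]   # XOR-like target in +1/-1 coding

r1 = pearson(X1, Y)                        # feature 1 alone
r2 = pearson(X2, Y)                        # feature 2 alone
r12 = pearson([a * b for a, b in zip(X1, X2)], Y)   # both features jointly
```

A univariate filter based on correlation would discard both features here, although the pair is perfectly predictive.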

Filters, Wrappers, and Embedded methods
Filter: All features → Filter → Feature subset → Predictor
Wrapper: All features → Wrapper → Multiple feature subsets → Predictor
Embedded method: All features → Embedded method → Feature subset → Predictor

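Relief (Kira and Rendell, 1992). Relief = $D_{miss} / D_{hit}$, where $D_{hit}$ is the distance to the nearest hit and $D_{miss}$ the distance to the nearest miss. [Figure: an example with its nearest hit and nearest miss.]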
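A simplified sketch of a Relief-style score, assuming one nearest hit (closest example of the same class) and one nearest miss (closest example of the other class) per example, Euclidean distance for the neighbor search, and a per-feature score equal to the summed miss distances over the summed hit distances along that feature. This is one common variant, not necessarily the exact formula on the slide; `relief_scores` and the toy data are illustrative.

```python
import math

def relief_scores(X, y):
    """Per-feature ratio of summed nearest-miss to nearest-hit distances."""
    m, n = len(X), len(X[0])
    hit_sum = [0.0] * n
    miss_sum = [0.0] * n
    for i in range(m):
        hits = [j for j in range(m) if j != i and y[j] == y[i]]
        misses = [j for j in range(m) if y[j] != y[i]]
        # Nearest neighbors are found in the full feature space.
        hit = min(hits, key=lambda j: math.dist(X[i], X[j]))
        miss = min(misses, key=lambda j: math.dist(X[i], X[j]))
        # Distances are then projected on each feature separately.
        for f in range(n):
            hit_sum[f] += abs(X[i][f] - X[hit][f])
            miss_sum[f] += abs(X[i][f] - X[miss][f])
    eps = 1e-12  # guards against division by zero
    return [miss_sum[f] / (hit_sum[f] + eps) for f in range(n)]

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.5], [0.1, 0.1], [0.2, 0.9], [1.0, 0.4], [1.1, 0.8], [1.2, 0.2]]
y = [-1, -1, -1, 1, 1, 1]
scores = relief_scores(X, y)
```

A relevant feature gets a score above 1 (misses are farther than hits along it), an irrelevant one a score near or below 1.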

Wrappers for feature selection. N features, $2^N$ possible feature subsets! Kohavi-John, 1997

Search Strategies (chap. 4) (Juha Reunanen, 2006). Exhaustive search. Stochastic search (simulated annealing, genetic algorithms). Beam search: keep the k best paths at each step. Greedy search: forward selection or backward elimination. Floating search: alternate forward and backward strategies.

Forward Selection (wrapper). Also referred to as SFS: Sequential Forward Selection. [Figure: search diagram from Start, with levels labeled n, n-1, n-2, …, 1.]
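A minimal sketch of wrapper-style forward selection: start from the empty set and, at each step, add the feature whose inclusion gives the best predictor score. The predictor here is a 1-nearest-neighbor classifier scored by leave-one-out accuracy, which is an assumption made for brevity; any classifier and validation scheme could take its place. The names `loo_1nn_accuracy` and `forward_selection` and the toy data are illustrative.

```python
def loo_1nn_accuracy(X, y, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier on `subset`."""
    m = len(X)
    correct = 0
    for i in range(m):
        nearest = min(
            (j for j in range(m) if j != i),
            key=lambda j: sum((X[i][f] - X[j][f]) ** 2 for f in subset),
        )
        correct += y[nearest] == y[i]
    return correct / m

def forward_selection(X, y, k):
    """Greedy SFS: grow the subset one feature at a time, up to k features."""
    selected, remaining = [], list(range(len(X[0])))
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda f: loo_1nn_accuracy(X, y, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.5], [0.1, 0.1], [0.2, 0.9], [1.0, 0.4], [1.1, 0.8], [1.2, 0.2]]
y = [-1, -1, -1, 1, 1, 1]
first = forward_selection(X, y, 1)
```

Each step retrains and re-evaluates the predictor once per remaining feature, which is what makes wrappers more expensive than filters.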

Forward Selection (embedded). Guided search: we do not consider alternative paths. Typical examples: Gram-Schmidt orthogonalization and tree classifiers. [Figure: search diagram from Start, with levels labeled n, n-1, n-2, …, 1.]

Backward Elimination (wrapper). Also referred to as SBS: Sequential Backward Selection. [Figure: search diagram from Start, with levels labeled 1, …, n-2, n-1, n.]

Backward Elimination (embedded). Guided search: we do not consider alternative paths. Typical example: “recursive feature elimination” RFE-SVM. [Figure: search diagram from Start, with levels labeled 1, …, n-2, n-1, n.]
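The recursive elimination loop can be sketched as follows: train a linear model on the active features, drop the feature with the smallest squared weight, and repeat. To stay self-contained, the sketch swaps the SVM for a much simpler linear model whose weights are the difference of class means; that substitution, the names `rfe` and `centroid_weights`, and the toy data are assumptions, not the RFE-SVM of the slide.

```python
def centroid_weights(X, y):
    """Stand-in linear model: weight vector w = mean(class +1) - mean(class -1)."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == -1]
    n = len(X[0])
    return [
        sum(r[f] for r in pos) / len(pos) - sum(r[f] for r in neg) / len(neg)
        for f in range(n)
    ]

def rfe(X, y, train_weights, n_keep):
    """Recursive feature elimination: retrain, drop the smallest w^2, repeat."""
    active = list(range(len(X[0])))
    eliminated = []
    while len(active) > n_keep:
        reduced = [[row[f] for f in active] for row in X]
        w = train_weights(reduced, y)
        worst = min(range(len(active)), key=lambda k: w[k] ** 2)
        eliminated.append(active.pop(worst))
    return active, eliminated

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 0.5], [0.1, 0.1], [0.2, 0.9], [1.0, 0.4], [1.1, 0.8], [1.2, 0.2]]
y = [-1, -1, -1, 1, 1, 1]
kept, dropped = rfe(X, y, centroid_weights, n_keep=1)
```

The search is guided by the trained model itself, so only one path through the subsets is explored, in contrast with a wrapper that scores several candidate subsets per step.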

Scaling Factors. Idea: transform a discrete space into a continuous space. Discrete indicators of feature presence: $\sigma_i \in \{0, 1\}$. Continuous scaling factors: $\sigma_i \in \mathbb{R}$, with $\sigma = [\sigma_1, \sigma_2, \sigma_3, \sigma_4]$. Now we can do gradient descent!
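A toy sketch of gradient descent on continuous scaling factors, under strong simplifying assumptions: the predictor is f(x) = Σ σ_i x_i (a linear model with unit weights held fixed, so the scaling factors are the only free parameters), the loss is the mean squared error, and an L1 penalty on σ (a choice made here, not stated on the slide) pushes irrelevant factors toward zero. All names and constants are illustrative.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def fit_scaling_factors(X, y, lam=0.1, lr=0.1, steps=200):
    """(Sub)gradient descent on mean squared error + lam * sum(|sigma_i|)."""
    m, n = len(X), len(X[0])
    sigma = [0.5] * n
    for _ in range(steps):
        residuals = [
            target - sum(s * x for s, x in zip(sigma, row))
            for row, target in zip(X, y)
        ]
        grad = [
            -2.0 / m * sum(r * row[i] for r, row in zip(residuals, X))
            + lam * sign(sigma[i])
            for i in range(n)
        ]
        sigma = [s - lr * g for s, g in zip(sigma, grad)]
    return sigma

# Toy data (hypothetical): y equals feature 0; feature 1 is orthogonal to y.
X = [[1, 1], [-1, 1], [1, -1], [-1, -1]]
y = [1, -1, 1, -1]
sigma = fit_scaling_factors(X, y)
```

After training, features can be ranked or thresholded by the magnitude of their scaling factor; here the factor of the relevant feature settles near 0.95 while the other stays close to zero.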

Formalism (chap. 5) (Lal, Chapelle, Weston, Elisseeff, 2006). Many learning algorithms are cast into a minimization of some regularized functional: empirical error + regularization (capacity control). Justification of RFE and many other embedded methods.
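One common way to write such a functional is given below. The notation is assumed rather than taken from the slide: L is a loss function, f a predictor with parameters α, Ω a regularizer, λ a trade-off constant, and (x_k, y_k), k = 1, …, m, the training examples.

```latex
\min_{\alpha} \;
\underbrace{\frac{1}{m} \sum_{k=1}^{m} L\big(f(\alpha, x_k),\, y_k\big)}_{\text{empirical error}}
\; + \;
\underbrace{\lambda\, \Omega(\alpha)}_{\text{regularization (capacity control)}}
```

In an embedded method with scaling factors σ, the predictor would be evaluated on the rescaled inputs σ ∘ x_k and the minimization carried out over σ as well.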

Embedded method. Embedded methods are a good inspiration to design new feature selection techniques for your own algorithms:
–Find a functional that represents your prior knowledge about what a good model is.
–Add the $\sigma$ weights into the functional and make sure it is either differentiable or you can perform a sensitivity analysis efficiently.
–Optimize alternately according to $\alpha$ and $\sigma$.
–Use early stopping (validation set) or your own stopping criterion to stop and select the subset of features.
Embedded methods are therefore not too far from wrapper techniques and can be extended to multiclass, regression, etc.

Causality

What can go wrong? Guyon-Aliferis-Elisseeff, 2007

What can go wrong?

Guyon-Aliferis-Elisseeff, 2007. [Figure: variables $X_1$ and $X_2$ and target Y.]

Local causal graph. Nodes: Lung Cancer, Smoking, Genetics, Coughing, Attention Disorder, Allergy, Anxiety, Peer Pressure, Yellow Fingers, Car Accident, Born an Even Day, Fatigue.

What works and why?

Bilevel optimization. N variables/features, M samples. Split data into 3 sets: training, validation, and test set (of sizes $m_1$, $m_2$, $m_3$).
1) For each feature subset, train predictor on training data.
2) Select the feature subset which performs best on validation data.
–Repeat and average if you want to reduce variance (cross-validation).
3) Test on test data.
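The three steps can be sketched on synthetic data as follows. The nearest-centroid predictor, the exhaustive enumeration of subsets (feasible only because there are three features), the data generator, and all names are assumptions chosen to keep the example short.

```python
import itertools
import random

def fit_centroids(X, y, subset):
    """Train a nearest-centroid predictor restricted to the features in `subset`."""
    centroids = {}
    for label in (1, -1):
        rows = [row for row, l in zip(X, y) if l == label]
        centroids[label] = [sum(r[f] for r in rows) / len(rows) for f in subset]
    return centroids

def error_rate(centroids, X, y, subset):
    """Fraction of examples assigned to the wrong class centroid."""
    wrong = 0
    for row, label in zip(X, y):
        dist = {
            c: sum((row[f] - mu) ** 2 for f, mu in zip(subset, centroids[c]))
            for c in centroids
        }
        wrong += min(dist, key=dist.get) != label
    return wrong / len(X)

# Synthetic data: feature 0 carries the label, features 1 and 2 are noise.
rng = random.Random(0)
y = [1 if i % 2 == 0 else -1 for i in range(60)]
X = [[l + rng.uniform(-0.3, 0.3), rng.gauss(0, 1), rng.gauss(0, 1)] for l in y]
train, val, test = slice(0, 20), slice(20, 40), slice(40, 60)

# Steps 1 and 2: train on each subset, keep the one with lowest validation error.
best_subset, best_val = None, float("inf")
for size in range(1, 4):
    for subset in itertools.combinations(range(3), size):
        model = fit_centroids(X[train], y[train], subset)
        val_err = error_rate(model, X[val], y[val], subset)
        if val_err < best_val:
            best_subset, best_val = subset, val_err

# Step 3: report the error of the selected subset on the untouched test set.
final_model = fit_centroids(X[train], y[train], best_subset)
test_err = error_rate(final_model, X[test], y[test], best_subset)
```

The test set plays no part in choosing the subset, so `test_err` is not biased by the selection; the validation error of the winning subset generally is.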

Complexity of Feature Selection. With high probability: Generalization_error $\le$ Validation_error + $\epsilon(C/m_2)$. $m_2$: number of validation examples, N: total number of features, n: feature subset size. Try to keep C of the order of $m_2$. [Figure: error as a function of the feature subset size n.]

Insensitivity to irrelevant features. Simple univariate predictive model, binary target and features; all relevant features correlate perfectly with the target, all irrelevant features randomly drawn. With 98% confidence, abs(feat_weight) < $w$ and $\sum_i w_i x_i < v$. $n_g$: number of “good” (relevant) features; $n_b$: number of “bad” (irrelevant) features; $m$: number of training examples.

Conclusion. Feature selection focuses on uncovering subsets of variables $X_1$, $X_2$, … predictive of the target Y. Multivariate feature selection is in principle more powerful than univariate feature selection, but not always in practice. Taking a closer look at the type of dependencies in terms of causal relationships may help refine the notion of variable relevance. Feature selection and causal discovery may be more harmful than useful. Causality can help ML, but ML can also help causality.