Feature selection and causal discovery: fundamentals and applications. Isabelle Guyon.

Feature Selection. Thousands to millions of low-level features: select the most relevant ones to build better, faster, and easier-to-understand learning machines. [Figure: data matrix X (m samples x n features) reduced to n' selected features.]

Leukemia Diagnosis. Golub et al., Science, vol. 286, 15 Oct. 1999. [Figure: gene expression matrix (m samples, n' selected genes) with class labels {y_i}, i = 1:m.]

RFE-SVM, Guyon, Weston, et al., US patent 7,117,188. Application to prostate cancer, Elisseeff-Weston, 2001. [Figure: prostate cancer genes (e.g., U29589, HOXC8, RACH1) differentiating BPH, G3, and G4 samples.]

RFE-SVM for cancer diagnosis. Differentiation of 14 tumor types. Ramaswamy et al., PNAS, 2001.

QSAR: Drug Screening. Binding to thrombin (DuPont Pharmaceuticals): compounds were tested for their ability to bind to a target site on thrombin, a key receptor in blood clotting; 192 are "active" (bind well), the rest "inactive". The training set (1909 compounds) is more depleted in active compounds. 139,351 binary features describe three-dimensional properties of each molecule. Weston et al., Bioinformatics, 2002. [Figure: performance as a function of the number of features.]

Text Filtering. Bekkerman et al., JMLR, 2003. Corpora: Reuters (news wire, 114 semantic categories); 20 Newsgroups (articles in 20 categories); WebKB (8282 web pages, 7 categories). Bag-of-words representation (very high-dimensional feature space). Top 3 words of some categories: alt.atheism: atheism, atheists, morality; comp.graphics: image, jpeg, graphics; sci.space: space, nasa, orbit; soc.religion.christian: god, church, sin; talk.politics.mideast: israel, armenian, turkish; talk.religion.misc: jesus, god, jehovah.

Face Recognition. Male/female classification: 1450 images (1000 train, 450 test), 5100 features (60x85-pixel images). Navot-Bachrach-Tishby, ICML 2004. [Figure: pixels selected by Relief vs. Simba.]

Nomenclature Univariate method: considers one variable (feature) at a time. Multivariate method: considers subsets of variables (features) together. Filter method: ranks features or feature subsets independently of the predictor (classifier). Wrapper method: uses a classifier to assess features or feature subsets.

Univariate Filter Methods

Individual Feature Irrelevance. P(X_i, Y) = P(X_i) P(Y), i.e., P(X_i | Y) = P(X_i), i.e., P(X_i | Y=1) = P(X_i | Y=-1). [Figure: overlapping class-conditional densities of x_i for Y=1 and Y=-1.]

S2N. Golub et al., Science, vol. 286, 15 Oct. 1999. Signal-to-noise ratio: S2N = |μ+ - μ-| / (σ+ + σ-). S2N is related to the correlation coefficient R(x, y) after "standardization" x ← (x - μ_x) / σ_x. [Figure: class-conditional distributions with means μ+, μ- and standard deviations σ+, σ-.]
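The S2N criterion is straightforward to compute. Below is a minimal NumPy sketch (my own illustration, not Golub's code), assuming a data matrix X of shape (m samples, n features) and labels y coded as ±1; the toy data and the small epsilon guarding against division by zero are assumptions of the example.

```python
import numpy as np

def s2n_scores(X, y):
    """Signal-to-noise criterion |mu+ - mu-| / (sigma+ + sigma-) per feature."""
    Xp, Xm = X[y == 1], X[y == -1]
    mu_p, mu_m = Xp.mean(axis=0), Xm.mean(axis=0)
    sd_p, sd_m = Xp.std(axis=0), Xm.std(axis=0)
    return np.abs(mu_p - mu_m) / (sd_p + sd_m + 1e-12)  # eps avoids division by zero

# Toy data: 100 samples, 500 features, the first 5 shifted in the positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))
y = np.where(rng.random(100) > 0.5, 1, -1)
X[y == 1, :5] += 1.0
print("Top 10 features by S2N:", np.argsort(-s2n_scores(X, y))[:10])
```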

Univariate Dependence. Independence: P(X, Y) = P(X) P(Y). Measure of dependence (mutual information): MI(X, Y) = ∫ P(X,Y) log [ P(X,Y) / (P(X)P(Y)) ] dX dY = KL( P(X,Y) || P(X)P(Y) ).
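As an illustration, scikit-learn provides a nearest-neighbor estimator of this mutual information; the sketch below ranks features with it on synthetic data (the data and the choice of estimator are my own, not from the slides).

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
# A non-linear dependence on features 0 and 1 that a linear criterion may miss.
y = (X[:, 0] ** 2 + X[:, 1] ** 2 + 0.2 * rng.normal(size=200) > 2.0).astype(int)

mi = mutual_info_classif(X, y, random_state=0)  # one MI estimate per feature
print("Highest-MI features:", np.argsort(-mi)[:5])
```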

Other criteria (chap. 3). The choice of a feature selection ranking method depends on the nature of: the variables and the target (binary, categorical, continuous); the problem (dependencies between variables, linear/non-linear relationships between variables and target); the available data (number of examples, number of variables, noise in the data); the available tabulated statistics.

T-test. Normally distributed classes, equal variance σ² unknown, estimated from the data as σ²_within. Null hypothesis H0: μ+ = μ-. T statistic: if H0 is true, t = (μ+ - μ-) / (σ_within √(1/m+ + 1/m-)) follows a Student distribution with m+ + m- - 2 d.f. [Figure: class-conditional densities P(X_i|Y=1) and P(X_i|Y=-1).]
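A minimal sketch with SciPy's two-sample t-test applied feature by feature, assuming equal class variances as on the slide; the synthetic data are an assumption of the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 200))
y = np.where(rng.random(80) > 0.5, 1, -1)
X[y == 1, :3] += 1.5                    # three truly relevant features

# Pooled-variance t-test per feature: Student t with m+ + m- - 2 d.f. under H0.
t, pval = stats.ttest_ind(X[y == 1], X[y == -1], axis=0, equal_var=True)
print("Features with the smallest p-values:", np.argsort(pval)[:5])
```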

Statistical tests (chap. 2). H0: X and Y are independent. Relevance index → test statistic. P-value → false positive rate FPR = n_fp / n_irr. Multiple testing problem: use the Bonferroni correction, pval ← n · pval. False discovery rate: FDR = n_fp / n_sc ≤ FPR · n / n_sc. Probe method: FPR ≈ n_sp / n_p. [Figure: null distribution of the relevance index, with threshold r0 and observed value r.]
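The probe method can be sketched in a few lines: add fake "probe" features known to be irrelevant (here, permuted copies of the real features), rank everything with the same relevance index, and estimate the false positive rate at a given cutoff as the fraction of probes that pass it. The relevance index (absolute correlation), the number of probes, and the cutoff are choices of this illustration, not prescriptions from the slides.

```python
import numpy as np

def abs_corr(X, y):
    """Absolute Pearson correlation of each column of X with y (a relevance index)."""
    Xc = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    yc = (y - y.mean()) / (y.std() + 1e-12)
    return np.abs(Xc.T @ yc) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 300))
y = np.where(rng.random(100) > 0.5, 1.0, -1.0)
X[y == 1, :10] += 1.0                          # 10 relevant features, 290 irrelevant

probes = rng.permuted(X, axis=0)               # same marginals, but no link to y
r_real, r_probe = abs_corr(X, y), abs_corr(probes, y)

threshold = np.sort(r_real)[-20]               # cutoff keeping the top 20 real features
print("Estimated FPR at this cutoff:", np.mean(r_probe >= threshold))  # ~ n_sp / n_p
```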

Multivariate Methods

Univariate selection may fail Guyon-Elisseeff, JMLR 2004; Springer 2006

Filters, Wrappers, and Embedded Methods. [Figure: three pipelines. Filter: all features → filter → feature subset → predictor. Wrapper: all features → multiple feature subsets assessed with the predictor → feature subset → predictor. Embedded method: all features → feature subset selected within the training of the predictor → predictor.]

Relief (Kira and Rendell, 1992). For each example, find its nearest hit (nearest example of the same class) and its nearest miss (nearest example of the other class), giving distances D_hit and D_miss. The Relief relevance index compares them, Relief = <D_miss> / <D_hit> (averaged over examples): a relevant feature yields small D_hit and large D_miss. [Figure: an example with its nearest hit and nearest miss.]
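Below is a small sketch using the classic additive Relief weight update (accumulate the feature-wise distance to the nearest miss minus the distance to the nearest hit), rather than the ratio shown on the slide; it assumes numeric features on comparable scales and binary labels, and is my own simplified implementation, not Kira and Rendell's.

```python
import numpy as np

def relief_weights(X, y, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iter):
        i = rng.integers(m)
        xi, yi = X[i], y[i]
        d = np.abs(X - xi).sum(axis=1)                  # L1 distance to every sample
        d[i] = np.inf                                   # exclude the sample itself
        hit = np.argmin(np.where(y == yi, d, np.inf))   # nearest same-class example
        miss = np.argmin(np.where(y != yi, d, np.inf))  # nearest other-class example
        w += np.abs(xi - X[miss]) - np.abs(xi - X[hit])
    return w / n_iter

rng = np.random.default_rng(0)
X = rng.random((200, 20))
y = (X[:, 0] + X[:, 1] > 1).astype(int)                 # only features 0 and 1 matter
print("Top Relief features:", np.argsort(-relief_weights(X, y))[:5])
```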

Wrappers for feature selection. N features, 2^N possible feature subsets! Kohavi-John, 1997.

Search Strategies (chap. 4). Exhaustive search. Simulated annealing, genetic algorithms. Beam search: keep the k best paths at each step. Greedy search: forward selection or backward elimination. PTA(l,r): plus l, take away r; at each step, run SFS l times then SBS r times. Floating search (SFFS and SBFS): one step of SFS (resp. SBS), then SBS (resp. SFS) as long as we find better subsets than those of the same size obtained so far; at any time, if a better subset of the same size was already found, switch abruptly.

Feature subset assessment. Split the data (M samples, N variables/features) into three sets of sizes m1, m2, m3: training, validation, and test. 1) For each feature subset, train the predictor on the training data. 2) Select the feature subset that performs best on the validation data; repeat and average if you want to reduce variance (cross-validation). 3) Assess the final result on the test data.
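A hedged scikit-learn sketch of this protocol, with hypothetical candidate subsets and a logistic-regression predictor chosen only for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 30))
y = (X[:, 0] - X[:, 3] + 0.5 * rng.normal(size=300) > 0).astype(int)

# Split into training, validation, and test sets (sizes m1, m2, m3).
X_trv, X_test, y_trv, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_trv, y_trv, test_size=0.25, random_state=0)

candidate_subsets = [[0, 3], [0, 1, 2], list(range(10))]            # hypothetical subsets
scores = []
for s in candidate_subsets:
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, s], y_tr)   # 1) train
    scores.append(clf.score(X_val[:, s], y_val))                    # 2) validate
best = candidate_subsets[int(np.argmax(scores))]

final = LogisticRegression(max_iter=1000).fit(X_trv[:, best], y_trv)
print("Test accuracy:", final.score(X_test[:, best], y_test))       # 3) test
```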

Three "Ingredients". Criterion: single feature relevance; relevance in context; feature subset relevance; performance of the learning machine. Search: single feature ranking; nested subsets, forward selection / backward elimination; heuristic or stochastic search; exhaustive search. Assessment: statistical tests; cross-validation; performance bounds.

Forward Selection (wrapper). Also referred to as SFS: Sequential Forward Selection. [Figure: search tree starting from the empty set, with n, n-1, n-2, ..., 1 candidate features remaining at successive steps; alternative paths are explored.]
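A minimal greedy SFS wrapper sketch (my own illustration, not code from the lecture): at each step, add the feature that most improves the cross-validated accuracy of an arbitrary base classifier, here logistic regression.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

def sequential_forward_selection(X, y, k):
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        best_f, best_score = None, -np.inf
        for f in remaining:                     # try adding each remaining feature
            score = cross_val_score(LogisticRegression(max_iter=1000),
                                    X[:, selected + [f]], y, cv=5).mean()
            if score > best_score:
                best_f, best_score = f, score
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 15))
y = (X[:, 2] + X[:, 7] > 0).astype(int)
print(sequential_forward_selection(X, y, k=3))  # expect 2 and 7 among the picks
```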

Forward Selection (embedded). Guided search: we do not consider alternative paths. [Figure: a single path through the search tree, from the empty set up to the full feature set.]

Forward Selection with GS (Gram-Schmidt orthogonalization; Stoppiglia et al.). An embedded method for the linear least-squares predictor. Select a first feature X_ν(1) with maximum cosine with the target: cos(x_i, y) = x_i · y / (||x_i|| ||y||). For each remaining feature X_i: project X_i and the target Y on the null space of the features already selected; compute the cosine of X_i with the target in the projection. Select the feature X_ν(k) with maximum cosine with the target in the projection.
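A NumPy sketch of this Gram-Schmidt forward selection, assuming centered data; it follows the recipe above but is not Stoppiglia et al.'s code, and the toy regression data are an assumption of the example.

```python
import numpy as np

def gram_schmidt_selection(X, y, k):
    X = X - X.mean(axis=0)
    y = y - y.mean()
    Xr, yr = X.copy(), y.copy()                 # residuals after successive projections
    selected = []
    for _ in range(k):
        norms = np.linalg.norm(Xr, axis=0) * np.linalg.norm(yr) + 1e-12
        cosines = np.abs(Xr.T @ yr) / norms
        cosines[selected] = -np.inf             # never reselect a feature
        j = int(np.argmax(cosines))
        selected.append(j)
        u = Xr[:, j] / (np.linalg.norm(Xr[:, j]) + 1e-12)
        Xr = Xr - np.outer(u, u @ Xr)           # project features on the orthogonal complement
        yr = yr - u * (u @ yr)                  # project the target as well
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 2 * X[:, 4] - X[:, 9] + 0.1 * rng.normal(size=100)
print(gram_schmidt_selection(X, y, k=3))        # features 4 and 9 should come first
```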

Forward Selection w. Trees. Tree classifiers, like CART (Breiman, 1984) or C4.5 (Quinlan, 1993). At each step, choose the feature that "reduces entropy" most. Work towards "node purity". [Figure: all the data split first on f1, then on f2.]

Backward Elimination (wrapper). Also referred to as SBS: Sequential Backward Selection. [Figure: search tree starting from the full set of n features, eliminating one feature at a time down to 1; alternative paths are explored.]

Backward Elimination (embedded). [Figure: a single guided path from the full set of n features down to 1.]

Backward Elimination: RFE. Start with all the features. Train a learning machine f on the current subset of features by minimizing a risk functional J[f]. For each (remaining) feature X_i, estimate, without retraining f, the change in J[f] resulting from the removal of X_i. Remove the feature X_ν(k) that results in improving or least degrading J. An embedded method for SVMs, kernel methods, neural nets. RFE-SVM, Guyon, Weston, et al., US patent 7,117,188.
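For linear SVMs this is conveniently available in scikit-learn; the sketch below is in the spirit of RFE-SVM but uses the library's generic RFE wrapper rather than the original implementation, and the data are synthetic.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 40))
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

# Eliminate one feature at a time, keeping the 5 with the largest weight magnitudes.
rfe = RFE(estimator=LinearSVC(C=1.0, dual=False), n_features_to_select=5, step=1)
rfe.fit(X, y)
print("Selected features:", np.where(rfe.support_)[0])
```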

Scaling Factors. Idea: transform a discrete space into a continuous space. Discrete indicators of feature presence: σ_i ∈ {0, 1}. Continuous scaling factors: σ_i ∈ ℝ, σ = [σ1, σ2, σ3, σ4]. Now we can do gradient descent!

Learning with scaling factors. [Figure: data matrix X = {x_ij} (m samples x n features) with targets y = {y_j}; each feature is weighted by its scaling factor σ_i.]

Formalism (chap. 5). (Next few slides: André Elisseeff.) Many learning algorithms are cast as the minimization of a regularized functional of the form G = Σ_k L(f(α; x_k), y_k) + Ω[f], i.e., an empirical error term plus a regularization (capacity control) term.

Add/Remove features. It can be shown (under some conditions) that the removal of one feature induces a change in G proportional to the gradient of f with respect to the i-th feature, evaluated at the training points x_k. Example: SVMs.

Recursive Feature Elimination. Minimize the estimate of R(α, σ) wrt. α. Then minimize the estimate of R(α, σ) wrt. σ under the constraint that only a limited number of features be selected.

Gradient descent. How to minimize the objective with respect to the scaling factors σ? Most approaches use the following method: a gradient step on σ in [0,1]^n. Would it make sense to perform just a gradient step here too? It mixes with many algorithms, but involves heavy computations and local minima.

Minimization of a sparsity function. Minimize the number of features used. Replace this count by another objective function: the l1 norm Σ_i |σ_i|, or a differentiable approximation of the zero norm (cf. the l0 SVM below). Optimize jointly with the primary objective (good prediction of the target).

The l1 SVM. The version of the SVM where ||w||² is replaced by the l1 norm Σ_i |w_i| can be considered as an embedded method: only a limited number of weights will be non-zero (tends to remove redundant features); this differs from the regular SVM, where redundant features all get non-zero weights. Bi et al., 2003; Zhu et al., 2003.
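A scikit-learn sketch of an l1-regularized linear SVM; the sparsity pattern of the learned weights gives the selected features. The regularization constant C and the synthetic data are choices of this illustration.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 100))
y = (X[:, 0] - 2 * X[:, 1] > 0).astype(int)

# The l1 penalty in scikit-learn requires the squared hinge loss and the primal solver.
clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.1, max_iter=5000)
clf.fit(X, y)
print("Non-zero weights:", np.flatnonzero(clf.coef_.ravel()))
```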

Ridge regression vs. Lasso (Tibshirani, 1996): mechanical interpretation. [Figure: weight space (w1, w2) around the unregularized solution w*. Ridge regression: J = ||w||₂² + 1/λ ||w - w*||². Lasso: J = ||w||₁ + 1/λ ||w - w*||²; the l1 penalty favors sparse solutions.]

The l0 SVM (Weston et al., 2003). Replace the regularizer ||w||² by the l0 norm ||w||₀. Further replace it by Σ_i log(ε + |w_i|). This boils down to a multiplicative update algorithm: train an SVM, rescale the inputs by |w_i|, and iterate.
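A rough sketch of that multiplicative update, in the spirit of Weston et al. (2003) but with my own choices for the number of iterations, the regularization constant, and the pruning threshold: train a linear SVM, rescale the inputs by |w_i|, and iterate so that the scales of useless features shrink towards zero.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

scale = np.ones(X.shape[1])
for _ in range(10):                                   # fixed number of iterations
    clf = LinearSVC(C=1.0, dual=False, max_iter=5000).fit(X * scale, y)
    scale *= np.abs(clf.coef_.ravel())                # multiplicative update
    scale /= scale.max() + 1e-12                      # keep scales bounded
print("Surviving features:", np.flatnonzero(scale > 1e-6))
```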

Embedded method - summary. Embedded methods are a good inspiration to design new feature selection techniques for your own algorithms: find a functional that represents your prior knowledge about what a good model is; add the σ weights into the functional and make sure it is either differentiable or that you can perform a sensitivity analysis efficiently; optimize alternately with respect to α and σ; use early stopping (validation set) or your own stopping criterion to stop and select the subset of features. Embedded methods are therefore not too far from wrapper techniques and can be extended to multiclass problems, regression, etc.

Causality

Variable/feature selection. Remove features X_i to improve (or least degrade) the prediction of Y. [Figure: set of input variables X feeding a predictor of Y.]

What can go wrong? Guyon-Aliferis-Elisseeff, 2007

What can go wrong?

Guyon-Aliferis-Elisseeff, 2007. [Figure: two scenarios involving variables X1, X2 and target Y with different causal structures, illustrating how feature selection can go wrong.]

Causal feature selection. Uncover causal relationships between the X_i and Y. [Figure: variables X and target Y linked by causal arrows.]

Causal feature relevance. [Figure: causal graph around the target Lung cancer, with nodes Smoking, Anxiety, Allergy, Coughing, Genetic factor 1, Genetic factor 2, Hormonal factor, Metastasis, Other cancers, Tar in lungs, Biomarker 1, Biomarker 2, and Systematic noise.]

Formalism: Causal Bayesian networks. Bayesian network: a graph with random variables X1, X2, ..., Xn as nodes; dependencies represented by edges; allows us to compute P(X1, X2, ..., Xn) as Π_i P(Xi | Parents(Xi)); edge directions have no meaning. Causal Bayesian network: edge directions indicate causality.
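A tiny sketch of the factorization P(X1, ..., Xn) = Π_i P(Xi | Parents(Xi)) for a made-up three-node network Rain → WetGrass ← Sprinkler; all conditional probability tables are invented for illustration.

```python
# Hypothetical CPTs for Rain -> WetGrass <- Sprinkler (values are made up).
P_rain = {1: 0.2, 0: 0.8}
P_sprinkler = {1: 0.3, 0: 0.7}
P_wet_given = {(1, 1): 0.99, (1, 0): 0.90, (0, 1): 0.80, (0, 0): 0.05}

def joint(rain, sprinkler, wet):
    """P(rain, sprinkler, wet) = P(rain) * P(sprinkler) * P(wet | rain, sprinkler)."""
    p_wet1 = P_wet_given[(rain, sprinkler)]
    return P_rain[rain] * P_sprinkler[sprinkler] * (p_wet1 if wet else 1.0 - p_wet1)

# Sanity check: the joint distribution sums to 1 over all configurations.
total = sum(joint(r, s, w) for r in (0, 1) for s in (0, 1) for w in (0, 1))
print(joint(1, 0, 1), total)
```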

Example of Causal Discovery Algorithm. Algorithm: PC (Peter Spirtes and Clark Glymour, 1999). Let A, B, C ∈ X and V ⊆ X. Initialize with a fully connected un-oriented graph. 1. Find un-oriented edges by using the criterion that variable A shares a direct edge with variable B iff no subset of other variables V can render them conditionally independent (A ⊥ B | V). 2. Orient edges in "collider" triplets (i.e., of the type A → C ← B) using the criterion that if there are direct edges between A and C and between C and B, but not between A and B, then A → C ← B iff there is no subset V containing C such that A ⊥ B | V. 3. Further orient edges with a constraint-propagation method by adding orientations until no further orientation can be produced, using the two following criteria: (i) if A → B → ... → C and there is an undirected edge A - C, then A → C; (ii) if A → B and there is an undirected edge B - C, then B → C.
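The collider criterion of step 2 can be checked numerically. The sketch below uses a Gaussian (partial-correlation) conditional-independence test on synthetic data with structure A → C ← B: A and B look independent marginally, but become dependent once C is conditioned on. The test and the data are illustrative choices, not part of the PC specification.

```python
import numpy as np

def partial_corr(a, b, c=None):
    """Correlation of a and b, optionally after regressing both on c."""
    if c is not None:
        D = np.column_stack([np.ones_like(c), c])
        a = a - D @ np.linalg.lstsq(D, a, rcond=None)[0]
        b = b - D @ np.linalg.lstsq(D, b, rcond=None)[0]
    return np.corrcoef(a, b)[0, 1]

rng = np.random.default_rng(0)
A = rng.normal(size=5000)
B = rng.normal(size=5000)
C = A + B + 0.5 * rng.normal(size=5000)          # collider: A -> C <- B

print("corr(A, B)     =", round(partial_corr(A, B), 3))      # close to 0: independent
print("corr(A, B | C) =", round(partial_corr(A, B, C), 3))   # far from 0: dependent
```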

Computational and statistical complexity. Computing the full causal graph poses: computational challenges (intractable for large numbers of variables) and statistical challenges (difficulty of estimating conditional probabilities for many variables with few samples). Compromises: develop algorithms with good average-case performance, tractable for many real-life datasets; abandon learning the full causal graph and instead develop methods that learn a local neighborhood; abandon learning the fully oriented causal graph and instead develop methods that learn unoriented graphs.

A prototypical MB algorithm: HITON (Aliferis-Tsamardinos-Statnikov, 2003). [Figure: target Y surrounded by candidate variables.]

1 – Identify variables with direct edges to the target (parents/children). (Aliferis-Tsamardinos-Statnikov, 2003.) [Figure: target Y and its direct neighbors.]

1 – Identify variables with direct edges to the target (parents/children), continued. Iteration 1: add A. Iteration 2: add B. Iteration 3: remove A because A ⊥ Y | B; etc. (Aliferis-Tsamardinos-Statnikov, 2003.) [Figure: candidate variables A and B around the target Y.]

2 – Repeat the algorithm for the parents and children of Y (get depth-two relatives). (Aliferis-Tsamardinos-Statnikov, 2003.) [Figure: neighbors of the neighbors of Y.]

3 – Remove non-members of the MB. A member A of PCPC (the parents/children of the parents/children of Y) that is not in PC is a member of the Markov blanket if there is some member B of PC such that A becomes conditionally dependent with Y when conditioning on some subset of the remaining variables together with B. (Aliferis-Tsamardinos-Statnikov, 2003.) [Figure: spouse A linked to Y through the common child B.]
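For intuition, here is a much-simplified sketch of the interleaved add/remove idea of step 1 (not the full HITON algorithm): variables are added in order of association with Y, and a variable is dropped as soon as some small conditioning set drawn from the current candidates renders it independent of Y. The Gaussian partial-correlation test, the threshold, and the maximum conditioning-set size are my own choices.

```python
import numpy as np
from itertools import combinations

def ci_strength(x, target, Z):
    """|partial correlation| of x and target given the columns of Z."""
    if Z.shape[1] > 0:
        D = np.column_stack([np.ones(len(x)), Z])
        x = x - D @ np.linalg.lstsq(D, x, rcond=None)[0]
        target = target - D @ np.linalg.lstsq(D, target, rcond=None)[0]
    return abs(np.corrcoef(x, target)[0, 1])

def parents_children(X, y, thresh=0.1, max_cond=2):
    order = np.argsort(-np.abs(np.corrcoef(X.T, y)[-1, :-1]))  # strongest association first
    pc = []
    for v in order:
        pc.append(v)                                           # inclusion phase
        for a in list(pc):                                     # elimination phase
            others = [u for u in pc if u != a]
            for k in range(min(max_cond, len(others)) + 1):
                if any(ci_strength(X[:, a], y, X[:, list(S)]) < thresh
                       for S in combinations(others, k)):
                    pc.remove(a)
                    break
    return pc

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))
y = X[:, 0] - X[:, 3] + 0.3 * rng.normal(size=2000)            # direct parents: 0 and 3
print(parents_children(X, y))
```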

Wrapping up

Complexity of Feature Selection. With high probability: Generalization_error ≤ Validation_error + ε(C/m2), where m2 is the number of validation examples, N the total number of features, and n the feature subset size. Try to keep C of the order of m2. [Figure: error as a function of the feature subset size n.]

Examples of FS algorithms (keep C = O(m2); keep C = O(m1)):
Univariate, linear: T-test, AUC, feature ranking.
Univariate, non-linear: mutual information feature ranking.
Multivariate, linear: RFE with linear SVM or LDA.
Multivariate, non-linear: nearest neighbors, neural nets, trees, SVM.

The CLOP Package CLOP=Challenge Learning Object Package. Based on the Matlab® Spider package developed at the Max Planck Institute. Two basic abstractions: –Data object –Model object Typical script: –D = data(X,Y); % Data constructor –M = kridge; % Model constructor –[R, Mt] = train(M, D); % Train model=>Mt –Dt = data(Xt, Yt); % Test data constructor –Rt = test(Mt, Dt); % Test the model

NIPS 2003 FS challenge

Conclusion. Feature selection focuses on uncovering subsets of variables X1, X2, ... predictive of the target Y. Multivariate feature selection is in principle more powerful than univariate feature selection, but not always in practice. Taking a closer look at the type of dependencies, in terms of causal relationships, may help refine the notion of variable relevance.

Acknowledgements and references. 1) Feature Extraction: Foundations and Applications, I. Guyon et al., Eds., Springer, 2006. 2) Causal feature selection, I. Guyon, C. Aliferis, A. Elisseeff, to appear in "Computational Methods of Feature Selection", Huan Liu and Hiroshi Motoda, Eds., Chapman and Hall/CRC Press.