Feature Selection and Causal Discovery
Isabelle Guyon, Clopinet; André Elisseeff, IBM Zürich; Constantin Aliferis, Vanderbilt University


Road Map
– What is feature selection?
– Why is it hard?
– What works best in practice?
– How to make progress using causality?
– Can causal discovery benefit from feature selection?
[Side labels: Feature selection; Causal discovery]

Introduction

Causal discovery
What affects your health? What affects the economy? What affects climate change? And which actions will have beneficial effects?

Feature Selection
Remove features $X_i$ to improve (or least degrade) prediction of $Y$.

Uncovering Dependencies
[Diagram: factors of variability, split into actual vs. artifactual, known vs. unknown, observable vs. unobservable, controllable vs. uncontrollable.]

Predictions and Actions
[Diagram: X and Y.]
See e.g. Judea Pearl, “Causality”, 2000.

Predictive power of causes and effects
[Diagram: graph over Anxiety, Smoking, Allergy, Lung disease and Coughing.]
Smoking is a better predictor of lung disease than coughing.

“Causal feature selection”
Abandon the usual motto of predictive modeling: “we don’t care about causality”.
Feature selection may benefit from introducing a notion of causality:
– To be able to predict the consequence of given actions.
– To add robustness to the predictions if the input distribution changes.
– To get more compact and robust feature sets.

“FS-enabled causal discovery”
Isn’t causal discovery solved with experiments? No! Randomized Controlled Trials (RCT) may be:
– Unethical (e.g. an RCT about the effects of smoking)
– Costly and time consuming
– Impossible (e.g. astronomy)
Observational data may be available to help plan future experiments. Hence causal discovery may benefit from feature selection.

Feature selection basics

Individual Feature Irrelevance
$P(X_i, Y) = P(X_i)\, P(Y)$
$P(X_i \mid Y) = P(X_i)$
[Figure: density plotted against $x_i$.]

Individual Feature Relevance
[Figure: ROC curve for a single feature, with sensitivity and specificity on axes from 0 to 1 and the area under the curve (AUC) marked.]
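
The AUC named on this slide can be computed for one feature without drawing the curve, by counting how often a positive example receives a larger feature value than a negative one. The following is a minimal sketch using only the standard library; the function name `auc` and the toy data are illustrative assumptions, not part of the original slides.

```python
def auc(scores, labels):
    """Area under the ROC curve via the pairwise rank statistic.

    scores: values of one feature; labels: +1 / -1 class labels.
    A tie between a positive and a negative example counts as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: one informative feature and one constant feature.
labels = [1, 1, 1, -1, -1, -1]
informative = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]
constant = [0.3] * 6

auc_informative = auc(informative, labels)  # 8 of 9 pairs ranked correctly
auc_constant = auc(constant, labels)        # all ties, so 0.5 (chance level)
```

An AUC near 0.5 suggests the feature is individually uninformative; values near 0 or 1 suggest individual relevance.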

Univariate selection may fail
Guyon-Elisseeff, JMLR 2004; Springer 2006.

Multivariate FS is complex
$n$ features, $2^n$ possible feature subsets!
Kohavi-John, 1997.
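
To make the subset count concrete, the sketch below enumerates every subset of a small feature set and counts them. The feature names are hypothetical placeholders.

```python
from itertools import combinations


def all_subsets(features):
    """Yield every subset of `features`, from the empty set to the full set."""
    for size in range(len(features) + 1):
        for subset in combinations(features, size):
            yield subset


features = ["x1", "x2", "x3", "x4"]  # hypothetical feature names
subsets = list(all_subsets(features))
n_subsets = len(subsets)             # equals 2 ** len(features)
```

With 4 features there are 16 subsets; since the count doubles with every added feature, exhaustive search quickly becomes impractical.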

FS strategies
Wrappers:
– Use the target risk functional to evaluate feature subsets.
– Train one learning machine for each feature subset investigated.
Filters:
– Use another evaluation function than the target risk functional.
– Often no learning machine is involved in the feature selection process.

Reducing complexity
For wrappers:
– Use forward or backward selection: $O(n^2)$ steps.
– Mix forward and backward search, e.g. floating search.
For filters:
– Use a cheap evaluation function (no learning machine).
– Make independence assumptions: $n$ evaluations.
Embedded methods:
– Do not retrain the learning machine (LM) at every step: e.g. RFE, $n$ steps.
– Search FS space and LM parameter space simultaneously: e.g. 1-norm/Lasso approaches.
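
The quadratic cost of forward selection can be seen in a small sketch. It assumes a generic `score` callable standing in for "train a learning machine on this subset and estimate its risk"; the toy scoring function below is hypothetical and only serves to make the example runnable.

```python
def forward_selection(n_features, score):
    """Greedy forward selection.

    Returns the order in which features are added and the number of
    calls made to `score` (one call per candidate subset evaluated).
    """
    selected, remaining = [], list(range(n_features))
    evaluations = 0
    while remaining:
        best_feature, best_score = None, float("-inf")
        for f in remaining:
            s = score(selected + [f])
            evaluations += 1
            if s > best_score:
                best_feature, best_score = f, s
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected, evaluations


def toy_score(subset):
    """Hypothetical score: features 2 and 0 help, every feature has a small cost."""
    useful = {2: 1.0, 0: 0.5}
    return sum(useful.get(f, 0.0) for f in subset) - 0.01 * len(subset)


order, n_evals = forward_selection(5, toy_score)
```

For n = 5 the loop makes 5 + 4 + 3 + 2 + 1 = 15 evaluations, i.e. n(n+1)/2, compared with 2^5 = 32 subsets for exhaustive search.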

In practice…
– Univariate feature selection often yields better accuracy than multivariate feature selection.
– Using NO feature selection at all sometimes gives the best accuracy, even in the presence of known distractors.
– Multivariate methods usually claim only better “parsimony”.
– How can we make multivariate FS work better?
NIPS 2003 and WCCI 2006 challenges:

Definition of “irrelevance”
We want to determine whether one variable $X_i$ is “relevant” to the target $Y$.
Surely irrelevant feature: $P(X_i, Y \mid S_{\setminus i}) = P(X_i \mid S_{\setminus i})\, P(Y \mid S_{\setminus i})$ for all $S_{\setminus i} \subseteq X_{\setminus i}$ and for all assignments of values to $S_{\setminus i}$.
Are all non-irrelevant features relevant?

Causality enters the picture

Causal Bayesian networks
Bayesian network:
– Graph with random variables $X_1, X_2, \ldots, X_n$ as nodes.
– Dependencies represented by edges.
– Allows us to compute $P(X_1, X_2, \ldots, X_n)$ as $\prod_i P(X_i \mid \mathrm{Parents}(X_i))$.
– Edge directions have no causal meaning.
Causal Bayesian network: edge directions indicate causality.
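
The factorization of the joint distribution over parents can be sketched for a small binary network. The structure and all probability values below are hypothetical, chosen only for illustration; they are not taken from the slides.

```python
from itertools import product

# Hypothetical structure: each node maps to the list of its parents.
parents = {
    "smoking": [],
    "allergy": [],
    "lung_disease": ["smoking"],
    "coughing": ["lung_disease", "allergy"],
}

# cpt[node][parent values] = P(node = 1 | parents); numbers are made up.
cpt = {
    "smoking": {(): 0.3},
    "allergy": {(): 0.2},
    "lung_disease": {(0,): 0.05, (1,): 0.4},
    "coughing": {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.7, (1, 1): 0.9},
}


def joint(assignment):
    """P(X_1, ..., X_n) as the product of P(X_i | Parents(X_i))."""
    p = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(assignment[q] for q in pa)]
        p *= p_one if assignment[node] == 1 else 1.0 - p_one
    return p


nodes = list(parents)
# Summing the factorized joint over all assignments should give 1.
total = sum(
    joint(dict(zip(nodes, values)))
    for values in product([0, 1], repeat=len(nodes))
)
example = joint({"smoking": 1, "allergy": 0, "lung_disease": 1, "coughing": 1})
```

Here `example` is 0.3 × 0.8 × 0.4 × 0.7, one factor per node given its parents.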

Markov blanket
[Diagram: graph over Anxiety, Smoking, Allergy, Lung disease and Coughing.]
A node is conditionally independent of all other nodes given its Markov blanket.
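
In a directed acyclic graph, the Markov blanket of a node consists of its parents, its children, and the other parents of its children. The sketch below computes it from a parent list. The edges used here (Anxiety → Smoking → Lung disease → Coughing ← Allergy) are an assumption made for the example; the transcript lists only the node names.

```python
def markov_blanket(node, parents):
    """Parents, children and children's other parents of `node` in a DAG."""
    children = [c for c, pa in parents.items() if node in pa]
    blanket = set(parents[node]) | set(children)
    for child in children:
        blanket |= set(parents[child])  # spouses: co-parents of each child
    blanket.discard(node)
    return blanket


# Assumed edges, for illustration only.
parents = {
    "anxiety": [],
    "smoking": ["anxiety"],
    "allergy": [],
    "lung_disease": ["smoking"],
    "coughing": ["lung_disease", "allergy"],
}

blanket = markov_blanket("lung_disease", parents)
```

Under these assumed edges, the blanket of `lung_disease` contains `smoking` (parent), `coughing` (child) and `allergy` (co-parent of the child), but not `anxiety`.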

Relevance revisited
In terms of Bayesian networks in “faithful” distributions:
– Strongly relevant features = members of the Markov blanket.
– Weakly relevant features = variables with a path to the Markov blanket but not in the Markov blanket.
– Irrelevant features = variables with no path to the Markov blanket.
Koller-Sahami, 1996; Kohavi-John, 1997; Aliferis et al., 2002.

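Example 1: Is $X_2$ “relevant”?
Variables: baseline ($X_2$), peak ($X_1$), health ($Y$: normal or disease).
$P(X_1, X_2, Y) = P(X_1 \mid X_2, Y)\, P(X_2)\, P(Y)$
$X_2 \perp Y$, but $X_2 \not\perp Y \mid X_1$.
[Figure: scatter plot over peak ($x_1$) and baseline ($x_2$) for normal and disease samples.]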

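Example 2: Are $X_1$ and $X_2$ “relevant”?
Variables: time ($X_2$, sample processing time), peak ($X_1$), health ($Y$: normal or disease).
$P(X_1, X_2, Y) = P(X_1 \mid X_2, Y)\, P(X_2)\, P(Y)$
$X_1 \perp Y$, $X_2 \perp Y$, $X_1 \perp X_2$
[Figure: scatter plot over peak and sample processing time for normal and disease samples.]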

XOR and unfaithfulness
$Y = X_1 \oplus X_2$
$X_1 \perp Y$, $X_2 \perp Y$, $X_1 \perp X_2$
Example: $X_1$ and $X_2$ are two fair coins tossed at random; $Y$: win if both coins end on the same side.
[Diagrams: candidate graphs over $X_1$, $X_2$ and $Y$.]
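
The XOR case can be checked by exact enumeration: each pair of variables is independent, yet Y is fully determined by X1 and X2 together. This is a minimal sketch using exact fractions; the helper names are illustrative. It uses Y = X1 XOR X2; the "same side" coin game is the complement of this and has the same independence structure.

```python
from fractions import Fraction
from itertools import product

# Four equally likely outcomes (x1, x2, y) with y = x1 XOR x2.
outcomes = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]
weight = Fraction(1, 4)


def prob(event):
    """Probability of the set of outcomes for which `event` is true."""
    return sum((weight for o in outcomes if event(o)), Fraction(0))


def independent(i, j):
    """Check P(A, B) = P(A) P(B) for every pair of values of variables i and j."""
    return all(
        prob(lambda o: o[i] == a and o[j] == b)
        == prob(lambda o: o[i] == a) * prob(lambda o: o[j] == b)
        for a, b in product([0, 1], repeat=2)
    )


# Pairs (X1, Y), (X2, Y), (X1, X2): all pairwise independent.
pairwise = [independent(0, 2), independent(1, 2), independent(0, 1)]

# Jointly, X1 and X2 determine Y: P(Y=1 | X1=0, X2=0) differs from P(Y=1).
p_y1 = prob(lambda o: o[2] == 1)
p_y1_given_00 = prob(lambda o: o == (0, 0, 1)) / prob(lambda o: o[0] == 0 and o[1] == 0)
```

A univariate filter would discard both X1 and X2 here, which is one way to see why univariate selection may fail.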

Example 3: Adding a variable can make another one irrelevant
Simpson’s paradox: $X_1 \perp Y \mid X_2$.
[Figure: scatter plots of $y$ against $x_1$, without and with the variable $X_2$.]

Is chocolate good for your health?
[Figure: life expectancy ($y$) against chocolate intake ($x_1$), with groups $X_2$ = gender (male, female).]
$X_1 \perp Y \mid X_2$
Conclusion: no evidence that eating chocolate makes you live longer.
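
A small exact calculation illustrates how a dependence between X1 and Y can vanish once X2 is taken into account. The sketch assumes a binary version of the example in which X2 influences both X1 and Y; all probabilities are hypothetical.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical parameters: X2 drives both X1 and Y.
p_x2 = {0: F(1, 2), 1: F(1, 2)}
p_x1_given_x2 = {0: F(1, 5), 1: F(4, 5)}  # P(X1 = 1 | X2)
p_y_given_x2 = {0: F(1, 5), 1: F(4, 5)}   # P(Y = 1 | X2)


def bern(p_one, value):
    """Probability of `value` for a binary variable with P(1) = p_one."""
    return p_one if value == 1 else 1 - p_one


# Joint over (x1, x2, y) built as P(X2) P(X1 | X2) P(Y | X2).
joint = {
    (x1, x2, y): p_x2[x2] * bern(p_x1_given_x2[x2], x1) * bern(p_y_given_x2[x2], y)
    for x1, x2, y in product([0, 1], repeat=3)
}


def p_y1(x1, x2=None):
    """P(Y = 1 | X1 = x1), or P(Y = 1 | X1 = x1, X2 = x2) if x2 is given."""
    keep = [k for k in joint if k[0] == x1 and (x2 is None or k[1] == x2)]
    return sum(joint[k] for k in keep if k[2] == 1) / sum(joint[k] for k in keep)


marginal = (p_y1(0), p_y1(1))                               # differs: X1 looks predictive
within = {x2: (p_y1(0, x2), p_y1(1, x2)) for x2 in (0, 1)}  # equal within each X2 group
```

Ignoring X2, P(Y = 1) rises from 8/25 to 17/25 as X1 goes from 0 to 1; within each X2 group it does not change at all.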

Is chocolate good for your health?
[Figure: life expectancy ($y$) against chocolate intake ($x_1$), with groups $X_2$ = mood (depressed, happy).]
Conclusion: eating chocolate may make you live longer! Really?

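Same independence relations, different causal relations
In all three cases $X_1 \perp Y \mid X_2$:
– $X_1 \leftarrow X_2 \rightarrow Y$: $P(X_1, X_2, Y) = P(X_1 \mid X_2)\, P(Y \mid X_2)\, P(X_2)$
– $X_1 \rightarrow X_2 \rightarrow Y$: $P(X_1, X_2, Y) = P(Y \mid X_2)\, P(X_2 \mid X_1)\, P(X_1)$
– $X_1 \leftarrow X_2 \leftarrow Y$: $P(X_1, X_2, Y) = P(X_1 \mid X_2)\, P(X_2 \mid Y)\, P(Y)$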
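
That the three factorizations describe the same distributions can be checked numerically: a joint built from one of them is reproduced exactly by the other two. The sketch below starts from the chain P(Y | X2) P(X2 | X1) P(X1) with hypothetical parameters and refactors it.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical parameters for the chain X1 -> X2 -> Y.
p_x1 = F(3, 10)                            # P(X1 = 1)
p_x2_given_x1 = {0: F(1, 5), 1: F(7, 10)}  # P(X2 = 1 | X1)
p_y_given_x2 = {0: F(1, 10), 1: F(3, 5)}   # P(Y = 1 | X2)


def bern(p_one, value):
    """Probability of `value` for a binary variable with P(1) = p_one."""
    return p_one if value == 1 else 1 - p_one


states = list(product([0, 1], repeat=3))   # (x1, x2, y)
joint = {
    (x1, x2, y): bern(p_x1, x1) * bern(p_x2_given_x1[x1], x2) * bern(p_y_given_x2[x2], y)
    for x1, x2, y in states
}
INDEX = {"x1": 0, "x2": 1, "y": 2}


def marg(**fixed):
    """Marginal probability of the given variable values, e.g. marg(x2=1, y=0)."""
    return sum(p for s, p in joint.items() if all(s[INDEX[k]] == v for k, v in fixed.items()))


def fork(x1, x2, y):
    """P(X1 | X2) P(Y | X2) P(X2), with each term computed from the joint."""
    return (marg(x1=x1, x2=x2) / marg(x2=x2)) * (marg(x2=x2, y=y) / marg(x2=x2)) * marg(x2=x2)


def reverse_chain(x1, x2, y):
    """P(X1 | X2) P(X2 | Y) P(Y), with each term computed from the joint."""
    return (marg(x1=x1, x2=x2) / marg(x2=x2)) * (marg(x2=x2, y=y) / marg(y=y)) * marg(y=y)


fork_matches = all(fork(*s) == joint[s] for s in states)
reverse_matches = all(reverse_chain(*s) == joint[s] for s in states)
```

Both checks succeed, so observational data alone cannot distinguish these three causal structures.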

Example 3: Is $X_1$ “relevant”?
$X_1 \perp Y \mid X_2$
[Diagrams: chocolate intake ($X_1$), life expectancy ($Y$) and gender ($X_2$); chocolate intake ($X_1$), life expectancy ($Y$) and mood ($X_2$).]

Non-causal features may be predictive yet not “relevant”
– Example 1: baseline ($X_2$), peak ($X_1$), health ($Y$).
– Example 2: time ($X_2$), peak ($X_1$), health ($Y$).
– Example 3: chocolate intake ($X_1$), life expectancy ($Y$), with gender or mood as $X_2$.

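Causal feature discovery
– $P(X, Y) = P(X \mid Y)\, P(Y)$ [Diagram over $Y$, $X_1$, $X_2$; scatter plot of $x_2$ against $x_1$.]
– $P(X, Y) = P(Y \mid X)\, P(X)$ [Diagram over $Y$, $X_1$, $X_2$; scatter plot of $x_2$ against $x_1$.]
Sun-Janzing-Schoelkopf, 2005.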

Conclusion
– Feature selection focuses on uncovering subsets of variables $X_1, X_2, \ldots$ predictive of the target $Y$.
– Taking a closer look at the type of dependencies may help refine the notion of variable relevance.
– Uncovering causal relationships may yield better feature selection, robust under distribution changes.
– These “causal features” may be better targets of action.