Feature Selection and Causal Discovery. Isabelle Guyon, Clopinet; André Elisseeff, IBM Zürich; Constantin Aliferis, Vanderbilt University
Road Map: What is feature selection? Why is it hard? What works best in practice? How to make progress using causality? Can causal discovery benefit from feature selection? [Diagram: feature selection and causal discovery.]
Introduction
Causal discovery: What affects your health? What affects the economy? What affects climate change? And… which actions will have beneficial effects?
Feature Selection: Remove features X_i to improve the prediction of Y (or to degrade it as little as possible). [Diagram: X and Y.]
Uncovering Dependencies: [Diagram: factors of variability, labeled artifactual / actual, unknown / known, uncontrollable / controllable, observable / unobservable.]
Predictions and Actions: [Diagram: X and Y.] See e.g. Judea Pearl, “Causality”, 2000
Predictive power of causes and effects: [Causal graph over the variables Lung disease, Coughing, Allergy, Smoking, Anxiety.] Smoking is a better predictor of lung disease than coughing.
“Causal feature selection” Abandon the usual motto of predictive modeling: “we don’t care about causality”. Feature selection may benefit from introducing a notion of causality: –To be able to predict the consequence of given actions. –To add robustness to the predictions if the input distribution changes. –To get more compact and robust feature sets.
“FS-enabled causal discovery” Isn’t causal discovery solved with experiments? No! Randomized Controlled Trials (RCT) may be: –Unethical (e.g. an RCT about the effects of smoking) –Costly and time-consuming –Impossible (e.g. astronomy) Observational data may be available to help plan future experiments. Causal discovery may benefit from feature selection.
Feature selection basics
Individual Feature Irrelevance: P(X_i, Y) = P(X_i) P(Y), or equivalently P(X_i | Y) = P(X_i). [Plot: density versus x_i.]
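The irrelevance condition above can be checked empirically on discrete data by measuring how far the joint distribution is from the product of its marginals. The sketch below uses mutual information for that purpose, which is zero exactly when P(X_i, Y) = P(X_i) P(Y). The toy arrays and the function name `mutual_information` are illustrative assumptions, not part of the slides.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi

# Hypothetical toy data: x_irrelevant is balanced within each class of y,
# while x_relevant is a copy of y.
y            = [0, 0, 0, 0, 1, 1, 1, 1]
x_irrelevant = [0, 1, 0, 1, 0, 1, 0, 1]
x_relevant   = [0, 0, 0, 0, 1, 1, 1, 1]

mi_irrelevant = mutual_information(x_irrelevant, y)  # joint equals product of marginals
mi_relevant = mutual_information(x_relevant, y)      # X fully determines Y
```

On finite samples the estimate is rarely exactly zero for an irrelevant feature, so in practice a threshold or a statistical test would be needed.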
Individual Feature Relevance: [Plot: ROC curve, sensitivity versus specificity, both axes from 0 to 1, with the area under the curve (AUC) marked.]
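One way to turn the ROC picture into a univariate ranking score is to use a single feature as a classifier score and compute its AUC. The sketch below relies on the standard equivalence between the AUC and the probability that a random positive example scores above a random negative one; the data and the function name `auc` are illustrative assumptions.

```python
def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: three negatives followed by three positives.
labels    = [0, 0, 0, 1, 1, 1]
feature_a = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]  # separates the classes perfectly
feature_b = [0.1, 0.8, 0.3, 0.2, 0.9, 0.7]  # partially informative
feature_c = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]  # constant, carries no information

auc_a = auc(feature_a, labels)
auc_b = auc(feature_b, labels)
auc_c = auc(feature_c, labels)
```

An AUC near 0.5 suggests an individually irrelevant feature; values near 0 or 1 both indicate a useful feature, since the sign of the score can be flipped.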
Univariate selection may fail Guyon-Elisseeff, JMLR 2004; Springer 2006
Multivariate FS is complex: n features, 2^n possible feature subsets! Kohavi-John, 1997
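To make the 2^n count concrete, the following sketch enumerates every subset of a small, hypothetical feature list. It is only meant to show why exhaustive search stops being feasible quickly; the feature names are placeholders.

```python
from itertools import combinations

def all_subsets(features):
    """Yield every subset of the feature list, from the empty set to the full set."""
    for k in range(len(features) + 1):
        for subset in combinations(features, k):
            yield subset

features = ["x1", "x2", "x3", "x4"]  # hypothetical feature names
subsets = list(all_subsets(features))
n_subsets = len(subsets)  # equals 2 ** len(features)
```

Each additional feature doubles the number of subsets, so already at n = 30 there are more than a billion candidates.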
FS strategies Wrappers: –Use the target risk functional to evaluate feature subsets. –Train one learning machine for each feature subset investigated. Filters: –Use an evaluation function other than the target risk functional. –Often no learning machine is involved in the feature selection process.
Reducing complexity For wrappers: –Use forward or backward selection: O(n^2) steps. –Mix forward and backward search, e.g. floating search. For filters: –Use a cheap evaluation function (no learning machine). –Make independence assumptions: n evaluations. Embedded methods: –Do not retrain the learning machine (LM) at every step: e.g. RFE, n steps. –Search FS space and LM parameter space simultaneously: e.g. 1-norm/Lasso approaches.
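The O(n^2) figure for forward selection can be seen in a minimal sketch of the greedy loop: step k evaluates each of the remaining n - k + 1 features, for n(n + 1)/2 evaluations in total. The scoring function `toy_score` below is a hypothetical stand-in for a wrapper's risk estimate (in a real wrapper it would train and validate a learning machine); it is not taken from the slides.

```python
def forward_selection(features, score):
    """Greedy forward selection; returns the selection order and the number of subset evaluations."""
    selected, remaining = [], list(features)
    n_evaluations = 0
    while remaining:
        best_feature, best_score = None, float("-inf")
        for f in remaining:
            s = score(selected + [f])  # one subset evaluation per candidate
            n_evaluations += 1
            if s > best_score:
                best_feature, best_score = f, s
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected, n_evaluations

# Hypothetical additive usefulness values, used only to make the example runnable.
usefulness = {"a": 0.1, "b": 0.5, "c": 0.3, "d": 0.0}

def toy_score(subset):
    return sum(usefulness[f] for f in subset)

order, n_evaluations = forward_selection(list(usefulness), toy_score)
```

With n = 4 the loop performs 4 + 3 + 2 + 1 = 10 evaluations, compared with 16 subsets for exhaustive search; the gap widens rapidly as n grows.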
In practice… Univariate feature selection often yields better accuracy than multivariate feature selection. NO feature selection at all sometimes gives the best accuracy, even in the presence of known distractors. Multivariate methods usually claim only better “parsimony”. How can we make multivariate FS work better? (NIPS 2003 and WCCI 2006 challenges.)
Definition of “irrelevance”: We want to determine whether one variable X_i is “relevant” to the target Y. Let X_\i denote the set of all features except X_i. Surely irrelevant feature: P(X_i, Y | S_\i) = P(X_i | S_\i) P(Y | S_\i) for all S_\i ⊆ X_\i and for all assignments of values to S_\i. Are all non-irrelevant features relevant?
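For a small discrete distribution given as an explicit table, the definition of a surely irrelevant feature can be checked by brute force over every conditioning subset. The sketch below assumes the joint is a dictionary from value tuples to probabilities; the toy distribution (Y copies X_0, X_1 is an independent coin) and all function names are illustrative assumptions.

```python
from collections import defaultdict
from itertools import combinations, product

def marginal(joint, keep):
    """Marginal distribution over the variable positions listed in `keep`."""
    out = defaultdict(float)
    for values, p in joint.items():
        out[tuple(values[k] for k in keep)] += p
    return out

def cond_independent(joint, i, y, s, tol=1e-12):
    """Check P(Xi, Y | S) = P(Xi | S) P(Y | S) for every assignment of S with P(S) > 0."""
    p_iys, p_is = marginal(joint, (i, y) + s), marginal(joint, (i,) + s)
    p_ys, p_s = marginal(joint, (y,) + s), marginal(joint, s)
    xi_values = {k[0] for k in p_is}
    y_values = {k[0] for k in p_ys}
    for sv, ps in p_s.items():
        if ps <= 0:
            continue
        for a in xi_values:
            for b in y_values:
                # Same identity multiplied through by P(S)^2 to avoid divisions.
                lhs = p_iys.get((a, b) + sv, 0.0) * ps
                rhs = p_is.get((a,) + sv, 0.0) * p_ys.get((b,) + sv, 0.0)
                if abs(lhs - rhs) > tol:
                    return False
    return True

def surely_irrelevant(joint, i, y):
    """True if Xi is independent of Y given every subset of the other features."""
    n = len(next(iter(joint)))
    others = [k for k in range(n) if k not in (i, y)]
    return all(cond_independent(joint, i, y, s)
               for r in range(len(others) + 1)
               for s in combinations(others, r))

# Hypothetical joint over (X0, X1, Y): Y copies X0, X1 is an unrelated fair coin.
joint = {(x0, x1, x0): 0.25 for x0, x1 in product((0, 1), repeat=2)}
x1_irrelevant = surely_irrelevant(joint, 1, 2)
x0_irrelevant = surely_irrelevant(joint, 0, 2)
```

The loop over all subsets is itself exponential in the number of features, which is one reason practical methods rely on structural assumptions instead of testing the definition directly.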
Causality enters the picture
Causal Bayesian networks Bayesian network: –Graph with random variables X_1, X_2, … X_n as nodes. –Dependencies represented by edges. –Allow us to compute P(X_1, X_2, … X_n) as ∏_i P(X_i | Parents(X_i)). –Edge directions carry no causal meaning. Causal Bayesian network: edge directions indicate causality.
Markov blanket: [Graph over the variables Lung disease, Coughing, Allergy, Smoking, Anxiety.] A node is conditionally independent of all other nodes given its Markov blanket.
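The product formula for a Bayesian network can be illustrated with a tiny hand-built example. The three-node chain and every probability below are hypothetical values chosen only so the sketch runs; they are not taken from the slides.

```python
from itertools import product

# Hypothetical chain smoking -> lung_disease -> coughing (binary variables).
parents = {
    "smoking": (),
    "lung_disease": ("smoking",),
    "coughing": ("lung_disease",),
}
# cpt[node][parent values] = P(node = 1 | parents); the numbers are made up.
cpt = {
    "smoking": {(): 0.3},
    "lung_disease": {(0,): 0.05, (1,): 0.25},
    "coughing": {(0,): 0.1, (1,): 0.8},
}

def joint_probability(assignment):
    """P(X_1, ..., X_n) as the product over nodes of P(X_i | Parents(X_i))."""
    p = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(assignment[q] for q in pa)]
        p *= p_one if assignment[node] == 1 else 1.0 - p_one
    return p

nodes = list(parents)
# Summing the factorized joint over all assignments should give 1.
total = sum(joint_probability(dict(zip(nodes, values)))
            for values in product((0, 1), repeat=len(nodes)))
p_all_ones = joint_probability({"smoking": 1, "lung_disease": 1, "coughing": 1})
```

The factorization needs only 1 + 2 + 2 = 5 numbers here instead of the 7 free parameters of an unrestricted joint over three binary variables.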
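In a Bayesian network, the Markov blanket of a node consists of its parents, its children, and the other parents of its children (spouses). The sketch below computes it from a parent list. The edges are an assumption made for illustration (anxiety → smoking → lung disease → coughing ← allergy), since the graph itself is not reproduced in this text.

```python
def markov_blanket(parents, node):
    """Parents, children, and spouses (other parents of the children) of `node`."""
    children = [c for c, pa in parents.items() if node in pa]
    blanket = set(parents[node]) | set(children)
    for c in children:
        blanket |= set(parents[c])  # add the spouses
    blanket.discard(node)
    return blanket

# Assumed DAG, given as node -> list of parents (hypothetical edge set).
parents = {
    "anxiety": [],
    "smoking": ["anxiety"],
    "allergy": [],
    "lung_disease": ["smoking"],
    "coughing": ["lung_disease", "allergy"],
}

mb = markov_blanket(parents, "lung_disease")
```

Under these assumed edges, allergy enters the blanket only as a spouse: it shares the child coughing with lung disease, while anxiety is screened off by smoking.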
Relevance revisited: In terms of Bayesian networks in “faithful” distributions: Strongly relevant features = members of the Markov blanket. Weakly relevant features = variables with a path to the Markov blanket but not in the Markov blanket. Irrelevant features = variables with no path to the Markov blanket. Koller-Sahami, 1996; Kohavi-John, 1997; Aliferis et al., 2002.
Is X_2 “relevant”? Example 1: baseline (X_2) → peak (X_1) ← health (Y), with P(X_1, X_2, Y) = P(X_1 | X_2, Y) P(X_2) P(Y). X_2 ⊥ Y, but X_2 and Y are dependent given X_1. [Plot: peak (x_1) versus baseline (x_2) for Y = normal and Y = disease.]
Are X_1 and X_2 “relevant”? Example 2: time (X_2) → peak (X_1) ← health (Y), with P(X_1, X_2, Y) = P(X_1 | X_2, Y) P(X_2) P(Y). X_1 ⊥ Y, X_2 ⊥ Y, X_1 ⊥ X_2. [Plot: peak versus sample processing time for Y = normal and Y = disease.]
XOR and unfaithfulness: Y = X_1 ⊕ X_2 (up to relabeling of the outcomes). Example: X_1 and X_2 are two fair coins tossed at random; Y: win if both coins land on the same side. Pairwise, X_1 ⊥ Y, X_2 ⊥ Y, X_1 ⊥ X_2. [Diagrams: candidate graphs over X_1, X_2, Y.]
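The two-coin example is small enough to verify by exact enumeration. The sketch below builds the four equally likely outcomes, checks all three pairwise independences, and then shows that Y is nonetheless fully determined by X_1 and X_2 together. The helper `prob` and the 0/1 coding of coin sides are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

NAMES = ("x1", "x2", "y")

# Four equally likely outcomes of two fair coins; y = 1 means "win" (same side).
outcomes = {}
for x1, x2 in product((0, 1), repeat=2):
    y = int(x1 == x2)
    outcomes[(x1, x2, y)] = Fraction(1, 4)

def prob(**fixed):
    """Exact probability that the named variables take the given values."""
    return sum(p for values, p in outcomes.items()
               if all(values[NAMES.index(k)] == v for k, v in fixed.items()))

# Every pair of variables satisfies P(a, b) = P(a) P(b).
pairwise_independent = all(
    prob(**{a: u, b: v}) == prob(**{a: u}) * prob(**{b: v})
    for a, b in (("x1", "y"), ("x2", "y"), ("x1", "x2"))
    for u in (0, 1)
    for v in (0, 1)
)

# Yet Y is a deterministic function of the pair (X1, X2).
p_win_given_both_heads = prob(x1=1, x2=1, y=1) / prob(x1=1, x2=1)
p_win = prob(y=1)
```

Any method that scores features one at a time would discard both coins here, even though together they predict Y perfectly.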
Adding a variable… can make another one irrelevant. Simpson’s paradox: X_1 ⊥ Y | X_2. [Plots: y versus x_1, shown without and with X_2.] (Example 3)
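A small synthetic dataset makes the effect concrete: x_1 and y are clearly correlated when the data are pooled, but the correlation vanishes inside each group defined by X_2. The data below are fabricated purely for illustration, and `pearson` is a plain implementation of the usual correlation coefficient.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

# Hypothetical data as (x1, y, x2) triples. Within each group x2, all four
# combinations of x1 and y occur once, so x1 and y are independent given x2.
data = [(x1, y, group)
        for group, base in ((0, 1), (1, 3))
        for x1 in (base, base + 1)
        for y in (base, base + 1)]

pooled = pearson([d[0] for d in data], [d[1] for d in data])
within = {
    group: pearson([d[0] for d in data if d[2] == group],
                   [d[1] for d in data if d[2] == group])
    for group in (0, 1)
}
```

Here the pooled correlation comes entirely from the shift between the two groups, which is what conditioning on X_2 removes.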
Is chocolate good for your health? [Plot: life expectancy (y) versus chocolate intake (x_1), with points grouped by X_2 = gender (male, female).] X_1 ⊥ Y | X_2 … conclusion: no evidence that eating chocolate makes you live longer. (Example 3)
Is chocolate good for your health? [Plot: life expectancy (y) versus chocolate intake (x_1), with points grouped by X_2 = mood (depressed, happy).] … conclusion: eating chocolate may make you live longer! Really? (Example 3)
Same independence relations, different causal relations: X_1 ⊥ Y | X_2 holds for X_1 ← X_2 → Y, with P(X_1, X_2, Y) = P(X_1 | X_2) P(Y | X_2) P(X_2); for X_1 → X_2 → Y, with P(X_1, X_2, Y) = P(Y | X_2) P(X_2 | X_1) P(X_1); and for X_1 ← X_2 ← Y, with P(X_1, X_2, Y) = P(X_1 | X_2) P(X_2 | Y) P(Y).
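That the three factorizations describe the same distribution can be verified numerically. The sketch below starts from a fork X_1 ← X_2 → Y with made-up parameters, derives the conditional tables needed by the two chain factorizations from the resulting joint, and rebuilds the joint from each. All numbers are hypothetical; exact fractions are used so the comparison involves no rounding.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical fork parameters: P(X2 = v), P(X1 = 1 | X2), P(Y = 1 | X2).
p_x2 = {0: F(1, 2), 1: F(1, 2)}
p_x1_given_x2 = {0: F(1, 5), 1: F(4, 5)}
p_y_given_x2 = {0: F(1, 10), 1: F(7, 10)}

def bern(p_one, v):
    """Probability of value v for a binary variable with P(1) = p_one."""
    return p_one if v == 1 else 1 - p_one

triples = list(product((0, 1), repeat=3))  # tuples ordered as (x1, x2, y)

# Fork: P(X1 | X2) P(Y | X2) P(X2)
joint = {(x1, x2, y): bern(p_x1_given_x2[x2], x1) * bern(p_y_given_x2[x2], y) * p_x2[x2]
         for x1, x2, y in triples}

def marg(keep):
    """Marginal of the fork joint over the positions in `keep`."""
    out = {}
    for values, p in joint.items():
        key = tuple(values[k] for k in keep)
        out[key] = out.get(key, 0) + p
    return out

p1, p2, py = marg((0,)), marg((1,)), marg((2,))
p12, p2y = marg((0, 1)), marg((1, 2))

# Chain X1 -> X2 -> Y: P(Y | X2) P(X2 | X1) P(X1)
chain_forward = {(x1, x2, y): (p2y[(x2, y)] / p2[(x2,)]) * (p12[(x1, x2)] / p1[(x1,)]) * p1[(x1,)]
                 for x1, x2, y in triples}

# Chain Y -> X2 -> X1: P(X1 | X2) P(X2 | Y) P(Y)
chain_backward = {(x1, x2, y): (p12[(x1, x2)] / p2[(x2,)]) * (p2y[(x2, y)] / py[(y,)]) * py[(y,)]
                  for x1, x2, y in triples}
```

Since all three tables coincide, observational data drawn from this distribution cannot by itself tell the three causal structures apart.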
Is X_1 “relevant”? Example 3: X_1 ⊥ Y | X_2. [Two graphs: chocolate intake (X_1), life expectancy (Y), gender (X_2); and chocolate intake (X_1), life expectancy (Y), mood (X_2).]
Non-causal features may be predictive yet not “relevant”. [Recap of the graphs from Examples 1 to 3: (1) baseline (X_2), health (Y), peak (X_1); (2) time (X_2), health (Y), peak (X_1); (3) chocolate intake (X_1), life expectancy (Y), with gender (X_2) or mood (X_2).]
Causal feature discovery: P(X, Y) = P(X | Y) P(Y) versus P(X, Y) = P(Y | X) P(X). [Two panels, each a plot with axes x_1 and x_2 alongside a graph over Y, X_1, X_2.] Sun-Janzing-Schoelkopf, 2005
Conclusion: Feature selection focuses on uncovering subsets of variables X_1, X_2, … predictive of the target Y. Taking a closer look at the type of dependencies may help refine the notion of variable relevance. Uncovering causal relationships may yield better feature selection that is robust under distribution changes. These “causal features” may be better targets of action.