The bumpy road of the search for a (good) cause. Isabelle Guyon, Dominik Janzing, Bernhard Schölkopf
We know it’s important … What affects your health? … climate changes? … the economy? … and which actions will have beneficial effects? … we use it all the time! We are constantly facing problems of cause-effect relationships: what affects our health, the economy, climate changes, and which actions will have beneficial effects.
But we can’t even define it! Many definitions: Science, Philosophy, Law, Psychology, History, Religion, Engineering. There is no single definition of causality encompassing all the notions it refers to in these fields. One of my favorite definitions comes from Hindu philosophy: “Cause is the effect concealed, effect is the cause revealed.” It indicates well that there is no effect without a cause and vice versa.
Systemic causality. In engineering, there is a pretty well formalized notion of causality. [Figure labels: the causal system; the external agent.] The agent could learn!
Difficulties: Variability; Confounding factors; Sample bias; Attrition bias. In a perfect world in which we could observe and control everything, a single experiment would suffice to determine a cause-effect relationship. But the world is not perfect. There is a lot of variability we cannot control.
To deal with variability… … we need experimental design
Success stories 1. Vitamin C and scurvy, Lind 1750’s: A historical RCT. 2. Hygiene and infectious diseases, Semmelweis 1840’s; Pasteur 1860’s: Can you believe what you can’t see? 3. Planned experiments in agriculture, Fisher 1930’s: Mathematical foundations of experimental design. 4. Smoking and lung cancer: A long-lasting debate, but better err on the safe side! 5. NSAIDs, Aspirin, Phenacetin, Paracetamol, Vioxx: Drug efficacy vs. drug toxicity.
More difficulties. A lot of “observational” data. Statistical dependencies: Correlation ≠ Causality! Experiments are often needed, but they can be: Costly; Unethical; Infeasible.
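The gap between correlation and causality can be illustrated with a small simulation. This is only a sketch under assumed settings: the hidden common cause `z`, the noise levels, and the helper names `pearson` and `simulate` are all hypothetical and not taken from the slides. Here `y` never depends on `x`, yet the two are correlated in observational data because both depend on `z`; when `x` is assigned at random, as in an experiment, the correlation vanishes.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def simulate(n, randomize_x, seed=0):
    """Correlation between x and y when y is caused by z only (assumed toy model)."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)                  # hidden common cause (confounder)
        if randomize_x:
            x = rng.gauss(0, 1)              # experiment: x is set independently of z
        else:
            x = z + rng.gauss(0, 1)          # observation: x is driven by z
        y = z + rng.gauss(0, 1)              # y depends on z, never on x
        xs.append(x)
        ys.append(y)
    return pearson(xs, ys)

r_observational = simulate(5000, randomize_x=False)
r_experimental = simulate(5000, randomize_x=True)
print(f"observational r = {r_observational:.2f}, experimental r = {r_experimental:.2f}")
```

In this toy model the observational correlation is substantial even though manipulating `x` would have no effect on `y`, which is why experiments are wanted whenever they are affordable, ethical, and feasible.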
Learning from observational data Can we do it?
Not your usual ML problem! Non i.i.d. data: Training set: “natural” distribution. Test set: “manipulated” distribution. No cross-validation for model selection.
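A toy example may help show why held-out data from the natural distribution cannot be trusted for model selection. The sketch below assumes a hypothetical three-variable chain, cause → target → effect, with made-up noise levels; the helper names `fit_line`, `mse`, and `sample` are illustrative only. One model predicts the target from its cause, the other from its effect. Both are trained on natural data and then tested on data in which the effect variable has been manipulated by an external agent.

```python
import random

def fit_line(inputs, outputs):
    """Least-squares slope and intercept for a single input variable."""
    n = len(inputs)
    mean_in, mean_out = sum(inputs) / n, sum(outputs) / n
    cov = sum((u - mean_in) * (v - mean_out) for u, v in zip(inputs, outputs))
    var = sum((u - mean_in) ** 2 for u in inputs)
    slope = cov / var
    return slope, mean_out - slope * mean_in

def mse(model, inputs, outputs):
    """Mean squared prediction error of a fitted line."""
    slope, intercept = model
    return sum((slope * u + intercept - v) ** 2
               for u, v in zip(inputs, outputs)) / len(inputs)

def sample(n, rng, manipulate_effect):
    """Draw (cause, target, effect) from the assumed chain cause -> target -> effect."""
    cause, target, effect = [], [], []
    for _ in range(n):
        c = rng.gauss(0, 1)
        t = c + rng.gauss(0, 0.5)
        # Under manipulation the effect is set by the agent, ignoring the target.
        e = rng.gauss(0, 1) if manipulate_effect else t + rng.gauss(0, 0.3)
        cause.append(c)
        target.append(t)
        effect.append(e)
    return cause, target, effect

rng = random.Random(0)
train = sample(5000, rng, manipulate_effect=False)
test_natural = sample(5000, rng, manipulate_effect=False)
test_manipulated = sample(5000, rng, manipulate_effect=True)

cause_model = fit_line(train[0], train[1])    # predict target from its cause
effect_model = fit_line(train[2], train[1])   # predict target from its effect

mse_cause_natural = mse(cause_model, test_natural[0], test_natural[1])
mse_effect_natural = mse(effect_model, test_natural[2], test_natural[1])
mse_cause_manipulated = mse(cause_model, test_manipulated[0], test_manipulated[1])
mse_effect_manipulated = mse(effect_model, test_manipulated[2], test_manipulated[1])

print("natural test:     cause", round(mse_cause_natural, 3),
      "effect", round(mse_effect_natural, 3))
print("manipulated test: cause", round(mse_cause_manipulated, 3),
      "effect", round(mse_effect_manipulated, 3))
```

With these assumed noise levels, validation on natural data prefers the effect-based model, but that model breaks down once the effect is manipulated, while the cause-based model keeps its accuracy.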
The good old DAG. [Figure: causal graph over the variables Lung Cancer, Smoking, Genetics, Coughing, Attention Disorder, Allergy, Anxiety, Peer Pressure, Yellow Fingers, Car Accident, Born an Even Day, Fatigue.] Wright, 1921; Haavelmo, 1943; Dawid, Spiegelhalter, Lauritzen, Speed, Cox, Wermuth, Pearl, Spirtes, Glymour, Scheines, Cooper, Neapolitan, Koller, Friedman.
Beware of the DAG! Unsuited for: Feed-back loops and equilibria; Symmetric relationships; Constrained systems. Intrinsic limitations: Markov equivalences; Imperfect data (measurement errors, quantization, aggregation). Assumptions: Causal sufficiency; Causal Markov assumption; Causal faithfulness; Linearity & Gaussianity.
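The Markov equivalence limitation can be made concrete with a small simulation. The following is a sketch under assumed linear Gaussian models with unit noise; the three generators and the helper names are hypothetical. The chain X → Y → Z and the reversed chain X ← Y ← Z imply the same conditional independence (X independent of Z given Y), so observational data cannot tell them apart. The collider X → Y ← Z leaves a different signature: X and Z are independent marginally but become dependent given Y.

```python
import random

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def partial_corr(xs, zs, ys):
    """Correlation of x and z after linearly controlling for y."""
    r_xz, r_xy, r_zy = pearson(xs, zs), pearson(xs, ys), pearson(zs, ys)
    return (r_xz - r_xy * r_zy) / ((1 - r_xy ** 2) * (1 - r_zy ** 2)) ** 0.5

def chain(rng):            # X -> Y -> Z
    x = rng.gauss(0, 1)
    y = x + rng.gauss(0, 1)
    z = y + rng.gauss(0, 1)
    return x, y, z

def reversed_chain(rng):   # X <- Y <- Z
    z = rng.gauss(0, 1)
    y = z + rng.gauss(0, 1)
    x = y + rng.gauss(0, 1)
    return x, y, z

def collider(rng):         # X -> Y <- Z
    x = rng.gauss(0, 1)
    z = rng.gauss(0, 1)
    y = x + z + rng.gauss(0, 1)
    return x, y, z

def signature(generator, n=5000, seed=0):
    """Return (marginal corr of X and Z, partial corr of X and Z given Y)."""
    rng = random.Random(seed)
    xs, ys, zs = zip(*(generator(rng) for _ in range(n)))
    return pearson(xs, zs), partial_corr(xs, zs, ys)

chain_marginal, chain_partial = signature(chain)
rev_marginal, rev_partial = signature(reversed_chain)
coll_marginal, coll_partial = signature(collider)
print("chain         ", round(chain_marginal, 2), round(chain_partial, 2))
print("reversed chain", round(rev_marginal, 2), round(rev_partial, 2))
print("collider      ", round(coll_marginal, 2), round(coll_partial, 2))
```

Both chains give a clearly non-zero marginal correlation and a near-zero partial correlation, so only the collider is identifiable from these dependencies alone.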
Success stories 1. Genetic epidemiology: Towards personalized medicine. 2. Mendelian randomization: Resolving reverse causation & confounding. 3. Systems biology: Reverse engineering the cell. 4. Social sciences: Assisting policy-making.
Causality and time. Everyday notion of causality: The causes precede their effects. Is that always true? Delayed measurements; Final cause (objective); Reverse causation. Other difficulties: Non i.i.d. samples: redundant; correlation misleading. Confounding is still a problem. Seasonality. Censored data.
CO2 and temperature. Over the last 650,000 years, CO2 has correlated with temperature, but … CO2 lags behind temperature by several hundred years.
Climate changes … Meehl et al. (2004). "Combinations of Natural and Anthropogenic Forcings in Twentieth-Century Climate". Journal of Climate 17: 3721-3727.
Other time series Japan
Success story: Granger causality (Nobel prize, 2003). Co-integration: Elimination of spurious correlations in non-stationary time series (x(t) and y(t) are non-stationary, but a x(t) + b y(t) is stationary). Granger causality: x(t) → y(t) if f(past x, past y) predicts y(t) better than f(past y). Co-integrated time series are in a Granger causality relationship. [Does not eliminate confounding.]
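The Granger criterion, comparing f(past x, past y) with f(past y), can be sketched in a few lines. This is a minimal illustration and not a full test: it assumes zero-mean series, linear predictors with a single lag, and it reports the in-sample relative drop in squared error instead of the F-statistic a proper Granger test would use. The simulated series and all function names are hypothetical.

```python
import random

def rss_restricted(target, own_past):
    """Residual sum of squares of target(t) ~ a * target(t-1)."""
    a = sum(p * t for p, t in zip(own_past, target)) / sum(p * p for p in own_past)
    return sum((t - a * p) ** 2 for p, t in zip(own_past, target))

def rss_full(target, own_past, other_past):
    """Residual sum of squares of target(t) ~ a * target(t-1) + b * source(t-1)."""
    s_oo = sum(p * p for p in own_past)
    s_qq = sum(q * q for q in other_past)
    s_oq = sum(p * q for p, q in zip(own_past, other_past))
    s_ot = sum(p * t for p, t in zip(own_past, target))
    s_qt = sum(q * t for q, t in zip(other_past, target))
    det = s_oo * s_qq - s_oq ** 2            # 2x2 normal equations, Cramer's rule
    a = (s_ot * s_qq - s_qt * s_oq) / det
    b = (s_qt * s_oo - s_ot * s_oq) / det
    return sum((t - a * p - b * q) ** 2
               for p, q, t in zip(own_past, other_past, target))

def granger_gain(source, target):
    """Relative drop in squared error when the past of source is added (lag 1)."""
    tgt, own, other = target[1:], target[:-1], source[:-1]
    return 1 - rss_full(tgt, own, other) / rss_restricted(tgt, own)

# Assumed toy system: x drives y with a one-step delay, y does not drive x.
rng = random.Random(0)
x, y = [0.0], [0.0]
for _ in range(2000):
    x_next = 0.5 * x[-1] + rng.gauss(0, 1)
    y_next = 0.5 * y[-1] + 0.8 * x[-1] + rng.gauss(0, 1)
    x.append(x_next)
    y.append(y_next)

gain_x_to_y = granger_gain(x, y)
gain_y_to_x = granger_gain(y, x)
print(f"x -> y gain: {gain_x_to_y:.3f}, y -> x gain: {gain_y_to_x:.3f}")
```

A large gain in one direction and a negligible one in the other is the pattern the criterion looks for. As noted on the slide, a hidden common driver of both series would produce the same pattern, so this does not eliminate confounding.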
Open problems Final objective optimization Assessment methods Common assumptions Robust models Tradeoff efficiency/efficacy Data representation Data imperfections Heterogeneous information Mix observations and experiments Quantify uncertainty
Advertisement
The causality workbench What is the causal question? Why should we care? What is hard about it? Is this solvable? Is this a good benchmark? http://clopinet.com/causality
Causation and Prediction challenge Challenge datasets Toy datasets
Pot-Luck challenge
Task                Participants (views)
CYTO                2 (609)
LOCANET             10 (1372)
PROMO               3 (862)
SIGNET              2 (918)
TIED                1 (551)
CauseEffectPairs    5 (580)
Stemmatology        0 (372)
Type labels: real, self eval, real, artif, artif, self eval, artif, artif, real, real, self eval.
CYTO: Causal Protein-Signaling Networks in human T cells. Learn a protein signaling network from multicolor flow cytometry data. N=11 proteins, P~800 samples per experimental condition, E=9 conditions.
LOCANET: LOcal CAusal NETwork. Find the local causal structure around a given target variable (depth 3 network) in REGED, CINA, SIDO, MARTI.
PROMO: Simulated marketing task. Time series of 1000 promotion variables and 100 product sales. Predict a 1000x100 boolean influence matrix, indicating for each (i,j) element whether the ith promotion has a causal influence on the sales of the jth product. Data is provided as time series, with a daily value for each variable for three years.
SIGNET: Abscisic Acid Signaling Network. Determine the set of 43 boolean rules that describe the interactions of the nodes within a plant signaling network. 300 separate Boolean pseudodynamic simulations of the true rules. Model inspired by a true biological system.
TIED: Target Information Equivalent Dataset. Illustrates a case in which there are many equivalent Markov boundaries. Find them all.
CAUSEEFFECTPAIRS: Find the causal direction in eight pairs of variables.
STEMMATOLOGY: Reconstruct a family tree of documents derived from one another.
Other donated datasets
Task       Views
WebLogs    272
MIDS       232
NOISE      247
SECOM      297
SEFTI      280
Type labels: real, self eval, artif, real, artif, real, real.
WebLogs: Recover the links from page to page from the number of daily hits. The network consists of 20 pages. The training data includes 512 days.
MIDS: Mixed Dynamic Systems. Simulated time-series data of 9 variables based on linear Gaussian models with no latent common causes, but with multiple dynamic processes.
NOISE: Real and simulated EEG data. The goal is to find which region of the brain influences which other region.
SECOM: Semiconductor manufacturing. Find the causes of failure in ~60 features corresponding to measurements in a fab line (classification problem).
SEFTI: Semiconductor manufacturing. Here the problem is a regression problem: find the tools that are guilty of performance degradation.
http://clopinet.com/causality
Lessons learnt. Causation and prediction challenge: Knowing the true causal relationships yields better models. Regular feature selection is hard to beat in practice. Cause-effect pairs task of the pot-luck challenge [problem posed by Mooij, Janzing, Schölkopf]: The winners identified 8/8 correct causal directions. Methodology: We need to stage our effort: Address sub-problems (like ranking causes); Mix observations and experiments.
Proceedings 1) JMLR W&CP Volume 3: Causation and Prediction Challenge (WCCI 2008). I. Guyon, C. Aliferis, G. Cooper, A. Elisseeff, J.-P. Pellet, P. Spirtes, A. Statnikov, Eds. http://jmlr.csail.mit.edu/proceedings/papers/v3/ 2) JMLR W&CP Volume 6 (in press): Objective and Assessment Workshop (NIPS 2008). I. Guyon, D. Janzing, B. Schölkopf, Eds. http://jmlr.csail.mit.edu/proceedings/papers/v6/
Coming soon … Virtual Laboratory. Workshops: NIPS09: Causality and time series analysis mini-symposium http://clopinet.com/isabelle/Projects/NIPS2009/; WCCI09: Active learning. Challenges: End 2009/2010: Active learning. End 2010/2011: Experimental Design in Causal Modeling (ExpDeCo). 2012: Causal Model for System Identification and Control (CoMSIco).
Conclusion. Causal discovery from observational data is not an impossible task, but a very hard one. This points to the need for further research and benchmarks: Combining experiments and observations. Exploring both efficiency and efficacy. Connecting with related disciplines: RL, control. Don’t miss the upcoming events!