Fast, Accurate Causal Search Algorithms from the Center for Causal Discovery (CCD) The CCD Algorithms Group University of Pittsburgh Carnegie Mellon.

Presentation transcript:

Fast, Accurate Causal Search Algorithms from the Center for Causal Discovery (CCD). The CCD Algorithms Group: University of Pittsburgh, Carnegie Mellon University, Pittsburgh Supercomputing Center, Yale University. BD2K All Hands Meeting, 11/29/2016. The project is a collaborative effort among investigators at Pitt, UPMC, CMU, and the Pittsburgh Supercomputing Center. Some of the key personnel on the project have been collaborating on methods for causal modeling and discovery for more than 20 years. One driving biomedical problem will be the discovery of cell signaling pathways in several cancers, including breast, lung, and colon cancer. Understanding these pathways is central to developing drug therapies that can effectively treat the cancers. Another driving biomedical problem will be the discovery of the mechanisms of disease in COPD. Both of these biomedical problems are translational and involve using data that span from the molecular to the clinical.

Causal Discovery in Biomedicine. Science is centrally concerned with the discovery of causal relationships in nature, for both understanding and control. Examples: determine the genes and cell signaling pathways that cause breast cancer; discover the clinical effects of a new drug; uncover the mechanisms of pathogenicity of a recently mutated virus that is spreading rapidly in the population. NIH recently announced a funding opportunity for developing methods that help derive biomedical knowledge from big data. Six to eight Centers of Excellence will be supported for up to 4 years starting next year.

Why Establish a Center for Causal Discovery Now? Algorithmic Advances + Availability of Big Biomedical Data

Algorithmic Advances. In the past 25 years, there has been tremendous progress in the development of computational methods for representing and discovering causal networks from a combination of observational data, experimental data, and knowledge. These methods are generally applicable to biomedical data.

Availability of Big Biomedical Data. The variety, richness, and quantity of biomedical data have been increasing very rapidly. The appropriate analysis of these data has great potential to advance biomedical science. (Image: http://aldousvoice.files.wordpress.com/2014/06/database.jpg)

Primary Goals of the CCD. Goal 1: Develop and implement state-of-the-art methods for discovering causal knowledge from biomedical big data using causal graphical models. Make some of the best existing causal discovery methods available as free, open source software; develop new methods and make them available. Goal 2: Investigate three biomedical projects (cancer, lung disease, brain functional connectivity) to evaluate methods and drive their further development. Goal 3: Disseminate causal discovery software and knowledge widely to biomedical researchers and data scientists. A group of investigators in the School of Medicine at the University of Pittsburgh will be submitting an application on the topic of modeling and discovery of causal networks from big biomedical data. The primary aims will be to advance the representation, discovery, and uses of causal network models when applied to very large biomedical datasets.

Typical Causal Analysis Workflow. Diagram (built up over several slides): Prior Knowledge and Data feed Causal Analysis, which yields a Causal Network; the network supports Causal Hypothesis Generation by Biomedical Scientists, which leads to Experiments, which in turn produce new Data. Causation is important because causal knowledge lets us estimate the effects of possible actions, which can guide which actions we choose to take. The concept is a general one and includes, for example, predicting the response of a cell to a drug and predicting how a given patient is likely to respond to alternative surgical procedures.

Basic Components Needed to Learn Causal Networks from Data: model representation, model evaluation, and model search.

Model Representation. Causal Bayesian network (CBN): a directed acyclic graph in which nodes represent variables and arcs represent causal influence. Specify P(X | parents(X)) for each variable X. This figure is adapted from: Sachs K, et al. Causal protein-signaling networks derived from multiparameter single-cell data. Science 308 (2005) 523-529.
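To make the representation concrete, here is a minimal sketch of a CBN stored as a mapping from each variable to its parents and a table for P(X | parents(X)). The three-node chain (smoking, lung_cancer, coughing), the dictionary layout, the function name, and every probability are assumptions chosen for illustration; none of them come from the talk or from CCD software. The joint probability of a complete assignment is the product of the local conditional probabilities.

```python
from itertools import product

# Hypothetical three-node chain: smoking -> lung_cancer -> coughing.
# Each variable maps to (parents, table), where the table gives
# P(variable = True | parent values). All numbers are illustrative only.
network = {
    "smoking": ((), {(): 0.2}),
    "lung_cancer": (("smoking",), {(True,): 0.05, (False,): 0.01}),
    "coughing": (("lung_cancer",), {(True,): 0.7, (False,): 0.1}),
}


def joint_probability(assignment):
    """Return P(assignment) as the product of P(X | parents(X)) over all X."""
    prob = 1.0
    for var, (parents, table) in network.items():
        p_true = table[tuple(assignment[p] for p in parents)]
        prob *= p_true if assignment[var] else 1.0 - p_true
    return prob


# Sanity check: the factorized joint distribution sums to 1.
variables = list(network)
total = sum(
    joint_probability(dict(zip(variables, values)))
    for values in product([True, False], repeat=len(variables))
)
print("sum over all assignments:", round(total, 10))
print(
    "P(smoking, lung_cancer, coughing all true):",
    joint_probability({"smoking": True, "lung_cancer": True, "coughing": True}),
)
```

With binary variables a table needs one number per parent configuration, so the representation stays compact as long as each node has few parents.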

Model Representation with CBNs

Model Representation Issues

Model Evaluation. Constraint based (e.g., tests of conditional independence). Score based (e.g., Bayesian scores).
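As an illustration of the constraint-based side, the sketch below runs one common conditional independence test for linear Gaussian data: a partial correlation combined with the Fisher z transform. This is a generic textbook test, not necessarily the one used in CCD software; the function names, the simulated chain X -> Z -> Y, and the sample size are assumptions made for the example.

```python
import math
import random
from statistics import NormalDist


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(x, y, z):
    """Partial correlation of x and y given a single variable z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


def fisher_z_pvalue(r, n, num_conditioned):
    """Two-sided p-value for the null hypothesis that the correlation is zero."""
    stat = math.atanh(r) * math.sqrt(n - num_conditioned - 3)
    return 2 * (1 - NormalDist().cdf(abs(stat)))


# Simulate a causal chain X -> Z -> Y with unit-variance Gaussian noise.
rng = random.Random(0)
n = 5000
x = [rng.gauss(0, 1) for _ in range(n)]
z = [xi + rng.gauss(0, 1) for xi in x]
y = [zi + rng.gauss(0, 1) for zi in z]

p_marginal = fisher_z_pvalue(pearson(x, y), n, 0)
p_given_z = fisher_z_pvalue(partial_corr(x, y, z), n, 1)
print("X independent of Y?           p =", p_marginal)
print("X independent of Y given Z?   p =", p_given_z)
```

In the chain, X and Y are dependent marginally but independent once Z is held fixed, so the first p-value should be very small and the second should not be. Patterns of this kind are what constraint-based search uses to rule graph structures in or out.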

What is the Big Data Problem on which the CCD is Primarily Focused?

The Number of Causal Model Structures as a Function of the Number of Measured Variables*

Number of variables (nodes)    Number of causal model structures
1                              1
2                              3
3                              25
4                              543
5                              29,281
6                              3,781,503
7                              1.1 x 10^9
8                              7.8 x 10^11
9                              1.2 x 10^15
10                             4.2 x 10^18

* Assumes there are no latent variables and no directed cycles.
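The counts in this table are the numbers of labeled directed acyclic graphs on n nodes. The slides do not say how they were computed; one standard way, shown here as a sketch, is Robinson's recurrence a(n) = sum over k = 1..n of (-1)^(k+1) * C(n, k) * 2^(k(n-k)) * a(n-k), with a(0) = 1. The function name count_dags is an assumption for the example.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled DAGs on n nodes, via Robinson's recurrence."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


# Print the count for each number of variables shown on the slide.
for n in range(1, 11):
    print(n, count_dags(n))
```

The super-exponential growth is the reason exhaustive enumeration of structures is infeasible beyond a handful of variables, and why search heuristics are needed.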

Our Main Big Data Problem. Analyze biomedical datasets containing a large number of variables in order to generate plausible hypotheses of the causal relationships that hold among those variables.

An Example Algorithm for Causal Discovery with Many Variables: FGES. GES is a popular CBN learning algorithm that uses greedy search and Bayesian scoring.* We developed a fast version of GES, called FGES: we optimized the single-processor version of GES and parallelized it. * Chickering DM. Optimal structure identification with greedy search. Journal of Machine Learning Research 3 (2002) 507-554.

Evaluation of FGES. Generated 10 random CBNs, each with 30,000 nodes and 60,000 edges, over continuous variables with linear relationships and Gaussian noise. Sampled each CBN to generate 1,000 cases. Provided those cases to FGES and measured its ability to learn the data-generating CBN.

Average Directed Arc Precision    Average Directed Arc Recall    # Processors    Average Learning Time
99%                               84%                            128             2.3 minutes

For more information: http://arxiv.org/ftp/arxiv/papers/1507/1507.07749.pdf Ramsey J, Glymour C. A Million Variables and More: The Fast Greedy Search (FGS) Algorithm for Learning High Dimensional Graphical Causal Models (to appear).
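For readers unfamiliar with the two accuracy measures, the sketch below shows one common way to compute directed arc precision and recall from a true graph and a learned graph. The exact definition used in the evaluation (for example, how undirected edges in the output are counted) is not given on the slide, so this definition, the function name, and the tiny example graphs are all assumptions.

```python
def directed_arc_metrics(true_arcs, learned_arcs):
    """Precision and recall of learned directed arcs against the true arcs.

    Each arc is a (cause, effect) pair. An arc counts as correct only if
    it appears in the true graph with the same orientation.
    """
    true_set, learned_set = set(true_arcs), set(learned_arcs)
    correct = len(true_set & learned_set)
    precision = correct / len(learned_set) if learned_set else 1.0
    recall = correct / len(true_set) if true_set else 1.0
    return precision, recall


# Hypothetical graphs: the learner finds two true arcs, reverses one,
# and misses one entirely.
true_arcs = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
learned_arcs = [("A", "B"), ("B", "C"), ("D", "C")]

precision, recall = directed_arc_metrics(true_arcs, learned_arcs)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

Under this definition, high precision with lower recall means that the arcs the algorithm outputs are mostly correct, while some true arcs are missed or left unoriented.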

Another Example of an Algorithm for Causal Discovery with Many Variables: GFCI. FGES assumes there are no latent confounders, that is, no latent variables that cause two or more measured variables. Biomedical data often contain latent confounders. GFCI* allows for the possibility of latent confounders. * Ogarrio JM, Spirtes P, Ramsey J (2016). A hybrid causal search algorithm for latent variable models. JMLR Workshop and Conference Proceedings, 52, 368-379.

Evaluation of GFCI. Generated more than 100 random CBNs, each with 1,000 nodes and 2,000 edges, over continuous variables with linear Gaussian relationships. Sampled each CBN to generate 2,000 cases. Provided cases to GFCI and measured its performance.

% Latent Nodes    Average Directed Arc Precision    Average Directed Arc Recall    # Processors    Average Learning Time
5%                92%                               93%                            1               15 seconds

For more information: Ogarrio JM, Spirtes P, Ramsey J (2016). A hybrid causal search algorithm for latent variable models. JMLR Workshop and Conference Proceedings, 52, 368-379.

Ongoing Algorithm Work Includes: modeling non-linear relationships; modeling causal feedback; handling a mixture of continuous and discrete variables; outputting uncertainty in edge relationships; learning the causal relationships among latent variables.

Summary. Causal discovery is central to biomedical science. The variety, richness, and quantity of biomedical data are increasing rapidly. The CCD is providing software now for analyzing big biomedical data to discover causal relationships. Causal discovery algorithms with additional capabilities will soon be available as well.

Acknowledgements. Thanks to the members of the Algorithms Group of the Center for Causal Discovery for their contributions to the activities described in this talk. The Center for Causal Discovery is supported by grant U54HG008540 awarded by the National Human Genome Research Institute through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). The content of this presentation is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Thank you. gfc@pitt.edu. CCD software is available at: www.ccd.pitt.edu


Extra Slides

Association Versus Causation. Association: represents statistical relationships; predicts outcomes from passive observations; example uses are classification and regression. Causation: represents mechanisms; predicts outcomes of active intervention; example uses are decision making and planning.

Example. Association: Smoking – lung cancer – coughing. Both smoking and coughing predict lung cancer. Causation: Smoking → lung cancer → coughing. Smoking influences lung cancer. Coughing does not influence lung cancer.
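The difference between predicting from observation and predicting under intervention can be seen in a small simulation of the chain smoking → lung cancer → coughing. This is a sketch only: all probabilities, the function names, and the sample size are made-up assumptions for illustration, not estimates from real data.

```python
import random

rng = random.Random(1)


def simulate(n, force_coughing=None):
    """Sample n cases from the chain smoking -> lung cancer -> coughing.

    If force_coughing is True or False, coughing is set by intervention
    instead of being caused by lung cancer. Probabilities are illustrative.
    """
    cases = []
    for _ in range(n):
        smoking = rng.random() < 0.3
        cancer = rng.random() < (0.2 if smoking else 0.02)
        if force_coughing is None:
            coughing = rng.random() < (0.8 if cancer else 0.1)
        else:
            coughing = force_coughing
        cases.append((smoking, cancer, coughing))
    return cases


def cancer_rate(cases, keep=lambda case: True):
    """Fraction of the selected cases that have lung cancer."""
    selected = [case for case in cases if keep(case)]
    return sum(case[1] for case in selected) / len(selected)


# Passive observation: coughing is informative about lung cancer.
observed = simulate(100_000)
rate_coughers = cancer_rate(observed, lambda case: case[2])
rate_non_coughers = cancer_rate(observed, lambda case: not case[2])

# Active intervention: forcing or suppressing coughing leaves cancer unchanged.
rate_do_cough = cancer_rate(simulate(100_000, force_coughing=True))
rate_do_no_cough = cancer_rate(simulate(100_000, force_coughing=False))

print("observed:   ", rate_coughers, "vs", rate_non_coughers)
print("intervened: ", rate_do_cough, "vs", rate_do_no_cough)
```

In the observational data the cancer rate among coughers should be far higher than among non-coughers, while under intervention the two rates should be nearly equal, matching the slide's point that coughing predicts but does not influence lung cancer.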

Recent Examples of the Use of Graphical Causal Discovery Methods. Manelis, A., Almeida, J. R. C., Stiffler, R., Lockovich, J. C., Aslam, H. A., & Phillips, M. L. (2016). Anticipation-related brain connectivity in bipolar and unipolar depression: A graph theory approach. Brain, 139, 2554-2566. Dobryakova, E., Costa, S. L., Wylie, G. R., DeLuca, J., & Genova, H. M. (2016). Altered effective connectivity during a processing speed task in individuals with multiple sclerosis. Journal of the International Neuropsychological Society, 22(2), 216-224. Otsuka, J. (2016). Discovering phenotypic causal structure from nonexperimental data. Journal of Evolutionary Biology, 29(6), 1268-1277. Attur, M., Statnikov, A., Samuels, J., Li, Z., Alekseyenko, A. V., Greenberg, J. D., et al. (2015). Plasma levels of interleukin-1 receptor antagonist (IL1Ra) predict radiographic progression of symptomatic knee osteoarthritis. Osteoarthritis and Cartilage, 23(11), 1915-1924.