Published by Erica Moody. Modified over 6 years ago.
1
Fast, Accurate Causal Search Algorithms from the Center for Causal Discovery (CCD)
The CCD Algorithms Group
University of Pittsburgh
Carnegie Mellon University
Pittsburgh Supercomputing Center
Yale University
BD2K All Hands Meeting, 11/29/2016

The project is a collaborative effort among investigators at Pitt, UPMC, CMU, and the Pittsburgh Supercomputing Center. Some of the key personnel on the project have been collaborating on methods for causal modeling and discovery for more than 20 years. One driving biomedical problem will be the discovery of cell signaling pathways in several cancers, including breast, lung, and colon cancer. Understanding these pathways is central to developing drug therapies that can effectively treat the cancers. Another driving biomedical problem will be the discovery of the mechanisms of disease in COPD. Both of these biomedical problems are translational and involve using data that span from the molecular to the clinical.
2
Causal Discovery in Biomedicine
Science is centrally concerned with the discovery of causal relationships in nature
  Understanding
  Control
Examples:
  Determine the genes and cell signaling pathways that cause breast cancer
  Discover the clinical effects of a new drug
  Uncover the mechanisms of pathogenicity of a recently mutated virus that is spreading rapidly in the population

NIH recently announced a funding opportunity for developing methods that help derive biomedical knowledge from big data. Six to eight Centers of Excellence will be supported for up to 4 years starting next year.
3
Why Establish a Center for Causal Discovery Now?
Algorithmic Advances + Availability of Big Biomedical Data
4
Algorithmic Advances
In the past 25 years, there has been tremendous progress in the development of computational methods for representing and discovering causal networks from a combination of observational data, experimental data, and knowledge. These methods are generally applicable to biomedical data.
5
Availability of Big Biomedical Data
The variety, richness, and quantity of biomedical data have been increasing very rapidly. The appropriate analysis of these data has great potential to advance biomedical science.
6
Primary Goals of the CCD
Goal 1. Develop and implement state-of-the-art methods for discovering causal knowledge from biomedical big data using causal graphical models
  Make some of the best existing causal discovery methods available as free, open source software
  Develop new methods and make them available
Goal 2. Investigate three biomedical projects (cancer, lung disease, brain functional connectivity) to evaluate methods and drive their further development
Goal 3. Disseminate causal discovery software and knowledge widely to biomedical researchers and data scientists

A group of investigators in the School of Medicine at the University of Pittsburgh will be submitting an application on the topic of modeling and discovery of causal networks from big biomedical data. The primary aims will be to advance the representation, discovery, and uses of causal network models when applied to very large biomedical datasets.
7
Typical Causal Analysis Workflow
[Workflow diagram: Prior Knowledge, Data, Causal Analysis, Causal Network]

Causal knowledge is important because it lets us estimate the effects of possible actions, which can guide which actions we choose to take. The concept is a general one: it includes predicting the response of a cell to a drug, as well as predicting how a given patient is likely to respond to alternative surgical procedures.
11
Typical Causal Analysis Workflow
[Workflow diagram: Prior Knowledge, Data, Causal Analysis, Causal Network, Causal Hypothesis Generation by Biomedical Scientists, Experiments]
12
Basic Components Needed to Learn Causal Networks from Data
Model representation
Model evaluation
Model search
13
Model Representation: Causal Bayesian Network (CBN)
Directed acyclic graph
  Nodes represent variables
  Arcs represent causal influence
Specify P(X | parents(X)) for each X
This figure is adapted from: Sachs K, et al. Protein-signaling networks learned from multi-parameter single-cell data of human T cells. Science 308 (2005)
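As a minimal sketch of what "specify P(X | parents(X)) for each X" buys, the block below encodes a three-node CBN and multiplies the local conditional probabilities to get the joint distribution. The network A -> B -> C, the variable names, and every probability are hypothetical and chosen only for illustration; they are not taken from the talk or from the Sachs et al. figure.

```python
from itertools import product

# Hypothetical CBN over binary variables: A -> B -> C.
# Each variable lists its parents; the graph is a directed acyclic graph.
parents = {"A": (), "B": ("A",), "C": ("B",)}

# P(X = 1 | parents(X)), keyed by the tuple of parent values.
# All numbers are made up for illustration.
p_one = {
    "A": {(): 0.3},
    "B": {(0,): 0.1, (1,): 0.8},
    "C": {(0,): 0.2, (1,): 0.7},
}


def local_prob(var, assignment):
    """Return P(var = assignment[var] | parents(var)) for one variable."""
    parent_values = tuple(assignment[p] for p in parents[var])
    p1 = p_one[var][parent_values]
    return p1 if assignment[var] == 1 else 1.0 - p1


def joint_prob(assignment):
    """Joint probability as the product of the local conditional probabilities."""
    prob = 1.0
    for var in parents:
        prob *= local_prob(var, assignment)
    return prob


# The joint probabilities over all 2^3 assignments should sum to 1.
total = sum(
    joint_prob(dict(zip(parents, values)))
    for values in product((0, 1), repeat=len(parents))
)
print("P(A=1, B=1, C=1) =", joint_prob({"A": 1, "B": 1, "C": 1}))
print("Sum over all assignments =", total)
```

The point of the factorization is that only the local tables need to be stored and estimated, not the full joint table.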
14
Model Representation with CBNs
15
Model Representation Issues
16
Model Evaluation
Constraint based (e.g., tests of conditional independence)
Score based (e.g., Bayesian scores)
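To illustrate the constraint-based route, here is a sketch of one common conditional independence test for continuous data: a partial correlation followed by the Fisher z-transform. The talk does not say which test the CCD software uses, so treat this choice, the simulated common-cause data, and all function names as assumptions. Score-based evaluation is not shown here.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """Correlation of x and y after linearly adjusting for a single variable z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def fisher_z_test(r, n, k):
    """Test statistic and two-sided p-value for H0: (partial) correlation is zero.

    n is the sample size and k is the number of conditioning variables.
    """
    stat = math.atanh(r) * math.sqrt(n - k - 3)
    p_value = math.erfc(abs(stat) / math.sqrt(2.0))
    return stat, p_value


# Hypothetical data: Z is a common cause of X and Y (X <- Z -> Y).
rng = random.Random(0)
n = 2000
zs = [rng.gauss(0, 1) for _ in range(n)]
xs = [z + rng.gauss(0, 1) for z in zs]
ys = [z + rng.gauss(0, 1) for z in zs]

stat_marginal, p_marginal = fisher_z_test(pearson(xs, ys), n, 0)
stat_conditional, p_conditional = fisher_z_test(partial_corr(xs, ys, zs), n, 1)
print("X vs Y:           z = %.2f, p = %.3g" % (stat_marginal, p_marginal))
print("X vs Y given Z:   z = %.2f, p = %.3g" % (stat_conditional, p_conditional))
```

In data generated this way, X and Y are dependent marginally but independent given Z, which is the kind of pattern a constraint-based search uses to rule out a direct arc between X and Y.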
17
What is the Big Data Problem on which the CCD is Primarily Focused?
20
The Number of Causal Model Structures as a Function of the Number of Measured Variables*

Number of variables (nodes)    Number of causal model structures
1                              1
2                              3
3                              25
4                              543
5                              29,281
6                              3,781,503
7                              1.1 x 10^9
8                              7.8 x 10^11
9                              1.2 x 10^15
10                             4.2 x 10^18

* Assumes there are no latent variables and no directed cycles.
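The counts in the table above are the numbers of labeled directed acyclic graphs on n nodes. A short sketch that reproduces them uses Robinson's recurrence, a(n) = sum over k = 1..n of (-1)^(k+1) * C(n, k) * 2^(k(n-k)) * a(n-k) with a(0) = 1. The recurrence is a standard result and is not stated in the talk; the function name is ours.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled DAGs on n nodes, via Robinson's recurrence."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


for n in range(1, 11):
    print(n, count_dags(n))
```

The super-exponential growth is why exhaustive enumeration of structures is infeasible beyond a handful of variables and why search heuristics are needed.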
21
Our Main Big Data Problem
Analyze biomedical datasets containing a large number of variables in order to generate plausible hypotheses of the causal relationships that hold among those variables
22
An Example Algorithm for Causal Discovery with Many Variables: FGES
GES: A popular CBN learning algorithm that uses greedy search and Bayesian scoring*
We developed a fast version of GES, called FGES
  Optimized the single processor version of GES
  Parallelized GES
* Chickering DM. Optimal structure identification with greedy search. Journal of Machine Learning Research 3 (2002)
23
Evaluation of FGES
Generated 10 random CBNs
  30,000 nodes and 60,000 edges
  Continuous variables with linear relationships and Gaussian noise
Sampled each CBN to generate 1,000 cases
Provided those cases to FGES and measured its ability to learn the data-generating CBN

Average directed arc precision: 99%
Average directed arc recall: 84%
Number of processors: 128
Average learning time: 2.3 minutes

For more information: Ramsey J, Glymour C. A Million Variables and More: The Fast Greedy Search (FGS) Algorithm for Learning High Dimensional Graphical Causal Models (to appear).
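For readers unfamiliar with the two accuracy measures, the sketch below computes directed arc precision and recall for a toy pair of graphs. It assumes the simplest definition (an arc counts as correct only if it appears with the same orientation in the data-generating graph); the talk does not spell out the exact definition used in the FGES evaluation, and the example arcs are hypothetical.

```python
def arc_precision_recall(true_arcs, learned_arcs):
    """Precision and recall of directed arcs, each given as (cause, effect) pairs.

    Precision: fraction of learned arcs that are in the true graph.
    Recall: fraction of true arcs that were learned.
    An empty denominator yields 0.0 by convention in this sketch.
    """
    true_set, learned_set = set(true_arcs), set(learned_arcs)
    correct = true_set & learned_set
    precision = len(correct) / len(learned_set) if learned_set else 0.0
    recall = len(correct) / len(true_set) if true_set else 0.0
    return precision, recall


# Hypothetical data-generating graph and learned graph.
true_arcs = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")]
learned_arcs = [("A", "B"), ("C", "B"), ("C", "D")]  # B -> C was reversed, A -> D missed

precision, recall = arc_precision_recall(true_arcs, learned_arcs)
print("precision = %.3f, recall = %.3f" % (precision, recall))
```

Read this way, a precision near 99% means almost every arc the algorithm outputs is correct, while a lower recall means some true arcs are left out.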
24
Another Example of an Algorithm for Causal Discovery with Many Variables: GFCI
FGES assumes there are no latent confounders, that is, there are no latent variables that cause two or more measured variables
Biomedical data often contain latent confounders
GFCI* allows for the possibility of latent confounders
* Ogarrio JM, Spirtes P, Ramsey J (2016). A hybrid causal search algorithm for latent variable models. JMLR Workshop and Conference Proceedings, 52.
25
Evaluation of GFCI
Generated more than 100 random CBNs
  1,000 nodes and 2,000 edges
  Continuous variables with linear Gaussian relationships
Sampled each CBN to generate 2,000 cases
Provided cases to GFCI and measured its performance

Percentage of latent nodes: 5%
Average directed arc precision: 92%
Average directed arc recall: 93%
Number of processors: 1
Average learning time: 15 seconds

For more information: Ogarrio JM, Spirtes P, Ramsey J (2016). A hybrid causal search algorithm for latent variable models. JMLR Workshop and Conference Proceedings, 52.
26
Ongoing Algorithm Work Includes …
Modeling non-linear relationships
Modeling causal feedback
Handling a mixture of continuous and discrete variables
Outputting uncertainty in edge relationships
Learning the causal relationships among latent variables
27
Summary
Causal discovery is central to biomedical science
The variety, richness, and quantity of biomedical data are increasing rapidly
The CCD is providing software now for analyzing big biomedical data to discover causal relationships
Causal discovery algorithms with additional capabilities will soon be available as well
28
Acknowledgements
Thanks to the members of the Algorithms Group of the Center for Causal Discovery for their contributions to the activities described in this talk.
The Center for Causal Discovery is supported by grant U54HG008540 awarded by the National Human Genome Research Institute through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative. The content of this presentation is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
29
CCD software is available at:
Thank you
31
Extra Slides
32
Association Versus Causation
Association:
  Represents statistical relationships
  Predicts outcomes from passive observations
  Example uses: classification and regression
Causation:
  Represents mechanisms
  Predicts outcomes of active intervention
  Example uses: decision making and planning
33
Example
Association:
  Smoking – lung cancer – coughing
  Both smoking and coughing predict lung cancer
Causation:
  Smoking → lung cancer → coughing
  Smoking influences lung cancer
  Coughing does not influence lung cancer
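The distinction in this example can be made concrete with a small calculation. The sketch below assumes the causal structure smoking -> lung cancer -> coughing with made-up probabilities (none of the numbers come from the talk) and compares conditioning on an observation with setting a variable by intervention. An intervention is modeled by replacing the intervened variable's own conditional distribution with a fixed value.

```python
from itertools import product

# Hypothetical parameters for smoking (S) -> lung cancer (L) -> coughing (C).
P_SMOKING = 0.3
P_CANCER_GIVEN_SMOKING = {0: 0.01, 1: 0.10}
P_COUGH_GIVEN_CANCER = {0: 0.10, 1: 0.80}


def joint(s, l, c, do):
    """Joint probability of (s, l, c); variables named in `do` are set by intervention."""

    def factor(name, value, p_one):
        if name in do:  # intervention: ignore the variable's usual causes
            return 1.0 if value == do[name] else 0.0
        return p_one if value == 1 else 1.0 - p_one

    return (
        factor("S", s, P_SMOKING)
        * factor("L", l, P_CANCER_GIVEN_SMOKING[s])
        * factor("C", c, P_COUGH_GIVEN_CANCER[l])
    )


def prob_cancer(given=None, do=None):
    """P(L = 1 | given) under the interventions in `do`, by exact enumeration."""
    given, do = given or {}, do or {}
    numerator = denominator = 0.0
    for s, l, c in product((0, 1), repeat=3):
        values = {"S": s, "L": l, "C": c}
        if any(values[name] != v for name, v in given.items()):
            continue
        p = joint(s, l, c, do)
        denominator += p
        numerator += p * l
    return numerator / denominator


print("Observe coughing:     ", prob_cancer(given={"C": 1}))
print("Observe no coughing:  ", prob_cancer(given={"C": 0}))
print("Force coughing:       ", prob_cancer(do={"C": 1}))
print("Prevent coughing:     ", prob_cancer(do={"C": 0}))
print("Force smoking:        ", prob_cancer(do={"S": 1}))
print("Prevent smoking:      ", prob_cancer(do={"S": 0}))
```

Observing a cough raises the probability of lung cancer, but forcing or preventing a cough leaves it unchanged, whereas intervening on smoking does change it.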
34
Recent Examples of the Use of Graphical Causal Discovery Methods
Manelis, A., Almeida, J. R. C., Stiffler, R., Lockovich, J. C., Aslam, H. A., & Phillips, M. L. (2016). Anticipation-related brain connectivity in bipolar and unipolar depression: A graph theory approach. Brain, 139.
Dobryakova, E., Costa, S. L., Wylie, G. R., DeLuca, J., & Genova, H. M. (2016). Altered effective connectivity during a processing speed task in individuals with multiple sclerosis. Journal of the International Neuropsychological Society: JINS, 22(2).
Otsuka, J. (2016). Discovering phenotypic causal structure from nonexperimental data. Journal of Evolutionary Biology, 29(6).
Attur, M., Statnikov, A., Samuels, J., Li, Z., Alekseyenko, A. V., Greenberg, J. D., et al. (2015). Plasma levels of interleukin-1 receptor antagonist (IL1Ra) predict radiographic progression of symptomatic knee osteoarthritis. Osteoarthritis and Cartilage, 23(11).