Reconstructing gene regulatory networks with probabilistic models. Marco Grzegorczyk, Dirk Husmeier

Regulatory network

Network unknown. High-throughput experiments, postgenomic data. Machine learning, statistics.

Overview Introduction Bayesian networks Comparative evaluation Integration of biological prior knowledge A non-homogeneous Bayesian network for non-stationary processes Current work

Introduction

Elementary molecular biological processes

Description with differential equations: rates, concentrations, kinetic parameters θ

Given: Gene expression time series. Can we infer the correct gene regulatory network?

Parameters θ known: Numerically integrate the differential equations for different hypothetical networks

Model selection for known parameters θ. Compare the gene expression time series predicted with different models against the measured gene expression time series. Highest likelihood: best model

Model selection for unknown parameters θ. Compare the gene expression time series predicted with different models against the measured gene expression time series. Highest likelihood: over-fitting

Bayesian model selection. Select the model with the highest posterior probability: $P(M|D) \propto P(D|M)\,P(M)$. This requires an integration over the whole parameter space: $P(D|M) = \int P(D|\theta, M)\,P(\theta|M)\,d\theta$. This integral is usually intractable
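The difference between picking the highest likelihood and picking the highest marginal likelihood can be seen on a toy problem. The sketch below is not from the talk: it assumes a hypothetical coin-flip data set (6 successes in 10 trials), a fixed model M0 with θ = 0.5, and a flexible model M1 with a uniform prior on θ. The names `likelihood`, `marginal_likelihood_numeric` and `marginal_likelihood_exact` are illustrative only.

```python
from math import factorial


def likelihood(theta, k, n):
    """Bernoulli likelihood of k successes in n trials for a given theta."""
    return theta ** k * (1 - theta) ** (n - k)


def marginal_likelihood_numeric(k, n, steps=10_000):
    """Integrate the likelihood over a uniform prior on theta (midpoint rule)."""
    h = 1.0 / steps
    return sum(likelihood((i + 0.5) * h, k, n) for i in range(steps)) * h


def marginal_likelihood_exact(k, n):
    """Closed form of the same integral: k! (n-k)! / (n+1)!."""
    return factorial(k) * factorial(n - k) / factorial(n + 1)


n, k = 10, 6                            # hypothetical data
lik_m0 = likelihood(0.5, k, n)          # M0 has no free parameter
max_lik_m1 = likelihood(k / n, k, n)    # M1 evaluated at its best-fitting theta
marg_m1 = marginal_likelihood_numeric(k, n)

print(f"M0 likelihood:           {lik_m0:.6f}")
print(f"M1 maximum likelihood:   {max_lik_m1:.6f}")
print(f"M1 marginal likelihood:  {marg_m1:.6f}")
```

In this toy setting the flexible model always wins on maximum likelihood, while integrating over θ penalises the extra flexibility. For ODE models of regulation the same integral has no closed form and numerical integration becomes expensive, which is the point the slide makes.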

Marginal likelihoods for the alternative pathways: computationally expensive, so network reconstruction ab initio is infeasible

Bayesian networks

Objective: Reconstruction of regulatory networks ab initio. Higher level of abstraction: Bayesian networks

Bayesian networks. [Figure: example graph with nodes A to F connected by directed edges] Marriage between graph theory and probability theory. Directed acyclic graph (DAG) representing conditional independence relations. It is possible to score a network in light of the data: P(D|M), D: data, M: network structure. We can infer how well a particular network explains the observed data.

Bayes net ODE model

Linear model: $[A] = w_1[P_1] + w_2[P_2] + w_3[P_3] + w_4[P_4] + \text{noise}$ [Figure: node A with parents P1 to P4 and edge weights w1 to w4]

Nonlinear discretized model. [Figure: parents P1 and P2 acting as activator or repressor; edges marked as activation or inhibition] Allow for noise: probabilities. Conditional multinomial distribution

Model, parameters θ: integral analytically tractable!

Example: 2 genes → 16 different network structures. Best network: maximum score
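One way to arrive at 16 structures for 2 genes is to count every subset of the four possible directed edges, self-loops included, as happens in a dynamic network where edges point from one time step to the next. That reading is an assumption here, since the slide does not say how the 16 structures are defined. A minimal enumeration under that assumption:

```python
from itertools import product

genes = ["X1", "X2"]

# Assumption: every ordered pair, including self-loops, is a possible edge.
possible_edges = [(src, dst) for src in genes for dst in genes]

# Each structure is one on/off pattern over the possible edges.
structures = [
    {edge for edge, on in zip(possible_edges, mask) if on}
    for mask in product([0, 1], repeat=len(possible_edges))
]

print(len(structures))  # 2 ** 4 = 16
```

With so few candidates every structure can be scored and the maximum picked directly.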

Identify the best network structure Ideal scenario: Large data sets, low noise

Uncertainty about the best network structure. Limited number of experimental replications, high noise

Sample of high-scoring networks

Feature extraction, e.g. marginal posterior probabilities of the edges

Sample of high-scoring networks Feature extraction, e.g. marginal posterior probabilities of the edges High-confident edge High-confident non-edge Uncertainty about edges
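The feature extraction step can be sketched as a simple average: the marginal posterior probability of an edge is estimated by the fraction of sampled networks that contain it. The sample below is made up for illustration, and `edge_posteriors` is a hypothetical helper name.

```python
def edge_posteriors(sampled_structures):
    """Fraction of sampled networks in which each directed edge appears."""
    n_samples = len(sampled_structures)
    counts = {}
    for edges in sampled_structures:
        for edge in edges:
            counts[edge] = counts.get(edge, 0) + 1
    return {edge: count / n_samples for edge, count in counts.items()}


# Hypothetical sample of four high-scoring networks over genes A, B, C.
sample = [
    {("A", "B"), ("B", "C")},
    {("A", "B")},
    {("A", "B"), ("C", "B")},
    {("A", "B"), ("B", "C")},
]

posteriors = edge_posteriors(sample)
for edge, prob in sorted(posteriors.items()):
    print(edge, prob)
```

Edges with a value near 1 are high-confidence edges, edges that never appear are high-confidence non-edges, and values in between express uncertainty.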

Can we generalize this scheme to more than 2 genes? In principle yes. However …

[Plot: number of structures against number of nodes]
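To give a feel for how fast the number of structures grows, the following sketch counts labelled directed acyclic graphs with Robinson's recurrence. It counts static DAGs, which is an assumption about what the plot shows; the function name `count_dags` is illustrative.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labelled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


for n in range(1, 11):
    print(n, count_dags(n))
```

The count grows super-exponentially with the number of nodes, which is why scoring every structure stops being an option after a handful of genes.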

Complete enumeration infeasible → hill climbing. Accept a move when the score increases

Configuration space of network structures Local optimum

Configuration space of network structures. MCMC with local changes: if $P(M_{new}|D) > P(M_{old}|D)$, accept; otherwise accept with probability $P(M_{new}|D) / P(M_{old}|D)$

Algorithm converges to $P(M|D)$

Madigan & York (1995), Giudici & Castelo (2003)
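The accept/reject scheme can be sketched on a toy problem. The code below is a simplified illustration, not the cited authors' implementation: it assumes a structure is a tuple of on/off flags for 4 possible edges, uses a made-up unnormalised score `toy_score`, proposes single-edge flips (a symmetric proposal, so no Hastings correction is needed), and skips the acyclicity check that real structure MCMC requires.

```python
import random
from itertools import product


def toy_score(structure):
    """Hypothetical unnormalised posterior score of a structure."""
    return 1.0 + 3.0 * structure[0] + 1.0 * structure[1] * structure[2]


def structure_mcmc(score, n_edges, n_iter, seed=0):
    """Metropolis sampler over edge on/off patterns using single-edge flips."""
    rng = random.Random(seed)
    current = (0,) * n_edges
    samples = []
    for _ in range(n_iter):
        i = rng.randrange(n_edges)                       # local change: flip one edge
        proposal = current[:i] + (1 - current[i],) + current[i + 1:]
        ratio = score(proposal) / score(current)
        if ratio >= 1 or rng.random() < ratio:           # uphill: always; downhill: sometimes
            current = proposal
        samples.append(current)
    return samples


samples = structure_mcmc(toy_score, n_edges=4, n_iter=50_000, seed=1)
edge0_frequency = sum(s[0] for s in samples) / len(samples)

# Exact marginal of edge 0 under the toy score, by enumeration.
all_structures = list(product([0, 1], repeat=4))
total = sum(toy_score(s) for s in all_structures)
edge0_exact = sum(toy_score(s) for s in all_structures if s[0] == 1) / total

print(f"sampled: {edge0_frequency:.3f}  exact: {edge0_exact:.3f}")
```

Because downhill moves are accepted with some probability, the chain can leave a local optimum, and the sampled edge frequency approaches the exact marginal as the run gets longer.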

Configuration space of network structures. Problem: Local changes → small steps → slow convergence, difficult to cross valleys.

Configuration space of network structures. Problem: Global changes → large steps → low acceptance → slow convergence.

Configuration space of network structures Can we make global changes that jump onto other peaks and are likely to be accepted?

[Figure: MCMC trace plots for the conventional scheme and the new scheme, plotted against iteration number]

Comparative evaluation

Example: Protein signalling pathway. [Figure labels: cell membrane, nucleus, TF] Phosphorylation -> cell response

Evaluation on the Raf signalling pathway. From Sachs et al., Science 2005. [Figure legend: cell membrane, receptor molecules, inhibition, activation, interaction in signalling pathway, phosphorylated protein]

Flow cytometry data. Intracellular multicolour flow cytometry experiments: concentrations of 11 proteins. 5400 cells have been measured under 9 different cellular conditions (cues). Downsampling to 100 instances (5 separate subsets): indicative of microarray experiments

Simulated data or “gold standard” from the literature

From Perry Sprawls

[Figure: ROC curve, 5 FP counts; methods compared: BN, GGM, RN]

ROC curve (TP against FP). Four different evaluation criteria: DGE, UGE, TP for fixed FP, area under the curve (AUC)
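Two of the criteria named on this slide, TP for fixed FP and the area under the ROC curve, can be computed directly from a ranking of edges. The sketch below is illustrative only: the edge scores and the gold standard are invented, and `auroc` and `tp_at_fixed_fp` are hypothetical helper names.

```python
def auroc(edge_scores, true_edges):
    """Area under the ROC curve: chance that a true edge outranks a non-edge."""
    pos = [s for e, s in edge_scores.items() if e in true_edges]
    neg = [s for e, s in edge_scores.items() if e not in true_edges]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def tp_at_fixed_fp(edge_scores, true_edges, max_fp):
    """True positives collected before the false positive count exceeds max_fp."""
    tp = fp = 0
    for edge, _ in sorted(edge_scores.items(), key=lambda kv: -kv[1]):
        if edge in true_edges:
            tp += 1
        else:
            fp += 1
            if fp > max_fp:
                break
    return tp


# Hypothetical edge scores (e.g. marginal posterior probabilities) and gold standard.
edge_scores = {("A", "B"): 0.9, ("B", "C"): 0.6, ("C", "A"): 0.4, ("B", "A"): 0.2}
true_edges = {("A", "B"), ("C", "A")}

print(auroc(edge_scores, true_edges))
print(tp_at_fixed_fp(edge_scores, true_edges, max_fp=1))
```

Scoring the ranking rather than a single thresholded network keeps the comparison between methods independent of any one cut-off.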

Synthetic data, observations Relevance networks Bayesian networks Graphical Gaussian models

Synthetic data, interventions

Cytometry data, interventions

Integration of biological prior knowledge

Can we complement microarray data with prior knowledge from public data bases like KEGG? KEGG pathway Microarray data

How do we extract prior knowledge from a collection of KEGG pathways?

Relative frequency of edge occurrence: $B_{i,j}$ = (total number of edges i → j that appear in the extracted pathways) / (total number of times the gene pair [i,j] is included in the extracted pathways). Example: Extract 20 pathways, 10 contain [i,j], 8 contain i → j: $B_{i,j} = 8/10 = 0.8$
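The relative-frequency rule can be written down in a few lines. This is a sketch under assumptions: each pathway is represented as a hypothetical dictionary with a `genes` set and an `edges` set, and a gene pair that never occurs together gets the uninformative value 0.5, a choice made here for illustration rather than taken from the slide.

```python
def edge_prior(pathways, i, j):
    """Relative frequency of edge i -> j among pathways that contain both genes."""
    n_pair = sum(1 for p in pathways if i in p["genes"] and j in p["genes"])
    n_edge = sum(1 for p in pathways if (i, j) in p["edges"])
    if n_pair == 0:
        return 0.5  # assumption: no information about this pair
    return n_edge / n_pair


# Reproduce the slide's example: 20 pathways, 10 contain [i, j], 8 contain i -> j.
pathways = (
    [{"genes": {"i", "j"}, "edges": {("i", "j")}}] * 8
    + [{"genes": {"i", "j"}, "edges": set()}] * 2
    + [{"genes": {"k", "l"}, "edges": {("k", "l")}}] * 10
)

print(edge_prior(pathways, "i", "j"))
```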

Prior knowledge from KEGG Raf network

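Prior distribution over networks. Deviation between the network M and the prior knowledge B: $E(M) = \sum_{i,j} |B_{i,j} - M_{i,j}|$, with prior knowledge $B_{i,j} \in [0,1]$ and graph $M_{i,j} \in \{0,1\}$. Prior: $P(M|\beta) = e^{-\beta E(M)} / Z(\beta)$, with hyperparameter $\beta$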

Hyperparameter β trades off data versus prior knowledge KEGG pathway Microarray data β

Hyperparameter β trades off data versus prior knowledge KEGG pathway Microarray data β small

Hyperparameter β trades off data versus prior knowledge KEGG pathway Microarray data β large
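How β shifts weight between data and prior knowledge can be illustrated numerically. The sketch assumes the deviation is the sum of absolute differences between the prior matrix B and the adjacency matrix M, and that the prior is proportional to exp(-β · deviation); the matrices and the names `energy` and `unnormalised_prior` are made up for the example.

```python
from math import exp


def energy(M, B):
    """Deviation between adjacency matrix M (0/1) and prior knowledge B (in [0, 1])."""
    n = len(M)
    return sum(abs(B[i][j] - M[i][j]) for i in range(n) for j in range(n))


def unnormalised_prior(M, B, beta):
    """Prior weight of network M, up to the normalising constant."""
    return exp(-beta * energy(M, B))


# Hypothetical prior knowledge for 2 genes: edge 0 -> 1 is likely, 1 -> 0 is not.
B = [[0.5, 0.8],
     [0.1, 0.5]]
M_agree = [[0, 1], [0, 0]]      # contains the edge favoured by B
M_disagree = [[0, 0], [1, 0]]   # contains the edge disfavoured by B

for beta in (0.0, 1.0, 5.0):
    ratio = unnormalised_prior(M_agree, B, beta) / unnormalised_prior(M_disagree, B, beta)
    print(f"beta={beta}: prior odds agree/disagree = {ratio:.2f}")
```

At β = 0 the prior is flat and only the data decide; as β grows, networks that contradict the prior knowledge are increasingly penalised.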

Sample networks and hyperparameters from the posterior distribution

Revision Prior distribution Marginal likelihood Integral analytically tractable for Bayesian networks

Application to the Raf pathway: Flow cytometry data and KEGG

ROC curve FP TP Four different evaluation criteria DGE UGE TP for fixed FP Area under the curve (AUC)

β

A non-homogeneous Bayesian network for non-stationary processes

Example: 4 genes, 10 time points

       t1    t2    t3    t4    t5    t6    t7    t8    t9    t10
X(1)   X1,1  X1,2  X1,3  X1,4  X1,5  X1,6  X1,7  X1,8  X1,9  X1,10
X(2)   X2,1  X2,2  X2,3  X2,4  X2,5  X2,6  X2,7  X2,8  X2,9  X2,10
X(3)   X3,1  X3,2  X3,3  X3,4  X3,5  X3,6  X3,7  X3,8  X3,9  X3,10
X(4)   X4,1  X4,2  X4,3  X4,4  X4,5  X4,6  X4,7  X4,8  X4,9  X4,10

Standard dynamic Bayesian network: homogeneous model [same 4 x 10 data matrix]

Our new model: heterogeneous dynamic Bayesian network. Here: 2 components [same 4 x 10 data matrix]

Our new model: heterogeneous dynamic Bayesian network. Here: 3 components [same 4 x 10 data matrix]

We have to learn from the data: Number of different components Allocation of time points

Two MCMC strategies q k h Number of components (here: 3) Allocation vector
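An allocation vector simply records which component each time point belongs to. The sketch below builds one from a set of changepoints, which is one possible way to define the allocation and is an assumption here; the slides do not spell out the two MCMC strategies. The function name and the changepoint positions are illustrative.

```python
def allocation_from_changepoints(n_timepoints, changepoints):
    """Assign each time point 1..n to a component; a changepoint starts a new one."""
    allocation = []
    component = 1
    for t in range(1, n_timepoints + 1):
        if t in changepoints:
            component += 1
        allocation.append(component)
    return allocation


# Hypothetical example: 10 time points with new components starting at t5 and t8.
allocation = allocation_from_changepoints(10, {5, 8})
n_components = max(allocation)

print(allocation)     # [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
print(n_components)   # 3
```

Both quantities, the number of components and the allocation itself, are unknown in practice and have to be sampled along with the network structure.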

Synthetic study: posterior probability of the number of components

Circadian clock in Arabidopsis thaliana Collaboration with the Institute of Molecular Plant Sciences (Andrew Millar) Focus on 9 circadian genes. 2 time series T20 and T28 of microarray gene expression data from Arabidopsis thaliana. Plants entrained with different light:dark cycles 10h:10h (T20) and 14h:14h (T28)

Macrophage, cytomegalovirus (CMV), interferon gamma (IFNγ). Infection: CMV. Treatment: IFNγ. Collaboration with DPM

Macrophage, IFNγ: 12 hour time course measuring total RNA, 30 min sampling, 24 samples per group. Agilent arrays. Time series statistical analysis (using EDGE), clustering analysis. Groups: infection with CMV; pre-treatment with IFNγ; IFNγ + CMV

Posterior probability of the number of components

IRF1, IRF2, IRF3. Literature → “known” interactions between three cytokines: IRF1, IRF2 and IRF3. Evaluation: Average marginal posterior probabilities of the edges versus non-edges

Sample of high-scoring networks

IRF1, IRF2, IRF3. Gold standard known → posterior probabilities of true interactions

AUROC scores: new model, BGe, BDe

Circadian rhythms in Arabidopsis thaliana
Collaboration with the Institute of Molecular Plant Sciences at Edinburgh University
2 time series, T20 and T28, of microarray gene expression data from Arabidopsis thaliana.
- Focus on 9 circadian genes: LHY, CCA1, TOC1, ELF4, ELF3, GI, PRR9, PRR5, and PRR3
- Both time series measured under constant light condition at 13 time points: 0h, 2h, …, 24h, 26h
- Plants entrained with different light:dark cycles: 10h:10h (T20) and 14h:14h (T28)

Gene expression time series plots (Arabidopsis data T 20 and T 28 ) T 28 T 20

Posterior probability of the number of components

Predicted network. Blue: activation. Red: inhibition. Black: mixture. Three different line widths: thin = PP>0.5; medium = PP> ; fat = PP>0.9

Current work

Standard dynamic Bayesian network: homogeneous model [Data matrix: 4 genes X(1) to X(4) by 10 time points t1 to t10]

Heterogeneous dynamic Bayesian network [same 4 x 10 data matrix]

Heterogeneous dynamic Bayesian network with node-specific breakpoints [same 4 x 10 data matrix]

Evaluation on synthetic data. [Figure: X with Y(1), Y(2), Y(3); f: three phase-shifted sinusoids] AUROC for BGe and for the heterogeneous BNet without/with node-specific breakpoints

Network obtained for merged data. Four time series for A. thaliana under different experimental conditions (KAY, KDE, T20, T28). Blue: activation. Red: inhibition. Black: mixture. Three different line widths: thin = PP>0.5; medium = PP> ; fat = PP>0.9

[Figure panels: KAY_LL, KDE_LL, T20, T28]

data Monolithic Separate Propose a compromise between the two

[Figure: shared network M* coupled to networks M_1, M_2, ..., M_I, each with its own data set D_1, D_2, ..., D_I] Compromise between the two previous ways of combining the data

Original work with Adriano: Poor convergence and mixing because the coupling effects were too strong. Marco’s current work: Improve convergence and mixing by weakening the coupling.

Mean absolute deviation of edge posterior probabilities (independent BN inference) [Matrix over the data sets KAY, KDE, T20, T28]

Mean absolute deviation of edge posterior probabilities (coupled BN inference) [Matrix over the data sets KAY, KDE, T20, T28]

Mean absolute deviation of edge posterior (independent BN - coupled BN) [Matrix over the data sets KAY, KDE, T20, T28]

Summary Differential equation models Bayesian networks Comparative evaluation Integration of biological prior knowledge A non-homogeneous Bayesian network for non-stationary processes Current work

Adriano Werhli, Marco Grzegorczyk

Thank you! Any questions?