Mechanistic models and machine learning methods for TIMET
Dirk Husmeier
Protein signalling pathway. (Figure: receptor molecules at the cell membrane, phosphorylated proteins, and activating and inhibiting interactions in the signalling pathway. From Sachs et al., Science 2005.)
Can we learn the signalling pathway from data? (Same figure: receptor molecules, cell membrane, phosphorylated proteins, activating and inhibiting interactions. From Sachs et al., Science 2005.)
High-throughput experiments: the network is unknown; high-throughput experiments yield postgenomic data, which are analysed with machine learning and statistics.
Methodology: work packages, mechanistic models and machine learning methods.
WP 1.7: Re-calibrate the circadian clock model for mature plants growing without exogenous sugars.
WP 2.4: Bi-directional regulation: mechanistic modelling of each metabolic pathway, with connections to the clock.
WP 2.5: Bi-directional regulation: testing predictions of the bidirectional models.
Methodology
- Mechanistic models
- Bayesian networks
- Integration of biological prior knowledge
- Non-homogeneous Bayesian networks for non-stationary processes
Regulatory network
Elementary molecular biological processes
Description with differential equations: the concentrations change according to rate equations, d[x]/dt = f([x]; q), governed by the kinetic parameters q.
Parameters q known: Numerically integrate the differential equations for different hypothetical networks
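A minimal sketch of this step, under toy assumptions: a hypothetical two-gene network (not one of the TIMET clock models), with illustrative kinetic parameters q, integrated numerically with SciPy.

```python
# Sketch: numerically integrate the ODEs of a hypothetical two-gene network
# for a given kinetic parameter vector q.  All rate constants are assumptions.
import numpy as np
from scipy.integrate import odeint

def network_rhs(x, t, q):
    """Hypothetical network: gene 1 activates gene 2; both degrade linearly."""
    k_syn1, k_act, k_deg1, k_deg2 = q
    x1, x2 = x
    dx1 = k_syn1 - k_deg1 * x1                    # constitutive synthesis and degradation
    dx2 = k_act * x1 / (1.0 + x1) - k_deg2 * x2   # saturating activation by gene 1
    return [dx1, dx2]

q = [1.0, 2.0, 0.5, 0.3]                  # assumed kinetic parameters
t = np.linspace(0, 24, 13)                # e.g. a time course sampled every 2h
x0 = [0.1, 0.0]                           # initial concentrations
trajectory = odeint(network_rhs, x0, t, args=(q,))
print(trajectory.shape)                   # (13, 2): predicted concentration time series
```

Repeating this integration for each hypothetical network gives the predicted time series that are compared with the measured data on the following slides.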
Experiment: gene expression time series. Can we infer the correct gene regulatory network?
Model selection for known parameters q: compare the gene expression time series predicted with the different models against the measured gene expression time series; the model with the highest likelihood is the best model.
Model selection for unknown parameters q: compare the predicted and measured gene expression time series via the joint maximum likelihood over the parameters q and the model.
1) Practical problem: numerical optimization.
2) Conceptual problem: overfitting; the maximized likelihood increases as the network complexity increases.
Regularization, e.g. BIC: the score combines a data misfit term, evaluated at the maximum likelihood parameters, with a regularization term that grows with the number of parameters and depends on the number of data points.
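For concreteness, a standard form of the BIC and its reading as a large-sample (Laplace) approximation of the log marginal likelihood, with \hat{q} the maximum likelihood parameters, k the number of parameters and N the number of data points:

```latex
\mathrm{BIC} \;=\; \underbrace{-2\log P(D \mid \hat{q}, M)}_{\text{data misfit}}
\;+\; \underbrace{k \log N}_{\text{regularization}},
\qquad
\log P(D \mid M) \;\approx\; \log P(D \mid \hat{q}, M) - \tfrac{k}{2}\log N + O(1).
```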
Model selection: find the best pathway. Select the model with the highest posterior probability P(M | D). This requires an integration over the whole parameter space, P(D | M) = ∫ P(D | q, M) P(q | M) dq, and this integral is usually analytically intractable.
Complexity problem: the integration over the whole parameter space of the kinetic parameters q has to be approximated numerically, and this numerical approximation is highly non-trivial.
Illustration of annealed importance sampling: annealing from the prior distribution to the posterior distribution. (Figure taken from the MSc thesis by Ben Calderhead.)
Outer loop: annealing scheme. Centre loop: MCMC. Inner loop: numerical solution of the differential equations.
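A minimal sketch of this nested scheme under toy assumptions: the likelihood below is a stand-in Gaussian (in the real setting the inner loop would solve the differential equations for each proposed parameter vector), and the annealing schedule, proposal width and run lengths are illustrative.

```python
# Annealed importance sampling (AIS) sketch for estimating a marginal likelihood.
# Toy one-dimensional parameter q; the likelihood is a Gaussian placeholder.
import numpy as np

rng = np.random.default_rng(0)

def log_prior(q):                        # prior: q ~ N(0, 1)
    return -0.5 * q**2 - 0.5 * np.log(2 * np.pi)

def log_lik(q):                          # placeholder for the ODE-based likelihood
    return -0.5 * (q - 2.0)**2 / 0.5**2 - np.log(0.5 * np.sqrt(2 * np.pi))

betas = np.linspace(0.0, 1.0, 50)        # outer loop: annealing scheme
n_runs, n_mcmc = 200, 5
log_weights = np.zeros(n_runs)

for r in range(n_runs):
    q = rng.normal()                     # start from the prior
    for j in range(1, len(betas)):
        log_weights[r] += (betas[j] - betas[j - 1]) * log_lik(q)
        for _ in range(n_mcmc):          # centre loop: MCMC at inverse temperature beta_j
            q_new = q + 0.5 * rng.normal()
            log_alpha = (log_prior(q_new) + betas[j] * log_lik(q_new)
                         - log_prior(q) - betas[j] * log_lik(q))
            if np.log(rng.uniform()) < log_alpha:
                q = q_new

# Estimate of log P(D|M): log of the average importance weight.
log_Z = np.logaddexp.reduce(log_weights) - np.log(n_runs)
print(log_Z)
```

In this toy case the exact answer is available for checking (the evidence of a Gaussian prior and Gaussian likelihood); in the ODE setting every likelihood evaluation requires a full numerical integration, which is what makes the scheme so expensive.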
Marginal likelihoods for the alternative pathways: computationally expensive, so network reconstruction ab initio is infeasible.
NIPS 2008
Objective: reconstruction of regulatory networks ab initio. Higher level of abstraction: Bayesian networks.
Machine learning methods
- Bayesian networks (overview)
- Integration of biological prior knowledge
- Non-homogeneous Bayesian networks for non-stationary processes
- Circadian gene regulatory network in Arabidopsis thaliana
- Current work
Marriage between graph theory and probability theory (Friedman et al. (2000), J. Comp. Biol. 7, 601-620).
Bayes net versus ODE model.
Bayesian networks
- Marriage between graph theory and probability theory.
- Directed acyclic graph (DAG) representing conditional independence relations.
- It is possible to score a network in light of the data: P(D|M), with D the data and M the network structure.
- We can infer how well a particular network explains the observed data.
(Figure: example DAG with nodes A-F.)
Linear model: [A] = w1[P1] + w2[P2] + w3[P3] + w4[P4] + noise, where P1, ..., P4 are the parents of node A and w1, ..., w4 their weights.
Nonlinear discretized model: P1 acts as an activator and P2 as a repressor of P (activation and inhibition). Allowing for noise leads to probabilities: a conditional multinomial distribution.
Model and parameters q: with a conjugate prior on the parameters, the integral over the parameter space is analytically tractable, giving a closed-form marginal likelihood P(D|M).
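As an illustration, a sketch of such a closed-form score for a single node (an assumption for concreteness: a BDe-type multinomial likelihood with a Dirichlet prior, evaluated on toy discretised data):

```python
# Closed-form marginal likelihood of one node given its parents in a discrete
# Bayesian network (multinomial likelihood, Dirichlet prior with pseudo-count alpha).
import numpy as np
from itertools import product
from scipy.special import gammaln

def log_marginal_node(data, node, parents, n_states=3, alpha=1.0):
    """log P(column `node` | columns `parents`), integrating out the parameters."""
    log_ml = 0.0
    for parent_config in product(range(n_states), repeat=len(parents)):
        mask = (np.all(data[:, parents] == parent_config, axis=1)
                if parents else np.ones(len(data), dtype=bool))
        counts = np.bincount(data[mask, node], minlength=n_states)
        # Dirichlet-multinomial: Gamma(A)/Gamma(A+N) * prod_k Gamma(a_k+N_k)/Gamma(a_k)
        log_ml += (gammaln(alpha * n_states) - gammaln(alpha * n_states + counts.sum())
                   + np.sum(gammaln(alpha + counts) - gammaln(alpha)))
    return log_ml

rng = np.random.default_rng(1)
data = rng.integers(0, 3, size=(100, 3))      # 100 samples of 3 discretised genes
print(log_marginal_node(data, node=0, parents=[1, 2]))
```

Summing such node scores over all nodes gives the network score P(D|M) used on the following slides.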
Example: 2 genes, 16 different network structures. Best network: the one with the maximum score.
Identify the best network structure. Ideal scenario: large data sets, low noise.
Uncertainty about the best network structure: limited number of experimental replications, high noise.
Sample of high-scoring networks. Feature extraction, e.g. marginal posterior probabilities of the edges, quantifies the uncertainty about the edges: high-confidence edges versus high-confidence non-edges.
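A small sketch of this feature extraction (illustrative data; structures represented as adjacency matrices sampled from the posterior):

```python
# Estimate marginal posterior edge probabilities by averaging over a sample
# of high-scoring network structures (adjacency matrices).
import numpy as np

def edge_posteriors(sampled_adjacencies):
    """Fraction of sampled structures in which each directed edge is present."""
    return np.mean(np.asarray(sampled_adjacencies, dtype=float), axis=0)

# Toy sample of three 3-node structures (entry [i, j] = 1 means an edge i -> j).
sample = [np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]]),
          np.array([[0, 1, 0], [0, 0, 0], [0, 1, 0]]),
          np.array([[0, 1, 1], [0, 0, 1], [0, 0, 0]])]
P = edge_posteriors(sample)
print(P)          # P[0, 1] = 1.0: a high-confidence edge; P[2, 0] = 0.0: a non-edge
```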
Can we generalize this scheme to more than 2 genes? In principle yes. However …
The number of possible network structures grows super-exponentially with the number of nodes.
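For reference, a known combinatorial result behind this claim (not shown explicitly on the slide): the number of DAGs on n labelled nodes, a_n, follows Robinson's recursion and already reaches 1, 3, 25, 543, 29281, ... for n = 1, ..., 5:

```latex
a_n \;=\; \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k}\, 2^{k(n-k)}\, a_{n-k},
\qquad a_0 = 1 .
```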
Sampling from the posterior distribution: find the high-scoring structures in the configuration space of network structures.
MCMC: propose a local change of the current structure; if the score increases, accept the move; otherwise accept it with a probability given by the Metropolis-Hastings acceptance ratio (a random walk in the configuration space of network structures).
Madigan & York (1995), Giudici & Castelo (2003).
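A minimal sketch of such structure MCMC with single-edge moves; the scoring function below is a toy placeholder (an assumption standing in for the closed-form marginal likelihood), and the acceptance rule assumes a symmetric proposal, ignoring the usual neighbourhood-size correction:

```python
# Structure MCMC over directed acyclic graphs with local (single-edge) moves.
import numpy as np

rng = np.random.default_rng(0)
n = 5                                            # number of nodes

def log_score(adj):
    """Placeholder network score; replace with log P(D|M) + log P(M)."""
    return -abs(adj.sum() - 4.0)                 # toy score favouring ~4 edges

def is_acyclic(adj):
    """Check acyclicity by repeatedly removing parentless nodes."""
    a = adj.copy()
    while a.size:
        roots = np.where(a.sum(axis=0) == 0)[0]
        if len(roots) == 0:
            return False                         # a cycle remains
        a = np.delete(np.delete(a, roots, axis=0), roots, axis=1)
    return True

adj = np.zeros((n, n), dtype=int)                # start from the empty graph
samples = []
for it in range(5000):
    i, j = rng.choice(n, size=2, replace=False)
    proposal = adj.copy()
    proposal[i, j] = 1 - proposal[i, j]          # local move: add or delete edge i -> j
    if is_acyclic(proposal):
        log_alpha = log_score(proposal) - log_score(adj)
        if np.log(rng.uniform()) < log_alpha:
            adj = proposal                       # accept; otherwise keep the current graph
    samples.append(adj.copy())                   # collected sample of structures
```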
Problem: local changes mean small steps, hence slow convergence, and it is difficult to cross valleys in the configuration space of network structures.
Problem: global changes mean large steps with a low acceptance probability, hence again slow convergence.
Can we make global changes that jump onto other peaks of the configuration space and are still likely to be accepted?
Machine learning methods: Bayesian networks (overview); integration of biological prior knowledge; non-homogeneous Bayesian networks for non-stationary processes; circadian gene regulatory network in Arabidopsis thaliana; current work.
Bayesian inference: select the model based on the posterior probability P(M | D) ∝ P(D | M) P(M). This requires an integration over the whole parameter space: P(D | M) = ∫ P(D | q, M) P(q | M) dq.
Uncertainty about the best network structure: limited number of experimental replications, high noise.
Reduced uncertainty by using prior knowledge: combine the data with prior knowledge.
Bayesian analysis: integration of prior knowledge. The hyperparameter β trades off the data (microarray data) against the prior knowledge (KEGG pathway).
For small β, the microarray data dominate; for large β, the KEGG pathway prior dominates.
Input: data and prior knowledge (KEGG). Learn: network structure and hyperparameter β, via MCMC.
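A sketch of how such a prior is typically formulated (assumed here for illustration, following the standard energy-based construction rather than a formula shown on the slide): the mismatch between a structure M and the prior network derived from KEGG is measured by an energy E(M), and both the structure and β are sampled:

```latex
P(M \mid \beta) \;=\; \frac{e^{-\beta E(M)}}{\sum_{M'} e^{-\beta E(M')}},
\qquad
P(M, \beta \mid D) \;\propto\; P(D \mid M)\, P(M \mid \beta)\, P(\beta).
```

For β = 0 the prior over structures is flat (the data dominate); for large β only structures close to the KEGG pathway retain appreciable prior mass.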
Raf signalling pathway. (Figure: receptor molecules at the cell membrane, phosphorylated proteins, and activating and inhibiting interactions in the signalling pathway. From Sachs et al., Science 2005.)
Flow cytometry data
- Intracellular multicolour flow cytometry experiments: concentrations of 11 proteins.
- 5400 cells have been measured under 9 different cellular conditions (cues).
- Downsampling to 100 instances (5 separate subsets): indicative of microarray experiments.
Prior knowledge from KEGG: a matrix of prior edge confidences (entries such as 0.87, 1, 0.71, 0.5, 0.25). Data: protein concentrations from flow cytometry experiments.
Protein signalling network from the literature
Predicted network: 11 nodes; the gold-standard network has 20 edges and 90 non-edges. Of the 20 top-scoring predicted edges, 15/20 are correct (75%); only 5/90 non-edges are falsely predicted as edges, i.e. 94% of the non-edges are recovered correctly.
Machine learning methods: Bayesian networks (overview); integration of biological prior knowledge; non-homogeneous Bayesian networks for non-stationary processes; circadian gene regulatory network in Arabidopsis thaliana; current work.
Dynamic Bayesian network
Example: 4 genes, 10 time points
Standard dynamic Bayesian network: homogeneous model. (Figure: the four genes X(1), ..., X(4) unrolled over the 10 time points, X_{i,1}, ..., X_{i,10}.)
Our new model: heterogeneous dynamic Bayesian network. Here: 2 components. (Figure: the unrolled network with the time points t1, ..., t10 partitioned into two segments.)
Our new model: heterogeneous dynamic Bayesian network. Here: 3 components. (Figure: the same network with the time points partitioned into three segments.)
Learning with MCMC: sample the network parameters q, the allocation vector h, and the number of components k (here: 3).
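An illustrative sketch of the allocation-vector idea (toy data and a fixed segmentation as assumptions; in the MCMC scheme h, k and the parameters q are all sampled):

```python
# Each time point is assigned to one of k components by the allocation vector h;
# the model parameters are then estimated separately for each component.
import numpy as np

rng = np.random.default_rng(2)
T, k = 10, 3
x = np.concatenate([rng.normal(0, 1, 4), rng.normal(3, 1, 3), rng.normal(-2, 1, 3)])

# Allocation vector h: h[t] = component responsible for time point t.
h = np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3])

for component in range(1, k + 1):
    segment = x[h == component]
    print(f"component {component}: n = {len(segment)}, mean = {segment.mean():.2f}")
```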
Morphogenesis in Drosophila melanogaster
- Gene expression measurements of 4028 genes over 66 time steps (Arbeitman et al., Science, 2002).
- Selection of 11 genes involved in muscle development (Zhao et al. (2006), Bioinformatics 22).
Heterogeneous dynamic Bayesian network: plausible segmentation? (Figure: the unrolled network over genes and time points.)
Number of components: compare with the four stages of the Drosophila life cycle: embryo, larva, pupa, adult.
(Figure: inferred segmentation along the time axis.) Morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult. The gene expression program governing the transition to adult morphology is active well before the fly emerges from the pupa.
Machine learning methods: Bayesian networks (overview); integration of biological prior knowledge; non-homogeneous Bayesian networks for non-stationary processes; circadian gene regulatory network in Arabidopsis thaliana; current work.
Circadian rhythms in Arabidopsis thaliana
Collaboration with the Institute of Molecular Plant Sciences at Edinburgh University (Andrew Millar's group).
- 2 time series, T20 and T28, of microarray gene expression data from Arabidopsis thaliana.
- Focus on 9 circadian genes: LHY, CCA1, TOC1, ELF4, ELF3, GI, PRR9, PRR5, and PRR3.
- Both time series measured under constant light at 13 time points: 0h, 2h, ..., 24h, 26h.
- Plants entrained with different light:dark cycles: 10h:10h (T20) and 14h:14h (T28).
Gene expression time series plots (Arabidopsis data T20 and T28)
Predicted network (figure legend). Edge colours: blue = activation, red = inhibition, black = mixture. Three different line widths: thin = PP > 0.5, medium = PP > 0.75, fat = PP > 0.9.
Cogs of the plant clockwork (review: Rob McClung, Plant Cell 2006). Two major gene classes: morning genes (e.g. LHY, CCA1) repress evening genes (e.g. TOC1, ELF3, ELF4, GI, LUX), which in turn activate LHY and CCA1.
Literature vs. inferred network. We expect direct inhibition of several evening genes by LHY/CCA1, shown together here for clarity. Interestingly, some of these edges were learned, but all arose from CCA1, not LHY; instead, the BGM highlights a sequence of positive or mixed links: LHY-CCA1-PRR9-GI-TOC1/ELF3/PRR5-ELF4. (Figure: inferred network over LHY, CCA1, TOC1, ELF3, ELF4, GI, PRR9, PRR5, PRR3, with false negatives and false positives marked.)
True positives (TP) = 8, false positives (FP) = 13, false negatives (FN) = 5, true negatives (TN) = 9² - 8 - 13 - 5 = 55.
Sensitivity = TP/(TP+FN) = 8/13 = 62%; specificity = TN/(TN+FP) = 55/68 = 81%.
Overview of the plant clock model (Locke et al., Mol. Syst. Biol. 2006; figure: morning loop with LHY/CCA1 and PRR9/PRR7, evening loop with TOC1, Y (GI), the unknown component X, and ZTL). The unknown component X allows a delay of more than 8h between TOC1 and LHY/CCA1 expression. This is the model as published; note that some genes are merged for parsimony. There is data for each link except the TOC1 inhibition of GI, and X is an unknown TOC1-dependent component that activates LHY and CCA1. Given the long delay, it is unlikely the model would learn the X link, but there is a BGM link from ELF3 to CCA1 that is exactly as expected for X.
Machine learning methods: Bayesian networks (overview); integration of biological prior knowledge; non-homogeneous Bayesian networks for non-stationary processes; circadian gene regulatory network in Arabidopsis thaliana; current work.
Flexible network structure with regularization Joint work with Sophie Lèbre and Frank Dondelinger
Drosophila melanogaster: expression of 11 muscle development genes over 66 time points. Fixed structure, flexible parameters. (Figure: inferred segmentation along the time axis.) Morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult. The gene expression program governing the transition to adult morphology is active well before the fly emerges from the pupa.
Transition probabilities: flexible structure with regularization. Morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult. (A sketch of one possible regularization is given below.)
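One common way such regularization can be formulated (an assumption for illustration, not necessarily the exact prior used in this work): the network structures of adjacent segments h and h+1 are coupled by penalizing the number of edges in which they differ, controlled by a hyperparameter β:

```latex
P(M_{h+1} \mid M_h, \beta) \;\propto\; \exp\!\bigl(-\beta \,\lvert M_{h+1} \ominus M_h \rvert\bigr),
```

where |M_{h+1} ⊖ M_h| is the number of edge changes between the two structures; large β keeps the structure nearly fixed across segments, small β lets it vary freely.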
Comparison of the method of Ahmed & Xing with that of Dondelinger, Lèbre & Husmeier.
Summary
- Mechanistic models
- Bayesian networks
- Integration of biological prior knowledge
- Non-homogeneous Bayesian networks for non-stationary processes
Any questions? Thank you!