Published by Patricia Daniels. Modified over 8 years ago.
1
Mechanistic models and machine learning methods for TIMET
Dirk Husmeier
2
Protein signalling pathway
[Figure: signalling pathway diagram. Labels: receptor molecules, cell membrane, phosphorylated protein, interaction in signalling pathway. Legend: activation, inhibition. From Sachs et al., Science 2005.]
3
Can we learn the signalling pathway from data?
[Figure: same signalling pathway diagram as on the previous slide. From Sachs et al., Science 2005.]
4
High-throughput experiments
[Figure: flowchart. Labels: network unknown; high-throughput experiments; postgenomic data; machine learning; statistics.]
5
Methodology Workpackages Mechanistic models Machine learning methods
WP 1.7: Re-calibrate the circadian clock model for mature plants growing without exogenous sugars.
WP 2.4: Bi-directional regulation: Mechanistic modelling of each metabolic pathway, with connections to the clock.
WP 2.5: Bi-directional regulation: Testing predictions of bidirectional models.
6
Methodology Mechanistic models Bayesian networks
Integration of biological prior knowledge Non-homogeneous Bayesian network for non-stationary processes
7
Regulatory network
8
Elementary molecular biological processes
9
Description with differential equations
10
Description with differential equations
11
Concentrations Rates Kinetic parameters q
12
Description with differential equations
Concentrations Kinetic parameters q Rates
13
Parameters q known: Numerically integrate the differential equations for different hypothetical networks
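A minimal sketch of this step, under stated assumptions: the two candidate networks, the Hill-type activation term, and all parameter names in `q` are hypothetical illustrations, not the model used in the talk. Each candidate network is integrated with a classical fourth-order Runge-Kutta scheme.

```python
def simulate(network, q, x0, t_end=50.0, dt=0.01):
    """Integrate a toy two-gene model [A, B] with classical RK4.

    network: hypothetical label, "A_activates_B" or "independent".
    q: dict of kinetic parameters (illustrative names, not from the slides).
    Returns the concentrations [A, B] at time t_end.
    """
    def rates(x):
        a, b = x
        da = q["k_a"] - q["d_a"] * a  # constant production, linear decay
        if network == "A_activates_B":
            # Hill-type activation of B by A (an assumed functional form)
            production_b = q["k_b"] * a ** 2 / (q["K"] ** 2 + a ** 2)
        else:
            production_b = q["k_b"]
        db = production_b - q["d_b"] * b
        return [da, db]

    x = list(x0)
    for _ in range(int(round(t_end / dt))):
        s1 = rates(x)
        s2 = rates([xi + 0.5 * dt * si for xi, si in zip(x, s1)])
        s3 = rates([xi + 0.5 * dt * si for xi, si in zip(x, s2)])
        s4 = rates([xi + dt * si for xi, si in zip(x, s3)])
        x = [xi + dt / 6.0 * (u1 + 2 * u2 + 2 * u3 + u4)
             for xi, u1, u2, u3, u4 in zip(x, s1, s2, s3, s4)]
    return x


q = {"k_a": 1.0, "d_a": 0.5, "k_b": 2.0, "d_b": 1.0, "K": 1.0}
final_act = simulate("A_activates_B", q, x0=[0.0, 0.0])
final_ind = simulate("independent", q, x0=[0.0, 0.0])
```

The two hypothetical networks settle at different concentrations of B, which is what makes them distinguishable from expression data.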
14
Experiment: Gene expression time series
Can we infer the correct gene regulatory network?
15
Model selection for known parameters q
Compare the gene expression time series predicted with different models against the measured gene expression time series.
Highest likelihood: best model
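The comparison can be sketched as follows, assuming independent Gaussian measurement noise with a known standard deviation `sigma`. The noise model, the model names, and the numbers are illustrative assumptions.

```python
import math


def gaussian_log_likelihood(measured, predicted, sigma):
    """Log-likelihood of the measurements given one model's predicted series."""
    n = len(measured)
    squared_error = sum((m - p) ** 2 for m, p in zip(measured, predicted))
    return (-0.5 * n * math.log(2.0 * math.pi * sigma ** 2)
            - squared_error / (2.0 * sigma ** 2))


def best_model(measured, predictions, sigma=0.1):
    """Return the name of the highest-likelihood model and all scores."""
    scores = {name: gaussian_log_likelihood(measured, series, sigma)
              for name, series in predictions.items()}
    return max(scores, key=scores.get), scores


# Made-up data: model_1 tracks the measurements more closely than model_2.
measured = [0.0, 0.5, 0.9, 1.2]
predictions = {
    "model_1": [0.0, 0.4, 0.8, 1.2],
    "model_2": [0.0, 0.1, 0.2, 0.3],
}
winner, scores = best_model(measured, predictions)
```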
16
Model selection for unknown parameters q
Compare the gene expression time series predicted with different models against the measured gene expression time series.
Joint maximum likelihood over the models and their parameters q.
17
1) Practical problem: numerical optimization over the parameters q
2) Conceptual problem: overfitting. The maximum likelihood score increases with increasing network complexity.
18
Regularization, e.g. BIC
- Data misfit term: evaluated at the maximum likelihood parameters
- Regularization term: depends on the number of parameters and the number of data points
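A small sketch of the BIC trade-off. The slide's exact formula is not recoverable, so this uses the common log-likelihood convention (higher is better), which is an assumption; the example numbers are made up.

```python
import math


def bic_score(max_log_likelihood, n_params, n_data):
    """BIC in log-likelihood form: data fit minus a complexity penalty.

    Assumed convention: log L(q_ML) - (k / 2) * log(N), higher is better.
    """
    return max_log_likelihood - 0.5 * n_params * math.log(n_data)


# Made-up example: the complex network fits slightly better but uses more parameters.
simple = bic_score(max_log_likelihood=-12.0, n_params=3, n_data=20)
complex_ = bic_score(max_log_likelihood=-11.5, n_params=8, n_data=20)
```

Here the small gain in fit of the complex network does not pay for its five extra parameters, so the simple network scores higher.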
19
Model selection: find the best pathway
Select the model with the highest posterior probability. This requires an integration over the whole parameter space.
20
Model selection: find the best pathway
Select the model with the highest posterior probability. This requires an integration over the whole parameter space. This integral is usually analytically intractable.
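In standard Bayesian notation, the two statements on this slide would read as follows. The symbols M (model), D (data) and q (parameters) follow the rest of the deck; the exact form shown on the original slide is an assumption.

```latex
P(M \mid D) = \frac{P(D \mid M)\, P(M)}{P(D)} \;\propto\; P(D \mid M)\, P(M),
\qquad
P(D \mid M) = \int P(D \mid q, M)\, P(q \mid M)\, \mathrm{d}q .
```

The integral on the right is the marginal likelihood; it averages the likelihood over the prior on q rather than maximizing it.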
21
Complexity problem: this requires an integration over the whole parameter space of q.
The numerical approximation is highly non-trivial.
23
Illustration of annealed importance sampling
[Figure: annealing path between the prior distribution and the posterior distribution. Taken from the MSc thesis by Ben Calderhead.]
24
Outer loop: Annealing scheme
Centre loop: MCMC
Inner loop: Numerical solution of differential equations
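The loop structure can be sketched on a toy problem. This is an assumption-laden illustration, not the implementation from the talk: the prior is a standard normal, the likelihood is a unit-variance Gaussian around the parameter, and the inner loop (the ODE solve) is replaced by a closed-form likelihood so that the estimate can be checked against the exact marginal likelihood.

```python
import math
import random


def log_prior(theta):
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * theta ** 2


def log_lik(theta, y):
    # Stand-in for the inner loop: in the talk this would need an ODE solution.
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (y - theta) ** 2


def ais_marginal_likelihood(y, n_particles=500, n_temps=20, step=1.0, seed=0):
    rng = random.Random(seed)
    betas = [t / n_temps for t in range(n_temps + 1)]  # 0 = prior, 1 = posterior
    weights = []
    for _ in range(n_particles):
        theta = rng.gauss(0.0, 1.0)  # start from a draw from the prior
        log_w = 0.0
        # Outer loop: annealing scheme over inverse temperatures
        for b_prev, b in zip(betas[:-1], betas[1:]):
            log_w += (b - b_prev) * log_lik(theta, y)
            # Centre loop: one Metropolis step targeting prior * likelihood^b
            proposal = theta + rng.gauss(0.0, step)
            log_ratio = ((log_prior(proposal) + b * log_lik(proposal, y))
                         - (log_prior(theta) + b * log_lik(theta, y)))
            if rng.random() < math.exp(min(0.0, log_ratio)):
                theta = proposal
        weights.append(math.exp(log_w))
    return sum(weights) / len(weights)


estimate = ais_marginal_likelihood(y=1.0)
# Exact value for this toy model: y is distributed as N(0, 2) marginally.
exact = math.exp(-0.25) / math.sqrt(4.0 * math.pi)
```

With a real mechanistic model every call to `log_lik` would require a numerical ODE solution, which is where the computational cost comes from.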
25
Computationally expensive, network reconstruction ab initio infeasible
Marginal likelihoods for the alternative pathways
26
Outer loop: Annealing scheme
Centre loop: MCMC
Inner loop: Numerical solution of differential equations
27
NIPS 2008
28
Objective: Reconstruction of regulatory networks ab initio
Higher level of abstraction: Bayesian networks
29
Machine learning methods
- Bayesian networks (overview)
- Integration of biological prior knowledge
- Non-homogeneous Bayesian networks for non-stationary processes
- Circadian gene regulatory network in Arabidopsis thaliana
- Current work
30
Marriage between graph theory and probability theory
Friedman et al. (2000), J. Comp. Biol. 7,
31
Bayes net ODE model
32
Bayesian networks
- Marriage between graph theory and probability theory.
- Directed acyclic graph (DAG) representing conditional independence relations.
- It is possible to score a network in light of the data: P(D|M), where D is the data and M is the network structure.
- We can infer how well a particular network explains the observed data.
[Figure: example DAG with nodes A, B, C, D, E, F and directed edges; labels: nodes, edges.]
33
Linear model
[A] = w1[P1] + w2[P2] + w3[P3] + w4[P4] + noise
[Figure: regulators P1, P2, P3, P4 point to A with weights w1, w2, w3, w4.]
34
Nonlinear discretized model
[Figure: node P with two parents, an activator P1 and a repressor P2, shown for activation and for inhibition.]
Allow for noise: probabilities. Conditional multinomial distribution.
35
Integral analytically tractable!
[Formula labels: model; parameters q.]
37
Example: 2 genes 16 different network structures
Best network: maximum score
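One reading that reproduces the count of 16, stated as an assumption: with 2 genes there are 4 possible directed interactions if self-loops are allowed (as in a dynamic network), and each can be present or absent, giving 2^4 = 16 structures. The gene names and the `best_structure` helper are hypothetical.

```python
from itertools import product

genes = ["X1", "X2"]
# All ordered pairs, including self-loops: 4 possible interactions.
possible_edges = [(src, dst) for src in genes for dst in genes]

# Every on/off combination of the 4 possible interactions.
structures = [
    tuple(edge for edge, on in zip(possible_edges, mask) if on)
    for mask in product([0, 1], repeat=len(possible_edges))
]


def best_structure(candidates, score):
    """Exhaustive search: return the candidate with the maximum score."""
    return max(candidates, key=score)
```

Exhaustive scoring like this is only possible because the space is tiny.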
38
Identify the best network structure
Ideal scenario: Large data sets, low noise
39
Uncertainty about the best network structure
Limited number of experimental replications, high noise
40
Sample of high-scoring networks
41
Sample of high-scoring networks
Feature extraction, e.g. marginal posterior probabilities of the edges
42
Sample of high-scoring networks
Feature extraction, e.g. marginal posterior probabilities of the edges
[Figure labels: high-confidence edge; high-confidence non-edge; uncertainty about edges.]
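A minimal sketch of this feature extraction, assuming the sampled networks are available as 0/1 adjacency matrices (`adj[i][j] == 1` means an edge from node i to node j). The sample below is made up.

```python
def edge_posteriors(sampled_networks, n_nodes):
    """Estimate marginal posterior edge probabilities as sample frequencies."""
    counts = [[0] * n_nodes for _ in range(n_nodes)]
    for adj in sampled_networks:
        for i in range(n_nodes):
            for j in range(n_nodes):
                counts[i][j] += adj[i][j]
    n_samples = len(sampled_networks)
    return [[c / n_samples for c in row] for row in counts]


# Made-up sample of four high-scoring two-node networks.
sample = [
    [[0, 1], [0, 0]],
    [[0, 1], [1, 0]],
    [[0, 1], [0, 0]],
    [[0, 1], [1, 0]],
]
probs = edge_posteriors(sample, n_nodes=2)
```

In this toy sample the edge 0 to 1 is a high-confidence edge (frequency 1.0), the self-loops are high-confidence non-edges (0.0), and the edge 1 to 0 is uncertain (0.5).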
43
Can we generalize this scheme to more than 2 genes?
In principle yes. However …
44
[Figure: number of structures versus number of nodes.]
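To illustrate how fast the structure space grows, the number of labelled directed acyclic graphs on n nodes can be computed with Robinson's recurrence. Using this recurrence here is an illustration; the slide itself only shows the growth.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labelled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


counts = [count_dags(n) for n in range(1, 6)]
```

The growth is super-exponential, so exhaustive enumeration stops being an option after a handful of nodes.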
45
Sampling from the posterior distribution
Find the high-scoring structures Configuration space of network structures
46
MCMC with local changes. Two cases: in the first, the proposed structure is accepted; in the second, it is accepted with a given probability.
[Figure: configuration space of network structures.]
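A sketch of one such accept/reject step in the Metropolis form, under loud assumptions: the proposal is symmetric (flip one possible edge), the acyclicity check and the Hastings correction for unequal neighbourhood sizes are omitted, and the score function is a made-up stand-in for the log posterior of a structure.

```python
import math
import random
from collections import Counter


def mh_step(current, log_score, propose, rng):
    """Accept an improvement outright, otherwise with probability score ratio."""
    candidate = propose(current, rng)
    log_ratio = log_score(candidate) - log_score(current)
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return candidate
    return current


def flip_one_edge(state, rng):
    """Local change: toggle one of the possible edges (symmetric proposal)."""
    i = rng.randrange(len(state))
    return state[:i] + (1 - state[i],) + state[i + 1:]


# Hypothetical score: penalise the distance to an arbitrary target structure.
target = (1, 0, 1, 0)


def toy_log_score(state):
    return -2.0 * sum(a != b for a, b in zip(state, target))


rng = random.Random(1)
state = (0, 0, 0, 0)
visits = Counter()
for _ in range(2000):
    state = mh_step(state, toy_log_score, flip_one_edge, rng)
    visits[state] += 1
most_visited = visits.most_common(1)[0][0]
```

The chain spends most of its time on the highest-scoring structure, and the visit frequencies approximate the posterior over structures.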
47
Madigan & York (1995), Giudici & Castelo (2003)
48
Problem: local changes are small steps, so convergence is slow and valleys are difficult to cross.
Configuration space of network structures
49
Problem: global changes are large steps with low acceptance, so convergence is slow.
Configuration space of network structures
50
Can we make global changes that jump onto other peaks and are likely to be accepted?
Configuration space of network structures
52
Machine learning methods
- Bayesian networks (overview)
- Integration of biological prior knowledge
- Non-homogeneous Bayesian network for non-stationary processes
- Circadian gene network in Arabidopsis thaliana
- Current work
53
Bayesian inference
Select the model based on the posterior probability. This requires an integration over the whole parameter space.
54
Uncertainty about the best network structure
Limited number of experimental replications, high noise
55
Reduced uncertainty by using prior knowledge
Data Prior knowledge
56
Bayesian analysis: integration of prior knowledge β
Hyperparameter β trades off data versus prior knowledge Microarray data KEGG pathway
57
Hyperparameter β trades off data versus prior knowledge
β small Microarray data KEGG pathway
58
Hyperparameter β trades off data versus prior knowledge β large
Microarray data KEGG pathway
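One way to picture the role of β, stated as an assumption rather than as the formula from the slides: a Gibbs-type prior over structures, P(M | β) proportional to exp(-β E(M)), where the energy E(M) is the summed absolute difference between the adjacency matrix M and a prior-knowledge matrix B with entries in [0, 1]. The matrix values below are made up, and acyclicity is ignored for simplicity.

```python
import math
from itertools import product


def energy(adj, prior_b):
    """Mismatch between a structure and the prior-knowledge matrix."""
    n = len(adj)
    return sum(abs(prior_b[i][j] - adj[i][j]) for i in range(n) for j in range(n))


def structure_prior(prior_b, beta):
    """Enumerate all edge sets on n nodes and normalise exp(-beta * energy)."""
    n = len(prior_b)
    networks = []
    for bits in product([0, 1], repeat=n * n):
        networks.append([list(bits[r * n:(r + 1) * n]) for r in range(n)])
    unnorm = [math.exp(-beta * energy(adj, prior_b)) for adj in networks]
    z = sum(unnorm)
    return networks, [u / z for u in unnorm]


# Hypothetical prior knowledge: edge 0 -> 1 is likely, edge 1 -> 0 is unlikely.
prior_b = [[0.5, 0.9], [0.1, 0.5]]
nets_flat, probs_flat = structure_prior(prior_b, beta=0.0)
nets_sharp, probs_sharp = structure_prior(prior_b, beta=5.0)
top = nets_sharp[probs_sharp.index(max(probs_sharp))]
```

With β = 0 the prior is flat and only the data matter; as β grows, structures that agree with the prior knowledge dominate.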
59
Input: Learn: MCMC
61
Raf signalling pathway
[Figure: Raf signalling pathway diagram. Labels: receptor molecules, cell membrane, phosphorylated protein, interaction in signalling pathway. Legend: activation, inhibition. From Sachs et al., Science 2005.]
62
Flow cytometry data
- Intracellular multicolour flow cytometry experiments: concentrations of 11 proteins
- 5400 cells have been measured under 9 different cellular conditions (cues)
- Downsampling to 100 instances (5 separate subsets): indicative of microarray experiments
63
Prior knowledge from KEGG
[Figure: prior-knowledge values on the edges of the KEGG pathway: 0.87, 1, 0.71, 1, 0.25, 0.5, 0.5, 0.5, 0.5.]
Data: protein concentrations from flow cytometry experiments
64
Protein signalling network from the literature
65
Predicted network 11 nodes, 20 edges, 90 non-edges
20 top-scoring edges: 75% of the 20 true edges correct; 5/90 non-edges false, i.e. 94% of non-edges correct
66
Machine learning methods
- Bayesian networks (overview)
- Integration of biological prior knowledge
- Non-homogeneous Bayesian network for non-stationary processes
- Circadian gene regulatory network in Arabidopsis thaliana
- Current work
68
Dynamic Bayesian network
69
Example: 4 genes, 10 time points
70
Standard dynamic Bayesian network: homogeneous model
[Figure: grid of variables Xi,t for genes X(1) to X(4) at time points 1 to 10, from X1,1 to X4,10.]
71
Our new model: heterogeneous dynamic Bayesian network
Here: 2 components
[Figure: grid of variables Xi,t for genes X(1) to X(4) at time points t1 to t10, divided into 2 components.]
72
Our new model: heterogeneous dynamic Bayesian network
Here: 3 components
[Figure: grid of variables Xi,t for genes X(1) to X(4) at time points t1 to t10, divided into 3 components.]
73
Learning with MCMC q h k Allocation vector
Number of components (here: 3)
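A small sketch of what an allocation vector looks like, assuming the components are contiguous time segments separated by changepoints. The changepoint representation and the function names are illustrative assumptions.

```python
def allocation_vector(n_timepoints, changepoints):
    """Map each time point to a component index.

    changepoints: sorted time indices at which a new component starts.
    """
    allocation = []
    component = 0
    for t in range(n_timepoints):
        if component < len(changepoints) and t == changepoints[component]:
            component += 1
        allocation.append(component)
    return allocation


def split_by_component(series, allocation):
    """Group the observations of one gene by the component they belong to."""
    segments = {}
    for value, component in zip(series, allocation):
        segments.setdefault(component, []).append(value)
    return segments


# 10 time points and 3 components, as in the example slides.
alloc = allocation_vector(10, changepoints=[4, 7])
segments = split_by_component(list(range(10)), alloc)
```

Each component then gets its own parameters, and an MCMC sampler can propose moves that shift, add or remove changepoints.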
75
Morphogenesis in Drosophila melanogaster
Gene expression measurements over 66 time steps of 4028 genes (Arbeitman et al., Science, 2002).
Selection of 11 genes involved in muscle development.
Zhao et al. (2006), Bioinformatics 22
76
Heterogeneous dynamic Bayesian network: Plausible segmentation?
[Figure: grid of variables Xi,t for genes X(1) to X(4) at time points 1 to 10, from X1,1 to X4,10.]
77
Number of components
78
Four stages of the Drosophila life cycle: embryo, larva, pupa, adult
79
time
80
[Figure axis: time.]
Morphogenetic transitions: embryo to larva, larva to pupa, pupa to adult.
The gene expression program governing the transition to adult morphology is active well before the fly emerges from the pupa.
81
Machine learning methods
- Bayesian networks (overview)
- Integration of biological prior knowledge
- Non-homogeneous Bayesian network for non-stationary processes
- Circadian gene regulatory network in Arabidopsis thaliana
- Current work
82
Circadian rhythms in Arabidopsis thaliana
Collaboration with the Institute of Molecular Plant Sciences at Edinburgh University (Andrew Miller’s group)
2 time series T20 and T28 of microarray gene expression data from Arabidopsis thaliana.
- Focus on 9 circadian genes: LHY, CCA1, TOC1, ELF4, ELF3, GI, PRR9, PRR5, and PRR3
- Both time series measured under constant light condition at 13 time points: 0h, 2h,…, 24h, 26h
- Plants entrained with different light:dark cycles: 10h:10h (T20) and 14h:14h (T28)
83
Gene expression time series plots (Arabidopsis data T20 and T28)
84
Predicted network
Edge colours: blue = activation, red = inhibition, black = mixture
Three different line widths:
- thin = PP>0.5
- medium = PP>0.75
- fat = PP>0.9
85
Cogs of the Plant Clockwork
Review: Rob McClung, Plant Cell 2006
Two major gene classes:
- Morning genes, e.g. LHY, CCA1, which repress evening genes
- Evening genes, e.g. TOC1, ELF3, ELF4, GI, LUX, which activate LHY and CCA1
86
Literature vs. inferred network
[Figure: literature network versus inferred network over ELF3, CCA1, LHY, PRR9, GI, TOC1, PRR5, PRR3 and ELF4, with false negatives and false positives marked.]
We expect direct inhibition of several evening genes by LHY/CCA1, here moved together for clarity. Interestingly, some of these were learned, but all arose from CCA1, not LHY. Instead, the BGM highlights a sequence of positive or mixed links: LHY-CCA1-PRR9-GI-TOC1/ELF3/PRR5-ELF4.
87
True positives (TP) = 8
False positives (FP) = 13
False negatives (FN) = 5
True negatives (TN) = 9² - 8 - 13 - 5 = 55
Sensitivity = TP/[TP+FN] = 62%
Specificity = TN/[TN+FP] = 81%
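The percentages can be checked directly from the counts. The one assumption is how the true negatives arise: TN = 55 is consistent with counting all 9² = 81 ordered gene pairs (self-loops included) and subtracting the other three counts.

```python
def sensitivity(tp, fn):
    """Fraction of true edges that were recovered."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of true non-edges that were correctly left out."""
    return tn / (tn + fp)


tp, fp, fn = 8, 13, 5
n_genes = 9
tn = n_genes ** 2 - (tp + fp + fn)  # assumed: all ordered pairs, self-loops included

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
```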
88
Overview of the plant clock model
[Figure: plant clock model. Labels: Morning, Evening, Y (GI), PRR9/PRR7, LHY/CCA1, TOC1, X, ZTL. Locke et al., Mol. Syst. Biol. 2006.]
Unknown component X allows > 8h delay between TOC1 and LHY/CCA1 expression.
This is the model as published; note that some genes are merged for parsimony. There is data for each link except TOC1 inhibition of GI, and X is an unknown TOC1-dependent component that activates LHY and CCA1. Given the long delay, it is unlikely the model would learn the X link, but there is a BGM link from ELF3 to CCA1 that is exactly as expected for X.
89
Literature vs. inferred network
[Figure: literature network versus inferred network over ELF3, CCA1, LHY, PRR9, GI, TOC1, PRR5, PRR3 and ELF4, with false negatives and false positives marked.]
We expect direct inhibition of several evening genes by LHY/CCA1, here moved together for clarity. Interestingly, some of these were learned, but all arose from CCA1, not LHY. Instead, the BGM highlights a sequence of positive or mixed links: LHY-CCA1-PRR9-GI-TOC1/ELF3/PRR5-ELF4.
90
Machine learning methods
- Bayesian networks (overview)
- Integration of biological prior knowledge
- Non-homogeneous Bayesian network for non-stationary processes
- Circadian gene regulatory network in Arabidopsis thaliana
- Current work
91
Flexible network structure with regularization
Joint work with Sophie Lèbre and Frank Dondelinger
92
Drosophila melanogaster: Expression of 11 muscle development genes over 66 time points
Fixed structure, flexible parameters
[Figure axis: time.]
Morphogenetic transitions: embryo to larva, larva to pupa, pupa to adult.
The gene expression program governing the transition to adult morphology is active well before the fly emerges from the pupa.
93
Transition probabilities: flexible structure with regularization
Morphogenetic transitions: embryo to larva, larva to pupa, pupa to adult
94
Comparison with: Ahmed & Xing
Dondelinger, Lèbre & Husmeier
95
Summary Mechanistic models Bayesian networks
Integration of biological prior knowledge Non-homogeneous Bayesian network for non-stationary processes
96
Any questions? Thank you!