Mechanistic models and machine learning methods for TIMET. Dirk Husmeier.

Presentation transcript:

1 Mechanistic models and machine learning methods for TIMET. Dirk Husmeier

2 Protein signalling pathway (from Sachs et al., Science 2005). [Figure labels: cell membrane; receptor molecules; phosphorylated protein; interaction in signalling pathway; activation; inhibition]

3 Can we learn the signalling pathway from data? (From Sachs et al., Science 2005.) [Figure labels: cell membrane; receptor molecules; phosphorylated protein; interaction in signalling pathway; activation; inhibition]

4 Network unknown. High-throughput experiments; postgenomic data; machine learning; statistics

5 Methodology: mechanistic models; machine learning methods. Workpackages:
- WP 1.7: Re-calibrate the circadian clock model for mature plants growing without exogenous sugars.
- WP 2.4: Bi-directional regulation: mechanistic modelling of each metabolic pathway, with connections to the clock.
- WP 2.5: Bi-directional regulation: testing predictions of bidirectional models.

6 Methodology: mechanistic models; Bayesian networks; integration of biological prior knowledge; non-homogeneous Bayesian network for non-stationary processes

7 Regulatory network

8 Elementary molecular biological processes

9 Description with differential equations

10

11 Kinetic parameters θ; concentrations; rates

12 Description with differential equations: rates; concentrations; kinetic parameters θ

13 Parameters θ known: numerically integrate the differential equations for different hypothetical networks
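
The step on this slide can be sketched as follows. This is a minimal illustration only: the two-gene system, the Hill-type rate law and all parameter values (`production`, `decay`, `K`) are assumptions made for the example, not the model used in the talk. It integrates the equations with a fixed-step fourth-order Runge-Kutta scheme for two hypothetical networks (gene 0 activating or repressing gene 1).

```python
from typing import Callable, List, Sequence

State = List[float]

def rk4(f: Callable[[float, State], State], x0: Sequence[float],
        t_end: float, dt: float) -> List[State]:
    """Integrate dx/dt = f(t, x) from t = 0 to t_end with classical RK4."""
    x = list(x0)
    t = 0.0
    trajectory = [list(x)]
    for _ in range(int(round(t_end / dt))):
        k1 = f(t, x)
        k2 = f(t + dt / 2, [xi + dt / 2 * k for xi, k in zip(x, k1)])
        k3 = f(t + dt / 2, [xi + dt / 2 * k for xi, k in zip(x, k2)])
        k4 = f(t + dt, [xi + dt * k for xi, k in zip(x, k3)])
        x = [xi + dt / 6 * (a + 2 * b + 2 * c + d)
             for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]
        t += dt
        trajectory.append(list(x))
    return trajectory

def make_network(sign: int, production: float = 1.0,
                 decay: float = 0.5, K: float = 1.0):
    """Hypothetical 2-gene network: gene 0 is produced at a constant rate
    and regulates gene 1 (sign=+1: activation, sign=-1: repression)."""
    def f(t: float, x: State) -> State:
        a, b = x
        regulation = a / (K + a) if sign > 0 else K / (K + a)
        return [production - decay * a,
                production * regulation - decay * b]
    return f

# Predicted time series for the two candidate networks (assumed parameters).
activation = rk4(make_network(+1), [0.0, 0.0], t_end=20.0, dt=0.01)
repression = rk4(make_network(-1), [0.0, 0.0], t_end=20.0, dt=0.01)
```

Each trajectory is a list of `[gene 0, gene 1]` concentrations; comparing such predicted series with measured ones is what the model selection slides that follow are about.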

14 Experiment: gene expression time series. Can we infer the correct gene regulatory network?

15 Model selection for known parameters θ: compare the gene expression time series predicted with different models against the measured gene expression time series. Highest likelihood: best model.

16 Model selection for unknown parameters θ: compare the gene expression time series predicted with different models against the measured gene expression time series. Joint maximum likelihood:

17 1) Practical problem: numerical optimization over the parameters θ. 2) Conceptual problem: overfitting. The maximum likelihood increases with increasing network complexity.

18 Regularization, e.g. BIC: a data misfit term (evaluated at the maximum likelihood parameters) plus a regularization term (depending on the number of parameters and the number of data points)
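
As a small worked example of the two terms named on this slide, the sketch below computes BIC under the common sign convention BIC = -2 ln L_max + k ln N (lower is better). The convention and all numbers are assumptions for illustration; the transcript does not preserve the formula shown on the slide.

```python
import math

def bic(max_log_likelihood: float, n_params: int, n_data: int) -> float:
    """BIC = -2 ln L_max + k ln N (lower is better under this convention)."""
    misfit = -2.0 * max_log_likelihood       # data misfit term
    penalty = n_params * math.log(n_data)    # regularization term
    return misfit + penalty

# Hypothetical numbers: the richer network fits slightly better
# but pays a larger complexity penalty.
simple_model = bic(max_log_likelihood=-120.0, n_params=4, n_data=50)
complex_model = bic(max_log_likelihood=-118.0, n_params=9, n_data=50)
```

With these made-up values the simpler model has the lower BIC, which is the behaviour the regularization term is meant to produce when extra parameters buy only a marginal gain in fit.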

19 Model selection: find the best pathway. Select the model with the highest posterior probability. This requires an integration over the whole parameter space.

20 Model selection: find the best pathway. Select the model with the highest posterior probability. This requires an integration over the whole parameter space. This integral is usually analytically intractable.

21 Complexity problem: this requires an integration over the whole parameter space θ. The numerical approximation is highly non-trivial.
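
The formulas on these two slides are not preserved in the transcript. In standard Bayesian model selection notation (an assumption about what was shown, with M a candidate pathway, D the data and θ the parameters), the posterior probability and the integral over the parameter space read:

```latex
P(M \mid D) = \frac{P(D \mid M)\, P(M)}{\sum_{M'} P(D \mid M')\, P(M')},
\qquad
P(D \mid M) = \int P(D \mid \theta, M)\, P(\theta \mid M)\, d\theta
```

The second expression is the marginal likelihood; for nonlinear differential equation models it generally has no closed form.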

22

23 Marginal likelihoods for the alternative pathways. Computationally expensive; network reconstruction ab initio infeasible

24 NIPS 2008

25 Objective: reconstruction of regulatory networks ab initio. Higher level of abstraction: Bayesian networks

26 Machine learning methods: Bayesian networks (overview); integration of biological prior knowledge; non-homogeneous Bayesian networks for non-stationary processes; circadian gene regulatory network in Arabidopsis thaliana; current work

27 Friedman et al. (2000), J. Comp. Biol. 7, 601-620. Marriage between graph theory and probability theory

28 Bayes net ODE model

29 Linear model: [A] = w1[P1] + w2[P2] + w3[P3] + w4[P4] + noise. [Figure: node A with parents P1, P2, P3, P4 and edge weights w1, w2, w3, w4]

30 Model; parameters θ. Integral analytically tractable!

31 Example: 2 genes → 16 different network structures. Best network: maximum score
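
A minimal sketch of this exhaustive search. It assumes the count of 16 arises because each of the four possible directed edges between two genes (A→A, A→B, B→A, B→B) is either present or absent; the transcript does not state how the 16 structures are defined. The scoring function is left as a caller-supplied placeholder.

```python
from itertools import product

genes = ["A", "B"]

# Assumption: every ordered pair, including self-loops, may carry an edge.
possible_edges = [(src, dst) for src in genes for dst in genes]

# Enumerate all 2**4 = 16 subsets of the possible edges.
structures = [
    frozenset(edge for edge, present in zip(possible_edges, mask) if present)
    for mask in product([False, True], repeat=len(possible_edges))
]

def best_structure(score):
    """Return the structure with the maximum score (score: structure -> float)."""
    return max(structures, key=score)
```

Any function that maps a set of edges to a number, for example a log marginal likelihood, can be passed as `score`.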

32 Identify the best network structure. Ideal scenario: large data sets, low noise

33 Uncertainty about the best network structure. Limited number of experimental replications, high noise

34 Sample of high-scoring networks

35 Feature extraction, e.g. marginal posterior probabilities of the edges

36 Sample of high-scoring networks. Feature extraction, e.g. marginal posterior probabilities of the edges: high-confidence edge; high-confidence non-edge; uncertainty about edges
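
The feature extraction described here can be sketched as a simple frequency count: the marginal posterior probability of an edge is estimated as the fraction of sampled networks that contain it. The sample below is hypothetical and only illustrates the three cases named on the slide.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, Tuple

Edge = Tuple[str, str]

def edge_posteriors(samples: Iterable[FrozenSet[Edge]]) -> Dict[Edge, float]:
    """Estimate P(edge | data) as the fraction of sampled networks containing it."""
    nets = list(samples)
    counts = Counter(edge for net in nets for edge in net)
    return {edge: count / len(nets) for edge, count in counts.items()}

# Hypothetical sample of four high-scoring networks over genes A and B.
sample = [
    frozenset({("A", "B")}),
    frozenset({("A", "B"), ("B", "A")}),
    frozenset({("A", "B")}),
    frozenset({("A", "B"), ("B", "A")}),
]
posteriors = edge_posteriors(sample)
```

In this toy sample A→B appears in every network (high-confidence edge), B→A in half of them (uncertain), and A→A never appears, so `posteriors.get(("A", "A"), 0.0)` is 0 (high-confidence non-edge).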

37 Can we generalize this scheme to more than 2 genes? In principle yes. However …

38 [Plot: number of structures versus number of nodes]

39 Configuration space of network structures. Find the high-scoring structures: sampling from the posterior distribution

40 Configuration space of network structures. MCMC: local change. If …: accept. If …: accept with probability …
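
To give a feel for the growth plotted on this slide, the sketch below counts structures in two ways. `num_dags` uses Robinson's recurrence for the number of labelled directed acyclic graphs; `num_unrestricted` counts all directed graphs with self-loops allowed, which is the assumption under which 2 genes give 16 structures. Which of the two counts the slide actually plotted is not recoverable from the transcript.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n: int) -> int:
    """Number of labelled directed acyclic graphs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
        for k in range(1, n + 1)
    )

def num_unrestricted(n: int) -> int:
    """Number of directed graphs on n nodes when every ordered pair,
    including self-loops, may or may not carry an edge."""
    return 2 ** (n * n)

dag_counts = [num_dags(n) for n in range(1, 6)]
```

Both counts grow super-exponentially with the number of nodes, which is why exhaustive enumeration stops being an option after a handful of genes.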
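
The acceptance conditions on this slide are lost in the transcript. The sketch below uses the standard Metropolis-Hastings rule as an assumption: a local change (toggling one edge) is accepted outright if it does not lower the score, and otherwise accepted with probability equal to the score ratio. Because toggling a uniformly chosen edge is a symmetric proposal, no Hastings correction is needed here. The sketch imposes no acyclicity constraint, and the score function `toy_log_score` is a made-up stand-in for a real log posterior.

```python
import math
import random
from typing import Callable, FrozenSet, List, Tuple

Edge = Tuple[int, int]
Net = FrozenSet[Edge]

def structure_mcmc(log_score: Callable[[Net], float], n_nodes: int,
                   n_iter: int, burn_in: int, seed: int = 0) -> List[Net]:
    """Metropolis sampler over edge sets using single-edge toggles."""
    rng = random.Random(seed)
    all_edges = [(i, j) for i in range(n_nodes) for j in range(n_nodes)]
    current: Net = frozenset()
    current_score = log_score(current)
    samples: List[Net] = []
    for iteration in range(n_iter):
        proposal = current ^ {rng.choice(all_edges)}   # local change: toggle one edge
        proposal_score = log_score(proposal)
        log_ratio = proposal_score - current_score
        # Accept if the score does not decrease, else with probability exp(log_ratio).
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            current, current_score = proposal, proposal_score
        if iteration >= burn_in:
            samples.append(current)
    return samples

# Hypothetical score: penalise each edge that differs from a made-up target network.
target: Net = frozenset({(0, 1), (1, 0)})

def toy_log_score(net: Net) -> float:
    return -5.0 * len(net ^ target)

samples = structure_mcmc(toy_log_score, n_nodes=2, n_iter=3000, burn_in=500)
freq_01 = sum((0, 1) in net for net in samples) / len(samples)
freq_00 = sum((0, 0) in net for net in samples) / len(samples)
```

The list `samples` plays the role of the sample of high-scoring networks from which edge posterior probabilities are extracted.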

41 Machine learning methods: Bayesian networks (overview); integration of biological prior knowledge; non-homogeneous Bayesian network for non-stationary processes; circadian gene network in Arabidopsis thaliana; current work

42 Bayesian inference. Select the model based on the posterior probability. This requires an integration over the whole parameter space.

43 Uncertainty about the best network structure. Limited number of experimental replications, high noise

44 Reduced uncertainty by using prior knowledge: data + prior knowledge

45 Bayesian analysis: integration of prior knowledge. Hyperparameter β trades off data (microarray data) versus prior knowledge (KEGG pathway)

46 Hyperparameter β trades off data (microarray data) versus prior knowledge (KEGG pathway): β small

47 Hyperparameter β trades off data (microarray data) versus prior knowledge (KEGG pathway): β large
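
The transcript does not preserve how β enters the model. One common formalisation, assumed here purely for illustration, is a Gibbs-type prior P(M | β) ∝ exp(-β E(M)) with energy E(M) = Σ |B_ij - M_ij|, where M is a binary adjacency matrix and B holds prior edge confidences in [0, 1]. The matrix `B` below is made up.

```python
import math
from itertools import product
from typing import List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]

def energy(M: Matrix, B: Matrix) -> float:
    """Total absolute disagreement between a network M and prior knowledge B."""
    return sum(abs(b - m) for row_b, row_m in zip(B, M) for b, m in zip(row_b, row_m))

def prior_over_structures(B: Matrix, beta: float) -> Tuple[List[List[List[int]]], List[float]]:
    """Enumerate all binary n x n adjacency matrices and return them
    with their normalised prior probabilities exp(-beta * E) / Z."""
    n = len(B)
    structures = [
        [list(bits[r * n:(r + 1) * n]) for r in range(n)]
        for bits in product([0, 1], repeat=n * n)
    ]
    weights = [math.exp(-beta * energy(M, B)) for M in structures]
    Z = sum(weights)
    return structures, [w / Z for w in weights]

# Hypothetical prior knowledge for 2 genes: strong belief in the edge 0 -> 1.
B = [[0.0, 0.9],
     [0.1, 0.0]]

structures, flat = prior_over_structures(B, beta=0.0)    # beta small: prior ignored
_, peaked = prior_over_structures(B, beta=5.0)           # beta large: prior dominates
most_probable = structures[peaked.index(max(peaked))]
```

Under this assumed form, β = 0 gives every structure the same prior probability, so the data alone decide, while a large β concentrates the prior on the network that agrees best with `B`.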

48 Input: Learn: MCMC

49

50 Raf signalling pathway (from Sachs et al., Science 2005). [Figure labels: cell membrane; receptor molecules; phosphorylated protein; interaction in signalling pathway; activation; inhibition]

51 Flow cytometry data. Intracellular multicolour flow cytometry experiments: concentrations of 11 proteins. 5400 cells have been measured under 9 different cellular conditions (cues). Downsampling to 100 instances (5 separate subsets): indicative of microarray experiments

52 Prior knowledge from KEGG. [Figure: prior knowledge matrix; visible entries: 0.25 0 0.5 0 0.87 0 1 0.5 0 0 1 0.71 0 0] Data: protein concentrations from flow cytometry experiments

53 Protein signalling network from the literature

54 Predicted network: 11 nodes, 20 edges, 90 non-edges. Among the 20 top-scoring edges, 15/20 are correct (75%) and 5 are false. So 5/90 non-edges are wrongly predicted and 85/90 are correctly rejected (94%).

55 Machine learning methods: Bayesian networks (overview); integration of biological prior knowledge; non-homogeneous Bayesian network for non-stationary processes; circadian gene regulatory network in Arabidopsis thaliana; current work

56

57 Example: 4 genes, 10 time points

       t1    t2    t3    t4    t5    t6    t7    t8    t9    t10
X(1)   X1,1  X1,2  X1,3  X1,4  X1,5  X1,6  X1,7  X1,8  X1,9  X1,10
X(2)   X2,1  X2,2  X2,3  X2,4  X2,5  X2,6  X2,7  X2,8  X2,9  X2,10
X(3)   X3,1  X3,2  X3,3  X3,4  X3,5  X3,6  X3,7  X3,8  X3,9  X3,10
X(4)   X4,1  X4,2  X4,3  X4,4  X4,5  X4,6  X4,7  X4,8  X4,9  X4,10

58 Standard dynamic Bayesian network: homogeneous model. [Figure: 4 × 10 data matrix with genes X(1) to X(4) as rows and time points t1 to t10 as columns]

59 Our new model: heterogeneous dynamic Bayesian network. Here: 2 components. [Figure: 4 × 10 data matrix with genes X(1) to X(4) as rows and time points t1 to t10 as columns]

60 Our new model: heterogeneous dynamic Bayesian network. Here: 3 components. [Figure: 4 × 10 data matrix with genes X(1) to X(4) as rows and time points t1 to t10 as columns]

61 Learning with MCMC: parameters θ; number of components k (here: 3); allocation vector h
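
A minimal sketch of what an allocation vector looks like, assuming the components are contiguous time segments separated by changepoints (the figures on the preceding slides suggest this, but the transcript does not state it). The function name and the changepoint positions are hypothetical.

```python
from typing import List, Sequence

def allocation_vector(n_timepoints: int, changepoints: Sequence[int]) -> List[int]:
    """Assign each time index (0-based) to a component; a changepoint c means
    that a new component starts at time index c."""
    cps = sorted(changepoints)
    h: List[int] = []
    component = 0
    for t in range(n_timepoints):
        # Advance to the next component once t reaches the next changepoint.
        while component < len(cps) and t >= cps[component]:
            component += 1
        h.append(component)
    return h

# 10 time points split into 3 components by two made-up changepoints.
h = allocation_vector(10, [4, 7])
k = len(set(h))
```

Here `h` labels the first four time points with component 0, the next three with component 1 and the last three with component 2, so `k` is 3. In an MCMC scheme, moves would propose changes to `h` (and hence to `k`) alongside the other quantities.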

62 Morphogenesis in Drosophila melanogaster. Gene expression measurements over 66 time steps of 4028 genes (Arbeitman et al., Science, 2002). Selection of 11 genes involved in muscle development. Zhao et al. (2006), Bioinformatics 22

63 Heterogeneous dynamic Bayesian network: plausible segmentation? [Figure: 4 × 10 data matrix with genes X(1) to X(4) as rows and time points t1 to t10 as columns]

64 Number of components

65 Four stages of the Drosophila life cycle: embryo → larva → pupa → adult

66 time

67 Morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult. [Axis: time] The gene expression program governing the transition to adult morphology is active well before the fly emerges from the pupa.

68 Machine learning methods: Bayesian networks (overview); integration of biological prior knowledge; non-homogeneous Bayesian network for non-stationary processes; circadian gene regulatory network in Arabidopsis thaliana; current work

69 Circadian rhythms in Arabidopsis thaliana
Collaboration with the Institute of Molecular Plant Sciences at Edinburgh University (Andrew Millar's group).
2 time series, T20 and T28, of microarray gene expression data from Arabidopsis thaliana.
- Focus on 9 circadian genes: LHY, CCA1, TOC1, ELF4, ELF3, GI, PRR9, PRR5, and PRR3
- Both time series measured under constant light condition at 13 time points: 0h, 2h, …, 24h, 26h
- Plants entrained with different light:dark cycles: 10h:10h (T20) and 14h:14h (T28)

70 Gene expression time series plots (Arabidopsis data T20 and T28). [Panels: T28, T20]

71 Predicted network
Edge colours: blue = activation; red = inhibition; black = mixture.
Three different line widths:
- thin: PP > 0.5
- medium: PP > 0.75
- fat: PP > 0.9

72 Cogs of the Plant Clockwork. Review: Rob McClung, Plant Cell 2006. Two major gene classes: morning genes (e.g. LHY, CCA1) repress evening genes (e.g. TOC1, ELF3, ELF4, GI, LUX), which activate LHY and CCA1

73 Literature vs. inferred network. [Figure: network over CCA1, LHY, PRR9, GI, ELF3, TOC1, ELF4, PRR5, PRR3; legend: false positives, false negatives]

74 True positives (TP) = 8; false positives (FP) = 13; false negatives (FN) = 5; true negatives (TN) = 9² - 8 - 13 - 5 = 55. Sensitivity = TP/[TP+FN] = 62%. Specificity = TN/[TN+FP] = 81%

75 Overview of the plant clock model (Locke et al., Mol. Syst. Biol. 2006). Unknown component X allows > 8h delay between TOC1 and LHY/CCA1 expression. [Figure labels: X; LHY/CCA1; TOC1; Y (GI); PRR9/PRR7; ZTL; Morning; Evening]

76 Literature vs. inferred network. [Figure: network over CCA1, LHY, PRR9, GI, ELF3, TOC1, ELF4, PRR5, PRR3; legend: false positives, false negatives]
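
The arithmetic on this slide can be checked with a few lines. The sketch follows the slide's own convention of counting all 9² ordered gene pairs, self-pairs included, as candidate edges; the function name is hypothetical.

```python
from typing import Tuple

def confusion_rates(tp: int, fp: int, fn: int, n_nodes: int) -> Tuple[int, float, float]:
    """Return (true negatives, sensitivity, specificity), taking all
    n_nodes**2 ordered pairs as candidate edges as on the slide."""
    tn = n_nodes ** 2 - tp - fp - fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return tn, sensitivity, specificity

tn, sensitivity, specificity = confusion_rates(tp=8, fp=13, fn=5, n_nodes=9)
```

This gives `tn` = 55, a sensitivity of 8/13 and a specificity of 55/68, which round to the 62% and 81% quoted on the slide.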

77 Machine learning methods: Bayesian networks (overview); integration of biological prior knowledge; non-homogeneous Bayesian network for non-stationary processes; circadian gene regulatory network in Arabidopsis thaliana; current work

78 Flexible network structure with regularization. Joint work with Sophie Lèbre and Frank Dondelinger

79 Drosophila melanogaster: expression of 11 muscle development genes over 66 time points. Fixed structure, flexible parameters. Morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult. [Axis: time] The gene expression program governing the transition to adult morphology is active well before the fly emerges from the pupa.

80 Transition probabilities: flexible structure with regularization. Morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult

81 Comparison with: Dondelinger, Lèbre & Husmeier; Ahmed & Xing

82 Summary: mechanistic models; Bayesian networks; integration of biological prior knowledge; non-homogeneous Bayesian network for non-stationary processes

83 Thank you! Any questions?

