Reconstructing gene regulatory networks with probabilistic models Marco Grzegorczyk Dirk Husmeier.


1 Reconstructing gene regulatory networks with probabilistic models Marco Grzegorczyk Dirk Husmeier

2

3

4 Regulatory network

5 Network unknown. High-throughput experiments. Postgenomic data. Machine learning. Statistics

6 Overview Introduction Bayesian networks Comparative evaluation Integration of biological prior knowledge A non-homogeneous Bayesian network for non-stationary processes Current work

7 Overview Introduction Bayesian networks Comparative evaluation Integration of biological prior knowledge A non-homogeneous Bayesian network for non-stationary processes Current work

8 Elementary molecular biological processes

9 Description with differential equations Rates Concentrations Kinetic parameters q

10 Given: Gene expression time series Can we infer the correct gene regulatory network?

11 Parameters q known: Numerically integrate the differential equations for different hypothetical networks

12 Model selection for known parameters q Gene expression time series predicted with different models Measured gene expression time series Highest likelihood: best model Compare

13 Model selection for unknown parameters q Gene expression time series predicted with different models Measured gene expression time series Highest likelihood: over-fitting

14 Bayesian model selection. Select the model with the highest posterior probability. This requires an integration over the whole parameter space. This integral is usually intractable.
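The formulas on this slide are not preserved in the transcript. As a hedged reconstruction in standard Bayesian notation (D for the data, M for the model, q for the parameters as on the preceding slides; the exact symbols on the original slide are an assumption), the posterior probability and the integral it requires can be written as:

```latex
P(M \mid D) \propto P(D \mid M)\, P(M),
\qquad
P(D \mid M) = \int P(D \mid q, M)\, P(q \mid M)\, dq
```

The second expression is the marginal likelihood; the integral runs over all values of the parameters q.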

15

16 Marginal likelihoods for the alternative pathways. Computationally expensive; network reconstruction ab initio infeasible

17 Overview Introduction Bayesian networks Comparative evaluation Integration of biological prior knowledge A non-homogeneous Bayesian network for non-stationary processes Current work

18 Objective: Reconstruction of regulatory networks ab initio Higher level of abstraction: Bayesian networks

19 Bayesian networks. [Figure: example graph with nodes A, B, C, D, E, F connected by edges] Marriage between graph theory and probability theory. Directed acyclic graph (DAG) representing conditional independence relations. It is possible to score a network in light of the data: P(D|M), D: data, M: network structure. We can infer how well a particular network explains the observed data.

20 Bayes net ODE model

21 Linear model: [A] = w1[P1] + w2[P2] + w3[P3] + w4[P4] + noise. [Figure: node A with parents P1, P2, P3, P4 and edge weights w1, w2, w3, w4]

22 Nonlinear discretized model. [Figure: parents P1 (activator) and P2 (repressor), with activation and inhibition edges] Allow for noise: probabilities. Conditional multinomial distribution

23 Model Parameters q Integral analytically tractable!
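To illustrate why the integral over the parameters can be done in closed form for a conditional multinomial model, here is a minimal sketch that assumes a symmetric Dirichlet prior with pseudo-count `alpha` per cell. The prior, the function name, and the data layout are assumptions for illustration; the slides do not state which prior is used.

```python
from collections import Counter
from math import lgamma

def log_marginal_likelihood(child, parents, n_states, alpha=1.0):
    """Log-probability of one node's data with the multinomial parameters integrated out.

    child:    list of discrete states (ints in range(n_states)), one per observation.
    parents:  list of tuples, one per observation, giving the parent states.
    alpha:    Dirichlet pseudo-count per cell (assumed, not taken from the slides).
    """
    cell_counts = Counter(zip(parents, child))   # counts per (parent config, child state)
    config_counts = Counter(parents)             # counts per parent configuration
    log_ml = 0.0
    for config, n_j in config_counts.items():
        log_ml += lgamma(n_states * alpha) - lgamma(n_states * alpha + n_j)
        for k in range(n_states):
            log_ml += lgamma(alpha + cell_counts[(config, k)]) - lgamma(alpha)
    return log_ml

# Usage: a child that copies its parent scores higher with the edge than without it.
parent = [0, 0, 1, 1, 0, 1]
child = [0, 0, 1, 1, 0, 1]
with_edge = log_marginal_likelihood(child, [(p,) for p in parent], n_states=2)
without_edge = log_marginal_likelihood(child, [() for _ in child], n_states=2)
```

Because the Dirichlet prior is conjugate to the multinomial, the result is a product of gamma functions and no numerical integration is needed.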

24

25

26 Example: 2 genes → 16 different network structures. Best network: maximum score
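One way to arrive at 16 structures for 2 genes is to let each of the four possible directed edges (including self-loops, as in a dynamic Bayesian network) be present or absent. That reading is an assumption; the slide does not say which edges are counted. The gene names below are hypothetical.

```python
from itertools import product

genes = ["g1", "g2"]  # hypothetical gene names
possible_edges = [(a, b) for a in genes for b in genes]  # 4 edges, self-loops included

# Every on/off combination of the possible edges is one candidate structure.
structures = [
    {edge for edge, present in zip(possible_edges, mask) if present}
    for mask in product([False, True], repeat=len(possible_edges))
]
print(len(structures))  # 16
```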

27 Identify the best network structure Ideal scenario: Large data sets, low noise

28 Uncertainty about the best network structure. Limited number of experimental replications, high noise

29 Sample of high-scoring networks

30 Feature extraction, e.g. marginal posterior probabilities of the edges

31 Sample of high-scoring networks. Feature extraction, e.g. marginal posterior probabilities of the edges. High-confidence edge. High-confidence non-edge. Uncertainty about edges
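The marginal posterior probability of an edge can be estimated as the fraction of sampled networks that contain it. A minimal sketch, assuming each sampled network is represented as a set of (source, target) pairs; the representation and the node names are illustrative, not taken from the slides:

```python
from collections import Counter

def edge_marginals(sampled_networks):
    """Fraction of sampled networks in which each edge appears."""
    counts = Counter(edge for net in sampled_networks for edge in net)
    n = len(sampled_networks)
    return {edge: c / n for edge, c in counts.items()}

# Hypothetical sample of four networks over nodes A, B, C.
sample = [{("A", "B")}, {("A", "B"), ("B", "C")}, {("A", "B")}, set()]
marginals = edge_marginals(sample)
```

Values close to 1 correspond to high-confidence edges, values close to 0 to high-confidence non-edges, and intermediate values to uncertainty about the edge.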

32 Can we generalize this scheme to more than 2 genes? In principle yes. However …

33 Number of structures Number of nodes
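To give a feel for how quickly the number of structures grows, the sketch below counts labelled directed acyclic graphs with Robinson's recurrence. It assumes the plot refers to DAGs on static nodes, which is a guess: the two-gene example on an earlier slide appears to use a different counting convention.

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labelled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )

for n in range(1, 6):
    print(n, count_dags(n))  # 1, 3, 25, 543, 29281
```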

34 Complete enumeration infeasible → Hill climbing: accept a move when the score increases

35 Configuration space of network structures Local optimum

36 Configuration space of network structures. MCMC: local change. If [condition]: accept. If [condition]: accept with probability [formula]
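The acceptance conditions on this slide are lost in the transcript. The sketch below shows a generic Metropolis sampler over network structures with single-edge toggles, which may differ from the authors' scheme. The standard Metropolis rule (always accept an improvement, otherwise accept with probability equal to the score ratio) is assumed, and `toy_score` is a made-up log-score used only to make the example run.

```python
import math
import random

def is_acyclic(edges, n_nodes):
    """Repeatedly strip nodes without incoming edges; failure means a cycle."""
    remaining, edges = set(range(n_nodes)), set(edges)
    while remaining:
        sources = {v for v in remaining if not any(b == v for _, b in edges)}
        if not sources:
            return False
        remaining -= sources
        edges = {(a, b) for a, b in edges if a not in sources}
    return True

def structure_mcmc(log_score, n_nodes, n_iter, seed=0):
    """Sample DAGs (as frozensets of edges) with probability proportional to exp(log_score)."""
    rng = random.Random(seed)
    pairs = [(a, b) for a in range(n_nodes) for b in range(n_nodes) if a != b]
    current = frozenset()
    current_score = log_score(current)
    samples = []
    for _ in range(n_iter):
        proposal = current ^ {rng.choice(pairs)}  # local change: toggle one edge
        if is_acyclic(proposal, n_nodes):
            diff = log_score(proposal) - current_score
            # Metropolis rule: accept with probability min(1, exp(diff)).
            if rng.random() < math.exp(min(0.0, diff)):
                current, current_score = proposal, current_score + diff
        samples.append(current)
    return samples

def toy_score(edges):
    """Hypothetical log-score that favours the edge 0 -> 1."""
    return 3.0 if (0, 1) in edges else 0.0

samples = structure_mcmc(toy_score, n_nodes=3, n_iter=20000)
freq_01 = sum((0, 1) in s for s in samples) / len(samples)
```

Because toggling a uniformly chosen node pair is a symmetric proposal, no Hastings correction is needed in this simplified version.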

37 Algorithm converges to

38 Madigan & York (1995), Giudici & Castelo (2003)

39 Configuration space of network structures. Problem: Local changes → small steps → slow convergence, difficult to cross valleys.

40 Configuration space of network structures. Problem: Global changes → large steps → low acceptance → slow convergence.

41 Configuration space of network structures Can we make global changes that jump onto other peaks and are likely to be accepted?

42

43 MCMC trace plots, plotted against iteration number. Panels: conventional scheme, new scheme

44 Overview Introduction Bayesian networks Comparative evaluation Integration of biological prior knowledge A non-homogeneous Bayesian network for non-stationary processes Current work

45

46

47 Example: Protein signalling pathway. Cell membrane, nucleus. TF phosphorylation -> cell response

48 Evaluation on the Raf signalling pathway From Sachs et al Science 2005 Cell membrane Receptor molecules Inhibition Activation Interaction in signalling pathway Phosphorylated protein

49 Flow cytometry data. Intracellular multicolour flow cytometry experiments: concentrations of 11 proteins. 5400 cells have been measured under 9 different cellular conditions (cues). Downsampling to 100 instances (5 separate subsets): indicative of microarray experiments

50 Simulated data or “gold standard” from the literature

51

52

53 From Perry Sprawls

54 ROC curve 5 FP counts BN GGM RN

55 ROC curve FP TP Four different evaluation criteria DGE UGE TP for fixed FP Area under the curve (AUC)
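As an illustration of the area-under-the-curve criterion, the sketch below computes the AUC from edge scores and a gold-standard edge set. It uses the rank-sum identity (the probability that a true edge outscores a non-edge). The edge names and scores are made up, and how the slides treat edge direction (the DGE and UGE criteria) is not modelled here.

```python
def auc(scores, true_edges):
    """Area under the ROC curve for edge scores against a gold standard.

    scores:     dict mapping each candidate edge to its score.
    true_edges: set of edges in the gold-standard network.
    """
    pos = [s for e, s in scores.items() if e in true_edges]
    neg = [s for e, s in scores.items() if e not in true_edges]
    # Count pairs where the true edge ranks above the non-edge (ties count half).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical edge scores, e.g. marginal posterior probabilities.
scores = {("A", "B"): 0.9, ("B", "C"): 0.6, ("A", "C"): 0.7, ("C", "A"): 0.1}
gold = {("A", "B"), ("B", "C")}
print(auc(scores, gold))  # 0.75
```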

56 Synthetic data, observations Relevance networks Bayesian networks Graphical Gaussian models

57 Synthetic data, interventions

58 Cytometry data, interventions

59 Overview Introduction Bayesian networks Comparative evaluation Integration of biological prior knowledge A non-homogeneous Bayesian network for non-stationary processes Current work

60

61 Can we complement microarray data with prior knowledge from public data bases like KEGG? KEGG pathway Microarray data

62 How do we extract prior knowledge from a collection of KEGG pathways?

63 Relative frequency of edge occurrence: $B_{i,j} = \dfrac{\text{total number of edges } i \to j \text{ that appear in the extracted pathways}}{\text{total number of times the gene pair } [i,j] \text{ is included in the extracted pathways}}$. Example: Extract 20 pathways, 10 contain [i,j], 8 contain i → j: $B_{i,j} = 8/10 = 0.8$
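The relative-frequency calculation can be sketched as follows, assuming each extracted pathway is given as a pair (set of genes, set of directed edges). This data layout and the function name are hypothetical; the slides do not describe how KEGG pathways are parsed.

```python
def edge_prior(pathways, i, j):
    """Relative frequency of the edge i -> j among pathways containing both genes."""
    with_pair = [edges for genes, edges in pathways if i in genes and j in genes]
    if not with_pair:
        return None  # gene pair never observed; handling of this case is left open
    return sum((i, j) in edges for edges in with_pair) / len(with_pair)

# The slide's example: 20 pathways, 10 contain the pair [i, j], 8 contain i -> j.
pathways = (
    [({"i", "j"}, {("i", "j")})] * 8
    + [({"i", "j"}, set())] * 2
    + [({"k"}, set())] * 10
)
print(edge_prior(pathways, "i", "j"))  # 0.8
```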

64 Prior knowledge from KEGG Raf network 0.25 0 0.5 0 0.87 0 1 0.5 0 0 0 1 0.71 0 0

65 Prior distribution over networks. Deviation between the network M and the prior knowledge B, where the prior knowledge entries ∈ [0,1] and the graph entries ∈ {0,1}. Hyperparameter β
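The formula for the prior is not preserved in the transcript. One form consistent with the labels on this slide, assumed here for illustration only, measures the deviation as an "energy" E(M) = Σ |B_ij − M_ij| and sets the prior proportional to exp(−β E(M)). The sketch below computes the energy and the unnormalised log-prior under that assumption; the matrices are made-up examples.

```python
def energy(B, M):
    """Sum of absolute deviations between prior knowledge B and adjacency matrix M."""
    return sum(abs(b - m) for row_b, row_m in zip(B, M) for b, m in zip(row_b, row_m))

def log_prior_unnormalised(B, M, beta):
    """Log of exp(-beta * E(M)); the normalising constant is omitted."""
    return -beta * energy(B, M)

# Hypothetical 2-gene example: the prior knowledge favours the edge 0 -> 1.
B = [[0.0, 0.8], [0.1, 0.0]]
M_agree = [[0, 1], [0, 0]]      # contains 0 -> 1
M_disagree = [[0, 0], [1, 0]]   # contains 1 -> 0 instead

for beta in (0.0, 2.0):
    print(beta, log_prior_unnormalised(B, M_agree, beta),
          log_prior_unnormalised(B, M_disagree, beta))
```

Under this assumed form, β = 0 makes the prior flat, so the data alone decide, while a larger β penalises networks that deviate from the prior knowledge more strongly.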

66 Hyperparameter β trades off data versus prior knowledge KEGG pathway Microarray data β

67 Hyperparameter β trades off data versus prior knowledge KEGG pathway Microarray data β small

68 Hyperparameter β trades off data versus prior knowledge KEGG pathway Microarray data β large

69 Sample networks and hyperparameters from the posterior distribution

70 Revision Prior distribution Marginal likelihood Integral analytically tractable for Bayesian networks

71 Application to the Raf pathway: Flow cytometry data and KEGG

72 ROC curve FP TP Four different evaluation criteria DGE UGE TP for fixed FP Area under the curve (AUC)

73 β

74 Overview Introduction Bayesian networks Comparative evaluation Integration of biological prior knowledge A non-homogeneous Bayesian network for non-stationary processes Current work

75

76 Example: 4 genes, 10 time points. [Data matrix: rows X(1) to X(4) (genes), columns t1 to t10 (time points), entries X_{i,t}]

77 Standard dynamic Bayesian network: homogeneous model. [Data matrix: rows X(1) to X(4) (genes), columns t1 to t10 (time points), entries X_{i,t}]

78 Our new model: heterogeneous dynamic Bayesian network. Here: 2 components. [Data matrix: rows X(1) to X(4) (genes), columns t1 to t10 (time points), entries X_{i,t}]

79 Our new model: heterogeneous dynamic Bayesian network. Here: 3 components. [Data matrix: rows X(1) to X(4) (genes), columns t1 to t10 (time points), entries X_{i,t}]

80 We have to learn from the data: Number of different components Allocation of time points

81 Two MCMC strategies q k h Number of components (here: 3) Allocation vector
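As an illustration of what an allocation vector can look like, the sketch below assigns each of 10 time points to a component, given a list of breakpoints. The function name and the breakpoint representation are hypothetical. The slides do not specify how allocations are encoded, and contiguous segments are only one possible choice.

```python
from bisect import bisect_right

def allocation_from_breakpoints(n_points, breakpoints):
    """Component label (1-based) for each time point 1..n_points.

    breakpoints: sorted time indices at which a new component starts.
    """
    return [bisect_right(breakpoints, t) + 1 for t in range(1, n_points + 1)]

# 10 time points split into 3 components, with new components starting at t4 and t8.
allocation = allocation_from_breakpoints(10, [4, 8])
print(allocation)  # [1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
```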

82

83 Synthetic study: posterior probability of the number of components

84 Circadian clock in Arabidopsis thaliana Collaboration with the Institute of Molecular Plant Sciences (Andrew Millar) Focus on 9 circadian genes. 2 time series T20 and T28 of microarray gene expression data from Arabidopsis thaliana. Plants entrained with different light:dark cycles 10h:10h (T20) and 14h:14h (T28)

85 Macrophage, cytomegalovirus (CMV), interferon gamma (IFNγ). Infection. Treatment. Collaboration with DPM

86 Macrophage, IFNγ: 12 hour time course measuring total RNA (0 to 12 h, 30 min sampling). 72 Agilent arrays, 24 samples per group: infection with CMV; pre-treatment with IFNγ; IFNγ + CMV. Time series statistical analysis (using EDGE). Clustering analysis

87 Posterior probability of the number of components

88 Literature → “known” interactions between three cytokines: IRF1, IRF2 and IRF3. Evaluation: average marginal posterior probabilities of the edges versus non-edges

89 Sample of high-scoring networks

90 IRF1, IRF2, IRF3: gold standard known → posterior probabilities of true interactions

91 AUROC scores New model BGe BDe

92 Circadian rhythms in Arabidopsis thaliana. Collaboration with the Institute of Molecular Plant Sciences at Edinburgh University. 2 time series T20 and T28 of microarray gene expression data from Arabidopsis thaliana.
- Focus on 9 circadian genes: LHY, CCA1, TOC1, ELF4, ELF3, GI, PRR9, PRR5, and PRR3
- Both time series measured under constant light condition at 13 time points: 0h, 2h, …, 24h, 26h
- Plants entrained with different light:dark cycles 10h:10h (T20) and 14h:14h (T28)

93 Gene expression time series plots (Arabidopsis data T 20 and T 28 ) T 28 T 20

94 Posterior probability of the number of components

95 Predicted network. Blue: activation; red: inhibition; black: mixture. Three different line widths: thin = PP > 0.5, medium = PP > 0.75, fat = PP > 0.9

96 Overview Introduction Bayesian networks Comparative evaluation Integration of biological prior knowledge A non-homogeneous Bayesian network for non-stationary processes Current work

97 Standard dynamic Bayesian network: homogeneous model. [Data matrix: rows X(1) to X(4) (genes), columns t1 to t10 (time points), entries X_{i,t}]

98 Heterogeneous dynamic Bayesian network. [Data matrix: rows X(1) to X(4) (genes), columns t1 to t10 (time points), entries X_{i,t}]

99 Heterogeneous dynamic Bayesian network with node-specific breakpoints. [Data matrix: rows X(1) to X(4) (genes), columns t1 to t10 (time points), entries X_{i,t}]

100 Evaluation on synthetic data. X, Y(1), Y(2), Y(3); f: three phase-shifted sinusoids. AUROC: BGe, heterogeneous BNet without/with node-specific breakpoints

101 Four time series for A. thaliana under different experimental conditions (KAY, KDE, T20, T28). Network obtained for merged data. Blue: activation; red: inhibition; black: mixture. Three different line widths: thin = PP > 0.5, medium = PP > 0.75, fat = PP > 0.9

102 [Figure panels: KAY_LL, KDE_LL, T20, T28]

103 Data: monolithic vs. separate. Propose a compromise between the two

104

105

106 [Figure: networks M1, M2, ..., MI and data sets D1, D2, ..., DI, linked through a shared M*] Compromise between the two previous ways of combining the data

107 Original work with Adriano: Poor convergence and mixing due to overly strong coupling effects. Marco’s current work: Improve convergence and mixing by weakening the coupling.

108 Mean absolute deviation of edge posterior probabilities (independent BN inference)
       KAY    KDE    T20    T28
KAY    ---    0.14   0.15   0.14
KDE    0.14   ---    0.19   0.15
T20    0.15   0.19   ---    0.10
T28    0.14   0.15   0.10   ---

109 Mean absolute deviation of edge posterior probabilities (coupled BN inference)
       KAY    KDE    T20    T28
KAY    ---    0.11   0.12   0.11
KDE    0.11   ---    0.13   0.11
T20    0.12   0.13   ---    0.06
T28    0.11   0.11   0.06   ---

110 Mean absolute deviation of edge posterior (independent BN - coupled BN)
       KAY    KDE    T20    T28
KAY    ---    0.03   0.03   0.03
KDE    0.03   ---    0.05   0.03
T20    0.03   0.05   ---    0.04
T28    0.03   0.03   0.04   ---
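A mean absolute deviation between two sets of edge posterior probabilities can be computed as in the sketch below. It averages over every matrix entry supplied. Whether the original analysis excludes self-edges or other entries is not stated in the slides, and the matrices shown are made-up examples.

```python
def mean_absolute_deviation(P, Q):
    """Average of |P_ij - Q_ij| over all entries of two equally shaped matrices."""
    diffs = [abs(p - q) for row_p, row_q in zip(P, Q) for p, q in zip(row_p, row_q)]
    return sum(diffs) / len(diffs)

# Hypothetical edge posterior probabilities inferred from two data sets.
P = [[0.0, 0.9], [0.2, 0.0]]
Q = [[0.0, 0.7], [0.4, 0.0]]
mad = mean_absolute_deviation(P, Q)  # approximately 0.1
```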

111 Summary Differential equation models Bayesian networks Comparative evaluation Integration of biological prior knowledge A non-homogeneous Bayesian network for non-stationary processes Current work

112 Adriano Werhli, Marco Grzegorczyk

113 Thank you! Any questions?

