Published by Marshall Wiggins. Modified over 9 years ago.
1
Probabilistic modelling in computational biology
Dirk Husmeier
Biomathematics & Statistics Scotland
2
James Watson & Francis Crick, 1953
3
Frederick Sanger, 1980
5
Microarrays; next-generation sequencing
7
PART 1 Genomics
18
Maximum likelihood: forward-backward algorithm, expectation-maximization algorithm
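As a rough illustration of the forward-backward algorithm named on this slide, here is a minimal pure-Python sketch for a discrete hidden Markov model. The function name `forward_backward`, the argument layout (`init`, `trans`, `emit` as nested lists), and the toy numbers are assumptions made for this example, not taken from the talk. The forward messages are rescaled at every step to avoid numerical underflow.

```python
import math


def forward_backward(init, trans, emit, obs):
    """Return (gamma, log_lik) for a discrete HMM.

    init[k]     : P(z_1 = k)
    trans[j][k] : P(z_t = k | z_{t-1} = j)
    emit[k][x]  : P(x_t = x | z_t = k)
    obs         : list of observed symbol indices
    gamma[t][k] : posterior P(z_t = k | all observations)
    """
    n_states = len(init)
    n_steps = len(obs)

    # Forward pass: alpha[t] is P(z_t | x_1..x_t); scale[t] is P(x_t | x_1..x_{t-1}).
    alpha, scale = [], []
    for t, x in enumerate(obs):
        if t == 0:
            a = [init[k] * emit[k][x] for k in range(n_states)]
        else:
            prev = alpha[-1]
            a = [
                emit[k][x] * sum(prev[j] * trans[j][k] for j in range(n_states))
                for k in range(n_states)
            ]
        c = sum(a)
        alpha.append([v / c for v in a])
        scale.append(c)

    # Backward pass, rescaled with the same factors as the forward pass.
    beta = [[1.0] * n_states for _ in range(n_steps)]
    for t in range(n_steps - 2, -1, -1):
        x_next = obs[t + 1]
        beta[t] = [
            sum(trans[k][j] * emit[j][x_next] * beta[t + 1][j] for j in range(n_states))
            / scale[t + 1]
            for k in range(n_states)
        ]

    gamma = [
        [alpha[t][k] * beta[t][k] for k in range(n_states)] for t in range(n_steps)
    ]
    log_lik = sum(math.log(c) for c in scale)
    return gamma, log_lik


# Toy two-state example (all numbers are made up for illustration).
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.9, 0.1], [0.2, 0.8]]
gamma, log_lik = forward_backward(init, trans, emit, [0, 0, 1, 1])
```

In an expectation-maximization loop, the posteriors in `gamma` (together with pairwise posteriors, omitted here) would supply the expected counts for re-estimating the parameters.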
19
Bayesian inference: Gibbs sampling, stochastic forward-backward algorithm
20
Beta distribution
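One reason the Beta distribution appears in Bayesian inference is that it is conjugate to the binomial likelihood: a Beta(a, b) prior combined with s successes and f failures gives a Beta(a + s, b + f) posterior. The sketch below shows that update; the function names and the counts are hypothetical and chosen only for illustration.

```python
def beta_posterior(a, b, successes, failures):
    """Conjugate update: Beta(a, b) prior plus binomial counts."""
    return a + successes, b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Uniform prior Beta(1, 1), then 7 successes and 3 failures (made-up counts).
a_post, b_post = beta_posterior(1, 1, successes=7, failures=3)
posterior_mean = beta_mean(a_post, b_post)  # 8 / 12
```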
25
Factorial HMM
29
PART 2 Systems Biology
32
Network reconstruction from postgenomic data
33
Model; parameters θ
34
Marriage between graph theory and probability theory
Friedman et al. (2000), J. Comp. Biol. 7, 601-620
35
Bayes net ODE model
36
Model; parameters θ; probability theory; likelihood
37
Model; parameters θ. Bayesian networks: integral analytically tractable!
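The integral in question is presumably the marginal likelihood of a network structure, obtained by integrating the likelihood over the parameters. Written out, with M for the model (network structure) and D for the data as notation assumed here rather than taken from the slide:

```latex
P(D \mid M) = \int P(D \mid \theta, M)\, P(\theta \mid M)\, d\theta ,
\qquad
P(M \mid D) \propto P(D \mid M)\, P(M)
```

When the prior on θ is conjugate to the likelihood, this integral has a closed form, which is what makes scoring candidate structures practical.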
38
UAI 1994
39
Identify the best network structure. Ideal scenario: large data sets, low noise
40
Uncertainty about the best network structure. Limited number of experimental replications, high noise
41
Sample of high-scoring networks
42
Feature extraction, e.g. marginal posterior probabilities of the edges: high-confidence edge, high-confidence non-edge, uncertainty about edges
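Given a sample of networks from the posterior, the marginal posterior probability of an edge can be estimated as the fraction of sampled networks that contain it. The following sketch assumes each network is a 0/1 adjacency matrix stored as nested lists; the function name and the four toy networks are illustrative assumptions.

```python
def edge_posteriors(sampled_networks):
    """Estimate P(edge i -> j | data) as the fraction of samples containing it."""
    n_nodes = len(sampled_networks[0])
    counts = [[0] * n_nodes for _ in range(n_nodes)]
    for adjacency in sampled_networks:
        for i in range(n_nodes):
            for j in range(n_nodes):
                counts[i][j] += adjacency[i][j]
    n_samples = len(sampled_networks)
    return [[c / n_samples for c in row] for row in counts]


# Four made-up sampled networks over three nodes.
sample = [
    [[0, 1, 0], [0, 0, 1], [0, 0, 0]],
    [[0, 1, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 1, 1], [0, 0, 1], [0, 0, 0]],
    [[0, 1, 0], [0, 0, 0], [0, 0, 0]],
]
probs = edge_posteriors(sample)
# probs[0][1] == 1.0  (present in every sample: high-confidence edge)
# probs[2][0] == 0.0  (never present: high-confidence non-edge)
# probs[1][2] == 0.5  (present in half the samples: uncertain)
```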
43
[Plot: number of structures versus number of nodes.] Sampling with MCMC
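To give a sense of how quickly the number of structures grows with the number of nodes, the sketch below counts labelled directed acyclic graphs using Robinson's recurrence. Treating the structures on this slide as DAGs is an assumption, and the function name is chosen for this example.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labelled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


for n in range(1, 6):
    print(n, count_dags(n))
# 1 1
# 2 3
# 3 25
# 4 543
# 5 29281
```

The super-exponential growth is why exhaustive enumeration is infeasible beyond a handful of nodes and sampling is used instead.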
44
Madigan & York (1995), Giudici & Castelo (2003)
46
Overview: Introduction; Limitations; Methodology; Application to morphogenesis; Application to synthetic biology
47
Homogeneity assumption: interactions don’t change with time
48
Limitations of the homogeneity assumption
49
Example: 4 genes, 10 time points

        t1     t2     t3     t4     t5     t6     t7     t8     t9     t10
X(1)    X1,1   X1,2   X1,3   X1,4   X1,5   X1,6   X1,7   X1,8   X1,9   X1,10
X(2)    X2,1   X2,2   X2,3   X2,4   X2,5   X2,6   X2,7   X2,8   X2,9   X2,10
X(3)    X3,1   X3,2   X3,3   X3,4   X3,5   X3,6   X3,7   X3,8   X3,9   X3,10
X(4)    X4,1   X4,2   X4,3   X4,4   X4,5   X4,6   X4,7   X4,8   X4,9   X4,10
50
Supervised learning. Here: 2 components

        t1     t2     t3     t4     t5     t6     t7     t8     t9     t10
X(1)    X1,1   X1,2   X1,3   X1,4   X1,5   X1,6   X1,7   X1,8   X1,9   X1,10
X(2)    X2,1   X2,2   X2,3   X2,4   X2,5   X2,6   X2,7   X2,8   X2,9   X2,10
X(3)    X3,1   X3,2   X3,3   X3,4   X3,5   X3,6   X3,7   X3,8   X3,9   X3,10
X(4)    X4,1   X4,2   X4,3   X4,4   X4,5   X4,6   X4,7   X4,8   X4,9   X4,10
51
Changepoint model: parameters can change with time
53
Unsupervised learning. Here: 3 components

        t1     t2     t3     t4     t5     t6     t7     t8     t9     t10
X(1)    X1,1   X1,2   X1,3   X1,4   X1,5   X1,6   X1,7   X1,8   X1,9   X1,10
X(2)    X2,1   X2,2   X2,3   X2,4   X2,5   X2,6   X2,7   X2,8   X2,9   X2,10
X(3)    X3,1   X3,2   X3,3   X3,4   X3,5   X3,6   X3,7   X3,8   X3,9   X3,10
X(4)    X4,1   X4,2   X4,3   X4,4   X4,5   X4,6   X4,7   X4,8   X4,9   X4,10
54
Extension of the model: θ
55
θ
56
θ; k: number of components (here: 3); h: allocation vector
57
Analytically integrate out the parameters θ; k: number of components (here: 3); h: allocation vector
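To make the allocation vector concrete: for a changepoint model it assigns each time point to one of the k components. The sketch below builds such a vector from a set of changepoint positions, under the assumed convention that a changepoint at time c starts a new component at c. The function name and the changepoint positions 4 and 8 are hypothetical.

```python
def allocation_vector(n_points, changepoints):
    """Label time points 1..n_points with a 1-based component index.

    A changepoint c means that a new component starts at time point c
    (a convention assumed for this sketch).
    """
    cps = sorted(changepoints)
    labels = []
    component = 1
    for t in range(1, n_points + 1):
        # Advance to the next component for every changepoint reached at t.
        while component - 1 < len(cps) and t >= cps[component - 1]:
            component += 1
        labels.append(component)
    return labels


# 10 time points, two changepoints, hence k = 3 components.
h = allocation_vector(10, [4, 8])
# h == [1, 1, 1, 2, 2, 2, 2, 3, 3, 3]
```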
59
RJMCMC within Gibbs:
P(network structure | changepoints, data)
P(changepoints | network structure, data): birth, death, and relocation moves
60
Dynamic programming, complexity N^2
62
Circadian rhythms in Arabidopsis thaliana
Collaboration with the Institute of Molecular Plant Sciences at Edinburgh University (Andrew Millar’s group)
- Focus on 9 circadian genes: LHY, CCA1, TOC1, ELF4, ELF3, GI, PRR9, PRR5, and PRR3
- Transcriptional profiles at 4*13 time points in 2h intervals under constant light for 4 experimental conditions
63
Comparison with the literature
Precision: proportion of identified interactions that are correct
Recall = Sensitivity: proportion of true interactions that we successfully recovered
Specificity: proportion of non-interactions that are successfully avoided
64
Which interactions from the literature are found?
[Network diagram over CCA1, LHY, PRR9, GI, ELF3, TOC1, ELF4, PRR5, PRR3, with edges marked as true positive or false negative. Blue: activations; red: inhibitions.]
True positives (TP) = 8
False negatives (FN) = 5
Recall = 8/13 = 62%
65
Which proportion of predicted interactions are confirmed by the literature?
[Network diagram with edges marked as true positive or false positive. Blue: activations; red: inhibitions.]
True positives (TP) = 8
False positives (FP) = 13
Precision = 8/21 = 38%
66
Precision = 38%, Recall = 62%
[Network diagram over CCA1, LHY, PRR9, GI, ELF3, TOC1, ELF4, PRR5, PRR3.]
67
Literature = gold standard. Scores are pessimistic. Precision = 50%, Recall = 50%. Not random expectation.
68
True positives (TP) = 8
False positives (FP) = 13
False negatives (FN) = 5
True negatives (TN) = 9² - 8 - 13 - 5 = 55
Sensitivity (recall) = TP/[TP+FN] = 62%
Specificity (proportion of avoided non-interactions) = TN/[TN+FP] = 81%
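The arithmetic on these evaluation slides can be checked with a few lines of Python. The counts are the ones given in the slides (9 genes, hence 9² possible directed interactions); the function name is an assumption for this example.

```python
def confusion_metrics(tp, fp, fn, n_nodes):
    """Precision, recall (sensitivity), and specificity from edge counts.

    True negatives are all remaining node pairs out of n_nodes ** 2.
    """
    tn = n_nodes ** 2 - tp - fp - fn
    return {
        "tn": tn,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


metrics = confusion_metrics(tp=8, fp=13, fn=5, n_nodes=9)
for name in ("precision", "recall", "specificity"):
    print(name, f"{round(100 * metrics[name])}%")
# precision 38%
# recall 62%
# specificity 81%
```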
69
Model extension. So far: non-stationarity in the regulatory process
70
Non-stationarity in the network structure
71
Flexible network structure.
72
Model; parameters θ
73
Use prior knowledge!
74
Flexible network structure.
75
Flexible network structure with regularization: hyperparameter, normalization factor
76
Flexible network structure with regularization: exponential prior versus binomial prior with conjugate beta hyperprior
77
NIPS 2010
78
Overview: Introduction; Limitations; Methodology; Application to morphogenesis; Application to synthetic biology
79
Morphogenesis in Drosophila melanogaster
Gene expression measurements at 66 time points during the life cycle of Drosophila (Arbeitman et al., Science, 2002). Selection of 11 genes involved in muscle development.
Zhao et al. (2006), Bioinformatics 22
80
Can we learn the morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult?
81
Average posterior probabilities of transitions. Morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult
83
Can we learn changes in the regulatory network structure?
85
Overview: Introduction; Limitations; Methodology; Application to morphogenesis; Application to synthetic biology
88
Can we learn the switch Galactose → Glucose? Can we learn the network structure?
89
Task 1: Changepoint detection. Switch of the carbon source: Galactose → Glucose
91
Task 2: Network reconstruction
Precision: proportion of identified interactions that are correct
Recall: proportion of true interactions that we successfully recovered
92
BANJO: conventional homogeneous DBN
TSNI: method based on differential equations
Inference: optimization, “best” network
94
Sample of high-scoring networks
95
Marginal posterior probabilities of the edges: P=1, P=0, P=0.5
97
Part 3 Future work Strategic issues
101
Phylogenetics → phylogenomics; high performance computing
102
How are we getting from here …
103
… to there ?!
104
Phylogenetics → phylogenomics; high performance computing; collaboration with computer scientists
105
Input: Learn: MCMC
107
Phylogenetics → phylogenomics; high performance computing; collaboration with computer scientists; collaboration with biologists
108
Phylogenetics → phylogenomics; high performance computing; collaboration with computer scientists; collaboration with biologists; MRC University of Glasgow Centre of Excellence in Virology (virus evolution, virus-host interactions)
109
Scottish Government 2011-2016 science strategy: Climate change and biodiversity
111
Spatial autocorrelation and bio-climate variables
Spatial autocorrelation: Z = weighted abundance from Markov neighbourhood.
Bio-climate variables: Z = temperature, water, …
112
Ecological Informatics 5, 451-464, 2010
113
Collaboration with Andrej Aderhold and V. Anne Smith, School of Biology, University of St Andrews
114
Collaboration with Andrej Aderhold (Computer Scientist) and V. Anne Smith (Biologist), School of Biology, University of St Andrews
115
Computer Science, Biology, Statistics
116
Phylogenetics → phylogenomics; high performance computing; collaboration with computer scientists; collaboration with biologists; MRC University of Glasgow Centre of Excellence in Virology (virus evolution, virus-host interactions); ecological networks and biodiversity