Mechanistic models and machine learning methods for TIMET
Dirk Husmeier
Protein signalling pathway. (Figure: receptor molecules at the cell membrane, phosphorylated proteins, and activating and inhibiting interactions in the signalling pathway. From Sachs et al., Science 2005.)
Can we learn the signalling pathway from data? (Same figure: receptor molecules, cell membrane, phosphorylated proteins, activating and inhibiting interactions. From Sachs et al., Science 2005.)
High-throughput experiments: the network is unknown; high-throughput experiments yield postgenomic data, which are analysed with machine learning and statistics.
Methodology: work packages, mechanistic models and machine learning methods.
WP 1.7: Re-calibrate the circadian clock model for mature plants growing without exogenous sugars.
WP 2.4: Bi-directional regulation: mechanistic modelling of each metabolic pathway, with connections to the clock.
WP 2.5: Bi-directional regulation: testing predictions of the bidirectional models.
Methodology
- Mechanistic models
- Bayesian networks
- Integration of biological prior knowledge
- Non-homogeneous Bayesian networks for non-stationary processes
Regulatory network
Elementary molecular biological processes
Description with differential equations: the concentrations change according to rate equations, d[x]/dt = f([x]; q), governed by the kinetic parameters q.
Parameters q known: Numerically integrate the differential equations for different hypothetical networks
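A minimal sketch of this step, under toy assumptions: a hypothetical two-gene network (not one of the TIMET clock models), with illustrative kinetic parameters q, integrated numerically with SciPy.

```python
# Sketch: numerically integrate the ODEs of a hypothetical two-gene network
# for a given kinetic parameter vector q.  All rate constants are assumptions.
import numpy as np
from scipy.integrate import odeint

def network_rhs(x, t, q):
    """Hypothetical network: gene 1 activates gene 2; both degrade linearly."""
    k_syn1, k_act, k_deg1, k_deg2 = q
    x1, x2 = x
    dx1 = k_syn1 - k_deg1 * x1                    # constitutive synthesis and degradation
    dx2 = k_act * x1 / (1.0 + x1) - k_deg2 * x2   # saturating activation by gene 1
    return [dx1, dx2]

q = [1.0, 2.0, 0.5, 0.3]                  # assumed kinetic parameters
t = np.linspace(0, 24, 13)                # e.g. a time course sampled every 2h
x0 = [0.1, 0.0]                           # initial concentrations
trajectory = odeint(network_rhs, x0, t, args=(q,))
print(trajectory.shape)                   # (13, 2): predicted concentration time series
```

Repeating this integration for each hypothetical network gives the predicted time series that are compared with the measured data on the following slides.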
Experiment: gene expression time series. Can we infer the correct gene regulatory network?
Model selection for known parameters q: compare the gene expression time series predicted with the different models against the measured gene expression time series; the model with the highest likelihood is the best model.
Model selection for unknown parameters q: compare the predicted and measured gene expression time series via the joint maximum likelihood over the parameters q and the model.
1) Practical problem: numerical optimization.
2) Conceptual problem: overfitting; the maximized likelihood increases as the network complexity increases.
Regularization, e.g. BIC: the score combines a data misfit term, evaluated at the maximum likelihood parameters, with a regularization term that grows with the number of parameters and depends on the number of data points.
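For concreteness, a standard form of the BIC and its reading as a large-sample (Laplace) approximation of the log marginal likelihood, with \hat{q} the maximum likelihood parameters, k the number of parameters and N the number of data points:

```latex
\mathrm{BIC} \;=\; \underbrace{-2\log P(D \mid \hat{q}, M)}_{\text{data misfit}}
\;+\; \underbrace{k \log N}_{\text{regularization}},
\qquad
\log P(D \mid M) \;\approx\; \log P(D \mid \hat{q}, M) - \tfrac{k}{2}\log N + O(1).
```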
Model selection: find the best pathway. Select the model with the highest posterior probability P(M | D). This requires an integration over the whole parameter space, P(D | M) = ∫ P(D | q, M) P(q | M) dq, and this integral is usually analytically intractable.
Complexity problem: the integration over the whole parameter space of the kinetic parameters q has to be approximated numerically, and this numerical approximation is highly non-trivial.
Illustration of annealed importance sampling: annealing from the prior distribution to the posterior distribution. (Figure taken from the MSc thesis by Ben Calderhead.)
Outer loop: annealing scheme. Centre loop: MCMC. Inner loop: numerical solution of the differential equations.
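A minimal sketch of this nested scheme under toy assumptions: the likelihood below is a stand-in Gaussian (in the real setting the inner loop would solve the differential equations for each proposed parameter vector), and the annealing schedule, proposal width and run lengths are illustrative.

```python
# Annealed importance sampling (AIS) sketch for estimating a marginal likelihood.
# Toy one-dimensional parameter q; the likelihood is a Gaussian placeholder.
import numpy as np

rng = np.random.default_rng(0)

def log_prior(q):                        # prior: q ~ N(0, 1)
    return -0.5 * q**2 - 0.5 * np.log(2 * np.pi)

def log_lik(q):                          # placeholder for the ODE-based likelihood
    return -0.5 * (q - 2.0)**2 / 0.5**2 - np.log(0.5 * np.sqrt(2 * np.pi))

betas = np.linspace(0.0, 1.0, 50)        # outer loop: annealing scheme
n_runs, n_mcmc = 200, 5
log_weights = np.zeros(n_runs)

for r in range(n_runs):
    q = rng.normal()                     # start from the prior
    for j in range(1, len(betas)):
        log_weights[r] += (betas[j] - betas[j - 1]) * log_lik(q)
        for _ in range(n_mcmc):          # centre loop: MCMC at inverse temperature beta_j
            q_new = q + 0.5 * rng.normal()
            log_alpha = (log_prior(q_new) + betas[j] * log_lik(q_new)
                         - log_prior(q) - betas[j] * log_lik(q))
            if np.log(rng.uniform()) < log_alpha:
                q = q_new

# Estimate of log P(D|M): log of the average importance weight.
log_Z = np.logaddexp.reduce(log_weights) - np.log(n_runs)
print(log_Z)
```

In this toy case the exact answer is available for checking (the evidence of a Gaussian prior and Gaussian likelihood); in the ODE setting every likelihood evaluation requires a full numerical integration, which is what makes the scheme so expensive.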
Marginal likelihoods for the alternative pathways: computationally expensive, so network reconstruction ab initio is infeasible.
NIPS 2008
Objective: reconstruction of regulatory networks ab initio. Higher level of abstraction: Bayesian networks.
Machine learning methods
- Bayesian networks (overview)
- Integration of biological prior knowledge
- Non-homogeneous Bayesian networks for non-stationary processes
- Circadian gene regulatory network in Arabidopsis thaliana
- Current work
Marriage between graph theory and probability theory (Friedman et al. (2000), J. Comp. Biol. 7, 601-620).
Bayes net versus ODE model.
Bayesian networks
- Marriage between graph theory and probability theory.
- Directed acyclic graph (DAG) representing conditional independence relations.
- It is possible to score a network in light of the data: P(D|M), with D the data and M the network structure.
- We can infer how well a particular network explains the observed data.
(Figure: example DAG with nodes A-F.)
Linear model: [A] = w1[P1] + w2[P2] + w3[P3] + w4[P4] + noise, where P1, ..., P4 are the parents of node A and w1, ..., w4 their weights.
Nonlinear discretized model: P1 acts as an activator and P2 as a repressor of P (activation and inhibition). Allowing for noise leads to probabilities: a conditional multinomial distribution.
Model and parameters q: with a conjugate prior on the parameters, the integral over the parameter space is analytically tractable, giving a closed-form marginal likelihood P(D|M).
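As an illustration, a sketch of such a closed-form score for a single node (an assumption for concreteness: a BDe-type multinomial likelihood with a Dirichlet prior, evaluated on toy discretised data):

```python
# Closed-form marginal likelihood of one node given its parents in a discrete
# Bayesian network (multinomial likelihood, Dirichlet prior with pseudo-count alpha).
import numpy as np
from itertools import product
from scipy.special import gammaln

def log_marginal_node(data, node, parents, n_states=3, alpha=1.0):
    """log P(column `node` | columns `parents`), integrating out the parameters."""
    log_ml = 0.0
    for parent_config in product(range(n_states), repeat=len(parents)):
        mask = (np.all(data[:, parents] == parent_config, axis=1)
                if parents else np.ones(len(data), dtype=bool))
        counts = np.bincount(data[mask, node], minlength=n_states)
        # Dirichlet-multinomial: Gamma(A)/Gamma(A+N) * prod_k Gamma(a_k+N_k)/Gamma(a_k)
        log_ml += (gammaln(alpha * n_states) - gammaln(alpha * n_states + counts.sum())
                   + np.sum(gammaln(alpha + counts) - gammaln(alpha)))
    return log_ml

rng = np.random.default_rng(1)
data = rng.integers(0, 3, size=(100, 3))      # 100 samples of 3 discretised genes
print(log_marginal_node(data, node=0, parents=[1, 2]))
```

Summing such node scores over all nodes gives the network score P(D|M) used on the following slides.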
Example: 2 genes, 16 different network structures. Best network: the one with the maximum score.
Identify the best network structure. Ideal scenario: large data sets, low noise.
Uncertainty about the best network structure: limited number of experimental replications, high noise.
Sample of high-scoring networks. Feature extraction, e.g. marginal posterior probabilities of the edges, quantifies the uncertainty about the edges: high-confidence edges versus high-confidence non-edges.
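A small sketch of this feature extraction (illustrative data; structures represented as adjacency matrices sampled from the posterior):

```python
# Estimate marginal posterior edge probabilities by averaging over a sample
# of high-scoring network structures (adjacency matrices).
import numpy as np

def edge_posteriors(sampled_adjacencies):
    """Fraction of sampled structures in which each directed edge is present."""
    return np.mean(np.asarray(sampled_adjacencies, dtype=float), axis=0)

# Toy sample of three 3-node structures (entry [i, j] = 1 means an edge i -> j).
sample = [np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]]),
          np.array([[0, 1, 0], [0, 0, 0], [0, 1, 0]]),
          np.array([[0, 1, 1], [0, 0, 1], [0, 0, 0]])]
P = edge_posteriors(sample)
print(P)          # P[0, 1] = 1.0: a high-confidence edge; P[2, 0] = 0.0: a non-edge
```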
Can we generalize this scheme to more than 2 genes? In principle yes. However …
The number of possible network structures grows super-exponentially with the number of nodes.
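For reference, a known combinatorial result behind this claim (not shown explicitly on the slide): the number of DAGs on n labelled nodes, a_n, follows Robinson's recursion and already reaches 1, 3, 25, 543, 29281, ... for n = 1, ..., 5:

```latex
a_n \;=\; \sum_{k=1}^{n} (-1)^{k+1} \binom{n}{k}\, 2^{k(n-k)}\, a_{n-k},
\qquad a_0 = 1 .
```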
Sampling from the posterior distribution: find the high-scoring structures in the configuration space of network structures.
MCMC: propose a local change of the current structure; if the score increases, accept the move; otherwise accept it with a probability given by the Metropolis-Hastings acceptance ratio (a random walk in the configuration space of network structures).
Madigan & York (1995), Giudici & Castelo (2003).
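A minimal sketch of such structure MCMC with single-edge moves; the scoring function below is a toy placeholder (an assumption standing in for the closed-form marginal likelihood), and the acceptance rule assumes a symmetric proposal, ignoring the usual neighbourhood-size correction:

```python
# Structure MCMC over directed acyclic graphs with local (single-edge) moves.
import numpy as np

rng = np.random.default_rng(0)
n = 5                                            # number of nodes

def log_score(adj):
    """Placeholder network score; replace with log P(D|M) + log P(M)."""
    return -abs(adj.sum() - 4.0)                 # toy score favouring ~4 edges

def is_acyclic(adj):
    """Check acyclicity by repeatedly removing parentless nodes."""
    a = adj.copy()
    while a.size:
        roots = np.where(a.sum(axis=0) == 0)[0]
        if len(roots) == 0:
            return False                         # a cycle remains
        a = np.delete(np.delete(a, roots, axis=0), roots, axis=1)
    return True

adj = np.zeros((n, n), dtype=int)                # start from the empty graph
samples = []
for it in range(5000):
    i, j = rng.choice(n, size=2, replace=False)
    proposal = adj.copy()
    proposal[i, j] = 1 - proposal[i, j]          # local move: add or delete edge i -> j
    if is_acyclic(proposal):
        log_alpha = log_score(proposal) - log_score(adj)
        if np.log(rng.uniform()) < log_alpha:
            adj = proposal                       # accept; otherwise keep the current graph
    samples.append(adj.copy())                   # collected sample of structures
```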
Problem: local changes mean small steps, hence slow convergence, and it is difficult to cross valleys in the configuration space of network structures.
Problem: global changes mean large steps with a low acceptance probability, hence again slow convergence.
Can we make global changes that jump onto other peaks of the configuration space and are still likely to be accepted?
Machine learning methods: Bayesian networks (overview); integration of biological prior knowledge; non-homogeneous Bayesian networks for non-stationary processes; circadian gene regulatory network in Arabidopsis thaliana; current work.
Bayesian inference: select the model based on the posterior probability P(M | D) ∝ P(D | M) P(M). This requires an integration over the whole parameter space: P(D | M) = ∫ P(D | q, M) P(q | M) dq.
Uncertainty about the best network structure: limited number of experimental replications, high noise.
Reduced uncertainty by using prior knowledge: combine the data with prior knowledge.
Bayesian analysis: integration of prior knowledge. The hyperparameter β trades off the data (microarray data) against the prior knowledge (KEGG pathway).
For small β, the microarray data dominate; for large β, the KEGG pathway prior dominates.
Input: data and prior knowledge (KEGG). Learn: network structure and hyperparameter β, via MCMC.
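A sketch of how such a prior is typically formulated (assumed here for illustration, following the standard energy-based construction rather than a formula shown on the slide): the mismatch between a structure M and the prior network derived from KEGG is measured by an energy E(M), and both the structure and β are sampled:

```latex
P(M \mid \beta) \;=\; \frac{e^{-\beta E(M)}}{\sum_{M'} e^{-\beta E(M')}},
\qquad
P(M, \beta \mid D) \;\propto\; P(D \mid M)\, P(M \mid \beta)\, P(\beta).
```

For β = 0 the prior over structures is flat (the data dominate); for large β only structures close to the KEGG pathway retain appreciable prior mass.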
Raf signalling pathway. (Figure: receptor molecules at the cell membrane, phosphorylated proteins, and activating and inhibiting interactions in the signalling pathway. From Sachs et al., Science 2005.)
Flow cytometry data
- Intracellular multicolour flow cytometry experiments: concentrations of 11 proteins.
- 5400 cells have been measured under 9 different cellular conditions (cues).
- Downsampling to 100 instances (5 separate subsets): indicative of microarray experiments.
Prior knowledge from KEGG: a matrix of prior edge confidences (entries such as 0.87, 1, 0.71, 0.5, 0.25). Data: protein concentrations from flow cytometry experiments.
Protein signalling network from the literature
Predicted network: 11 nodes; the gold-standard network has 20 edges and 90 non-edges. Of the 20 top-scoring predicted edges, 15/20 are correct (75%); only 5/90 non-edges are falsely predicted as edges, i.e. 94% of the non-edges are recovered correctly.
Machine learning methods: Bayesian networks (overview); integration of biological prior knowledge; non-homogeneous Bayesian networks for non-stationary processes; circadian gene regulatory network in Arabidopsis thaliana; current work.
Dynamic Bayesian network
Example: 4 genes, 10 time points
Standard dynamic Bayesian network: homogeneous model. (Figure: the four genes X(1), ..., X(4) unrolled over the 10 time points, X_{i,1}, ..., X_{i,10}.)
Our new model: heterogeneous dynamic Bayesian network. Here: 2 components. (Figure: the unrolled network with the time points t1, ..., t10 partitioned into two segments.)
Our new model: heterogeneous dynamic Bayesian network. Here: 3 components. (Figure: the same network with the time points partitioned into three segments.)
Learning with MCMC: sample the network parameters q, the allocation vector h, and the number of components k (here: 3).
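An illustrative sketch of the allocation-vector idea (toy data and a fixed segmentation as assumptions; in the MCMC scheme h, k and the parameters q are all sampled):

```python
# Each time point is assigned to one of k components by the allocation vector h;
# the model parameters are then estimated separately for each component.
import numpy as np

rng = np.random.default_rng(2)
T, k = 10, 3
x = np.concatenate([rng.normal(0, 1, 4), rng.normal(3, 1, 3), rng.normal(-2, 1, 3)])

# Allocation vector h: h[t] = component responsible for time point t.
h = np.array([1, 1, 1, 1, 2, 2, 2, 3, 3, 3])

for component in range(1, k + 1):
    segment = x[h == component]
    print(f"component {component}: n = {len(segment)}, mean = {segment.mean():.2f}")
```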
Morphogenesis in Drosophila melanogaster
- Gene expression measurements of 4028 genes over 66 time steps (Arbeitman et al., Science, 2002).
- Selection of 11 genes involved in muscle development (Zhao et al. (2006), Bioinformatics 22).
Heterogeneous dynamic Bayesian network: plausible segmentation? (Figure: the unrolled network over genes and time points.)
Number of components: compare with the four stages of the Drosophila life cycle: embryo, larva, pupa, adult.
(Figure: inferred segmentation along the time axis.) Morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult. The gene expression program governing the transition to adult morphology is active well before the fly emerges from the pupa.
Machine learning methods: Bayesian networks (overview); integration of biological prior knowledge; non-homogeneous Bayesian networks for non-stationary processes; circadian gene regulatory network in Arabidopsis thaliana; current work.
Circadian rhythms in Arabidopsis thaliana
Collaboration with the Institute of Molecular Plant Sciences at Edinburgh University (Andrew Millar's group).
- 2 time series, T20 and T28, of microarray gene expression data from Arabidopsis thaliana.
- Focus on 9 circadian genes: LHY, CCA1, TOC1, ELF4, ELF3, GI, PRR9, PRR5, and PRR3.
- Both time series measured under constant light at 13 time points: 0h, 2h, ..., 24h, 26h.
- Plants entrained with different light:dark cycles: 10h:10h (T20) and 14h:14h (T28).
Gene expression time series plots (Arabidopsis data T20 and T28)
Predicted network (figure legend). Edge colours: blue = activation, red = inhibition, black = mixture. Three different line widths: thin = PP > 0.5, medium = PP > 0.75, fat = PP > 0.9.
Cogs of the plant clockwork (review: Rob McClung, Plant Cell 2006). Two major gene classes: morning genes (e.g. LHY, CCA1) repress evening genes (e.g. TOC1, ELF3, ELF4, GI, LUX), which in turn activate LHY and CCA1.
Literature vs. inferred network. We expect direct inhibition of several evening genes by LHY/CCA1, shown together here for clarity. Interestingly, some of these edges were learned, but all arose from CCA1, not LHY; instead, the BGM highlights a sequence of positive or mixed links: LHY-CCA1-PRR9-GI-TOC1/ELF3/PRR5-ELF4. (Figure: inferred network over LHY, CCA1, TOC1, ELF3, ELF4, GI, PRR9, PRR5, PRR3, with false negatives and false positives marked.)
True positives (TP) = 8, false positives (FP) = 13, false negatives (FN) = 5, true negatives (TN) = 9² - 8 - 13 - 5 = 55.
Sensitivity = TP/(TP+FN) = 8/13 = 62%; specificity = TN/(TN+FP) = 55/68 = 81%.
Overview of the plant clock model (Locke et al., Mol. Syst. Biol. 2006; figure: morning loop with LHY/CCA1 and PRR9/PRR7, evening loop with TOC1, Y (GI), the unknown component X, and ZTL). The unknown component X allows a delay of more than 8h between TOC1 and LHY/CCA1 expression. This is the model as published; note that some genes are merged for parsimony. There is data for each link except the TOC1 inhibition of GI, and X is an unknown TOC1-dependent component that activates LHY and CCA1. Given the long delay, it is unlikely the model would learn the X link, but there is a BGM link from ELF3 to CCA1 that is exactly as expected for X.
Machine learning methods: Bayesian networks (overview); integration of biological prior knowledge; non-homogeneous Bayesian networks for non-stationary processes; circadian gene regulatory network in Arabidopsis thaliana; current work.
Flexible network structure with regularization Joint work with Sophie Lèbre and Frank Dondelinger
Drosophila melanogaster: expression of 11 muscle development genes over 66 time points. Fixed structure, flexible parameters. (Figure: inferred segmentation along the time axis.) Morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult. The gene expression program governing the transition to adult morphology is active well before the fly emerges from the pupa.
Transition probabilities: flexible structure with regularization. Morphogenetic transitions: embryo → larva, larva → pupa, pupa → adult. (A sketch of one possible regularization is given below.)
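One common way such regularization can be formulated (an assumption for illustration, not necessarily the exact prior used in this work): the network structures of adjacent segments h and h+1 are coupled by penalizing the number of edges in which they differ, controlled by a hyperparameter β:

```latex
P(M_{h+1} \mid M_h, \beta) \;\propto\; \exp\!\bigl(-\beta \,\lvert M_{h+1} \ominus M_h \rvert\bigr),
```

where |M_{h+1} ⊖ M_h| is the number of edge changes between the two structures; large β keeps the structure nearly fixed across segments, small β lets it vary freely.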
Comparison of the method of Ahmed & Xing with that of Dondelinger, Lèbre & Husmeier.
Summary
- Mechanistic models
- Bayesian networks
- Integration of biological prior knowledge
- Non-homogeneous Bayesian networks for non-stationary processes
Any questions? Thank you!