1 Statistics Achim Tresch UoC / MPIPZ Cologne treschgroup.de/OmicsModule1415.html Computational Inference of Regulatory Networks from.

Presentation transcript:

1 Statistics Achim Tresch UoC / MPIPZ Cologne treschgroup.de/OmicsModule1415.html Computational Inference of Regulatory Networks from Omics Data

Topics
Biological networks vs. network models
Learning networks from observational data:
- Gaussian Graphical Models
- Bayesian Networks
Learning from interventional data:
- Pruning
- Nested Effects Models

Which biological Network?

Transcription Networks

From KEGG database

Which Network Model?
Qualitative / semiquantitative / quantitative
What do the arrows mean? Do they have a biological interpretation?
Best suited for high-dimensional, noisy data

Qualitative Network Models
Possible models include:
- Correlation Graphs
- Gaussian Graphical Models
- Bayesian Networks
However: correct reconstruction of the complete regulatory network is impossible due to
- lack of data
- measurement error
- oversimplified or wrong model assumptions

Correlation Graphs

Correlation Graphs
An expression profile is a collection of expression vectors { X_g = (X_{g,s}), s ∈ Samples, g ∈ Genes }.
Correlation graph: depict genes as vertices of a graph and draw an undirected edge (i, j) if some correlation measure (Pearson correlation, Spearman rank correlation, Kendall's tau) between X_i and X_j is sufficiently different from zero.
Advantage: this representation of the marginal dependence structure is easy to interpret and can be estimated accurately even if "p ≫ N", i.e., the number of features p (genes) is much larger than the number of observations N (expression measurements).
Application: Stuart et al. (Science, 2003) build a graph from coexpression across multiple organisms.
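The correlation-graph construction described above can be sketched in a few lines. This is a minimal illustration, assuming Pearson correlation and a fixed cutoff; the function name `correlation_graph`, the `threshold` value and the toy data are illustrative choices, not taken from the slides.

```python
import numpy as np

def correlation_graph(X, threshold=0.8):
    """Return undirected edges (i, j), i < j, with |Pearson r| >= threshold.

    X has shape (genes, samples), matching X_g = (X_{g,s}).
    The threshold is an illustrative choice, not a recommended value.
    """
    corr = np.corrcoef(X)  # gene-by-gene Pearson correlation matrix
    n = corr.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if abs(corr[i, j]) >= threshold]

# Toy data: gene 1 is a noisy copy of gene 0, gene 2 is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=50)
X = np.vstack([base,
               base + 0.1 * rng.normal(size=50),
               rng.normal(size=50)])
edges = correlation_graph(X, threshold=0.8)
```

In practice the cutoff would be chosen by a significance test or a false discovery rate criterion rather than fixed by hand.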

Problems of correlation-based approaches
It is impossible to distinguish direct from indirect dependence: there are three reasons why X, Y, and Z may be highly correlated (shown as a figure on the slide).
A strong correlation is not strong evidence for a regulatory dependence (lots of false positives); rather, a low correlation is strong evidence for the absence of a regulatory edge.
Possible remedies:
- search for correlations which cannot be explained by other variables
- measure effects of gene perturbations

Conditional Independence
In other words: knowing Z, knowing Y is irrelevant for knowing X (and vice versa). Z explains any observed dependence between X and Y.
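The formula on this slide did not survive extraction. For reference, the standard textbook definition of conditional independence of X and Y given Z, which the verbal description above paraphrases, reads as follows (the second equivalence holds wherever P(Y, Z) > 0):

```latex
X \perp Y \mid Z
\quad\Longleftrightarrow\quad
P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)
\quad\Longleftrightarrow\quad
P(X \mid Y, Z) = P(X \mid Z)
```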

Gaussian Graphical Models (GGM)

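If we assume that the joint expression distribution of all genes is multivariate Gaussian (which of course is never exactly the case), conditional independence can be assessed via partial correlations: X_i and X_j are conditionally independent given all other genes exactly when their partial correlation, or equivalently entry (i, j) of the inverse covariance matrix, is zero.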
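The slide's own formula is not preserved, so the sketch below uses the standard Gaussian result: with Ω the inverse covariance matrix, the partial correlation of genes i and j given all others is −Ω_ij / sqrt(Ω_ii Ω_jj). The function name and the toy data (a common cause Z driving X and Y) are illustrative assumptions.

```python
import numpy as np

def partial_correlations(X):
    """Partial correlation of each gene pair given all other genes.

    X has shape (genes, samples). Uses the standard Gaussian identity
    pcor_ij = -omega_ij / sqrt(omega_ii * omega_jj), omega = inverse covariance.
    Requires more samples than genes so that the covariance is invertible.
    """
    omega = np.linalg.inv(np.cov(X))
    d = np.sqrt(np.diag(omega))
    pcor = -omega / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Toy data: Z is a common cause of X and Y (rows: X, Y, Z).
rng = np.random.default_rng(1)
z = rng.normal(size=2000)
x = z + rng.normal(size=2000)
y = z + rng.normal(size=2000)
data = np.vstack([x, y, z])

marginal = np.corrcoef(data)           # X and Y look correlated (about 0.5)
partial = partial_correlations(data)   # but not once Z is accounted for
```

Here the marginal correlation between X and Y is clearly non-zero, while their partial correlation given Z is close to zero, so a GGM would draw no edge between them.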

What if p ≫ N?
Full conditional relationships can only be accurately estimated if the number of samples N is relatively large compared to the number of variables p. Thus, if p ≫ N, you can...
- use the Moore-Penrose pseudoinverse, bootstrap aggregation and shrinkage estimators to stabilize the result (e.g. Schäfer and Strimmer, Bioinformatics 05)
- resort to a simpler model that does not rely on full conditional independence
Graph from Basso et al., Nat Genetics 2005
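To make the p ≫ N problem concrete, the sketch below shows that the sample covariance is singular when there are fewer samples than genes, and two of the workarounds named above. It is only an illustration: the shrinkage target (the diagonal of the covariance) and the fixed intensity `lam` are assumptions made here, whereas Schäfer and Strimmer estimate the intensity analytically.

```python
import numpy as np

def shrunk_precision(X, lam=0.2):
    """Invert a covariance matrix shrunk towards its own diagonal.

    X has shape (genes, samples). `lam` in (0, 1] is a hand-picked
    shrinkage intensity (an assumption of this sketch).
    """
    S = np.cov(X)
    target = np.diag(np.diag(S))
    S_shrunk = (1.0 - lam) * S + lam * target  # positive definite for lam > 0
    return np.linalg.inv(S_shrunk)

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 10))        # p = 30 genes, N = 10 samples
S = np.cov(X)
rank = np.linalg.matrix_rank(S)      # below p, so S cannot be inverted

omega_pinv = np.linalg.pinv(S)       # Moore-Penrose pseudoinverse
omega_shrunk = shrunk_precision(X)   # regularized inverse
```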

Modified Gaussian Graphical Models (GGM)
Methods compared: Correlation Graphs, GGMs, Wille / Bühlmann
All methods failed to accurately reconstruct networks, even networks of very moderate size (~20)

Bayesian Networks: Children depend on parents
A directed acyclic graph (DAG) defines parent-child relations and conditional probabilities.
The sprinkler network:
J = Season, S = Sprinkler, R = Rain, N = Wet Lawn
pa(J) = ∅, pa(S) = {J}, pa(R) = {J}, pa(N) = {S,R}

Bayesian Networks: The Sprinkler Network
E.g. the joint distribution of {J,S,R,N} (J has four states, the other variables are binary) can be coded by the graph topology and 3 + 4 + 4 + 4 = 15 real numbers, instead of 4∙2∙2∙2 − 1 = 31 real numbers for an arbitrary distribution.
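The two parameter counts can be checked mechanically. The sketch below assumes, as implied by the product 4∙2∙2∙2, that the season J has four states and S, R, N are binary; each node contributes (number of states − 1) free parameters per configuration of its parents. The helper name `free_parameters` is illustrative.

```python
from math import prod

# Number of states per variable (assumed from 4*2*2*2 on the slide).
states = {"J": 4, "S": 2, "R": 2, "N": 2}
# Parent sets pa(.) of the sprinkler network.
parents = {"J": [], "S": ["J"], "R": ["J"], "N": ["S", "R"]}

def free_parameters(states, parents):
    """Count free parameters of a discrete Bayesian network."""
    total = 0
    for node, pa in parents.items():
        parent_configs = prod(states[p] for p in pa)  # 1 for a root node
        total += (states[node] - 1) * parent_configs
    return total

bn_params = free_parameters(states, parents)   # 3 + 4 + 4 + 4
full_params = prod(states.values()) - 1        # unrestricted joint distribution
```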

Bayesian Networks: Problems
- Given a directed acyclic graph (DAG), learn the local probability distributions and score the DAG according to its likelihood ("how well does this graph fit the data?") → Parameter estimation, Bayesian Dirichlet metric (Cooper, Herskovits 1992)
- Find the topology(-ies) of the underlying DAG
The latter point is the crucial problem, since there may be DAGs that are equally likely, and there are in general billions of DAGs that score comparably well.

Network Identification
Model Selection: find a model with maximal (or at least exceptionally high) posterior probability P(Γ|Data) and assume that this is the true network topology.
Model Averaging: draw a large number of random samples Γ from the distribution P(Γ|Data) and approximate P(edge present | Data) by the normalized sum over the samples, i.e. the fraction of sampled graphs Γ that contain the edge.
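The model-averaging estimate amounts to averaging adjacency matrices. The sketch below assumes the samples are already available as 0/1 adjacency matrices (for example from an MCMC run); the three toy matrices and the function name `edge_posterior` are made up for illustration.

```python
import numpy as np

def edge_posterior(samples):
    """Estimate P(edge i->j | Data) as the fraction of sampled graphs
    that contain the edge. `samples` is a list of 0/1 adjacency matrices
    assumed to be draws from P(Gamma | Data)."""
    return np.mean(np.stack(samples), axis=0)

# Three hypothetical sampled topologies on nodes x, y, z.
samples = [
    np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]]),  # x->y, y->z
    np.array([[0, 1, 0], [0, 0, 0], [0, 1, 0]]),  # x->y, z->y
    np.array([[0, 0, 0], [1, 0, 1], [0, 0, 0]]),  # y->x, y->z
]
posterior = edge_posterior(samples)
```

An edge would then be reported if its estimated posterior exceeds a chosen cutoff.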

MCMC sampling of directed acyclic graphs
Walk randomly through the space of all possible topologies.
"MCMC moves" from a graph Γ (on nodes x, y, z) to some neighbours of Γ: delete an edge, reverse an edge, add an edge.
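The three move types can be sketched as a function that lists all neighbours of a DAG. This only illustrates the proposal step; a full structure MCMC sampler would additionally score each graph and accept or reject moves (e.g. with a Metropolis-Hastings rule), which is not shown. Function names are illustrative.

```python
import numpy as np

def is_acyclic(adj):
    """True if the directed graph has no cycle (repeatedly remove source nodes)."""
    remaining = list(range(adj.shape[0]))
    while remaining:
        sources = [v for v in remaining if not adj[remaining, v].any()]
        if not sources:
            return False  # every remaining node has an incoming edge: cycle
        remaining = [v for v in remaining if v not in sources]
    return True

def neighbours(adj):
    """All DAGs reachable from `adj` by deleting, reversing or adding one edge."""
    n = adj.shape[0]
    result = []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if adj[i, j]:
                deleted = adj.copy()
                deleted[i, j] = 0
                result.append(("delete", deleted))
                flipped = deleted.copy()
                flipped[j, i] = 1
                if is_acyclic(flipped):
                    result.append(("reverse", flipped))
            elif not adj[j, i]:
                added = adj.copy()
                added[i, j] = 1
                if is_acyclic(added):
                    result.append(("add", added))
    return result

# Gamma: x -> y -> z
gamma = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
moves = [kind for kind, _ in neighbours(gamma)]
```

For the chain x → y → z, adding z → x is excluded because it would close a cycle.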

Causality in Bayesian Networks: Likelihood equivalence
Examples of equivalent and non-equivalent graphs: m0, m1, m2, m3 (shown as a figure on the slide).
Each joint distribution P(a,b,c) that can be modelled with BN m0 can also be modelled with m1 and m2, and vice versa. However, there exist distributions P(a,b,c) which can be modelled with BN m3, but not with m0.

Learning from interventional data (A. Fire, C. Mello, Nobel prize 2006)

Learning from interventional data

Proof-of-principle Experiment
- TP53 chosen as model gene for siRNA-mediated knock-down: nuclear protein, DNA-binding, postulated to bind as tetramer; activates expression of downstream genes that inhibit growth and/or invasion, and thus functions as a tumor suppressor; very well known in the literature, easy to find interaction partners
- Transfection of siRNA into HeLa cell line using HiPerFect transfection reagent (HeLa cell line is known to exhibit good transfection efficiency)
- Minimal concentration of siRNA used to reduce possible off-target effects
- Silencing efficiency regarding mRNA of TP53 was verified using qRT-PCR

Proof of principle: the TP53 network found by the siRNA experiment, compared with the literature network (edge legend: increase / decrease)

Pruning of Gene Interaction Graphs
Interaction graph on genes G1 to G5 (shown as a figure on the slide)
Edge types: necessarily direct interactions; optional, possibly indirect interactions

Pruning of Gene Interaction Graphs
Given a gene interaction graph, find edges that survive Occam's razor (14th century): "non est ponenda pluritas sine necessitate" (pluralities ought not to be proposed without necessity; or: among those models which explain the data equally well, the simplest is to be preferred over the others).
Example graph on G1, G2, G3: is this edge "dispensable" or not?
Need for an algorithm to define and find a minimal, consistent and biologically meaningful graph

Finding non-necessary edges
- "Trivial". Remove all edges a→b for which there exists a bypass (a longer path from a to b, e.g. a→x1→b).
- "Signs". Let every edge of the observational graph have a sign +1 or −1 according to the direction of the regulatory effect. Remove a→b if the product of all signs along the path a→...→b equals the sign of the edge a→b.
- "Weights". Let every edge be weighted with a non-negative number. Edges with low weights represent edges for which there is strong evidence for a direct regulatory interaction. Remove a→b if the sum of the weights along the path a→...→b is smaller than the weight of the edge a→b.
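Two of the pruning rules above can be sketched as follows. This is a minimal illustration, assuming the interaction graph is acyclic and given as a set of edges (for "Trivial") or a dict of edge weights (for "Weights"); the function names are illustrative, and the weights rule is evaluated against the lightest indirect path, found here with Dijkstra's algorithm.

```python
import heapq

def prune_trivial(edges):
    """Drop a->b whenever b is also reachable from a via a longer path.
    Assumes the graph is acyclic."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)

    def has_bypass(a, b):
        stack = [x for x in succ.get(a, ()) if x != b]
        seen = set(stack)
        while stack:
            v = stack.pop()
            if v == b:
                return True
            for w in succ.get(v, ()):
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        return False

    return {(a, b) for a, b in edges if not has_bypass(a, b)}

def prune_weights(weighted_edges):
    """Drop a->b if some indirect path a->...->b has a smaller total weight."""
    succ = {}
    for (a, b), w in weighted_edges.items():
        succ.setdefault(a, []).append((b, w))

    def lightest_indirect(a, b):
        dist = {a: 0.0}
        heap = [(0.0, a)]
        while heap:
            d, v = heapq.heappop(heap)
            if d > dist.get(v, float("inf")):
                continue  # outdated heap entry
            if v == b:
                return d
            for w, weight in succ.get(v, ()):
                if v == a and w == b:
                    continue  # the direct edge itself is not a bypass
                nd = d + weight
                if nd < dist.get(w, float("inf")):
                    dist[w] = nd
                    heapq.heappush(heap, (nd, w))
        return float("inf")

    return {e: w for e, w in weighted_edges.items()
            if not lightest_indirect(*e) < w}

pruned = prune_trivial({("a", "x1"), ("x1", "b"), ("a", "b")})
kept_heavy = prune_weights({("a", "x1"): 1.0, ("x1", "b"): 1.0, ("a", "b"): 3.0})
kept_light = prune_weights({("a", "x1"): 1.0, ("x1", "b"): 1.0, ("a", "b"): 1.5})
```

With a direct edge of weight 3 the bypass (total weight 2) wins and a→b is removed; with weight 1.5 the direct edge is kept.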

Unpruned vs. Pruned Network
- The methods do not remove enough edges; too many false positives remain.
- Too many interventions (and replicate measurements of them) are needed for reliable estimation.

Motivation: Information flow through a signalling cascade
Receptor protein → small signalling molecule → transcription factor → target gene

Motivation: Information flow through a signalling cascade
Receptor protein → small signalling molecule → transcription factor → target gene
Task: Reconstruct the arrows

Motivation: Information flow through a signalling cascade
Receptor protein (extracellular hormone concentration) → small signalling molecule (intracellular metabolite concentration) → transcription factor (protein concentration) → target gene (mRNA expression)
Task: Reconstruct the arrows, without measuring all components

Motivation: Information flow through a signalling cascade
Task: Reconstruct the arrows, without measuring all components, from noisy observational data
too ambitious…

Motivation: Information flow through a signalling cascade
Focus your question
Task: Reconstruct the wiring of a small subset of components, perform interventions on these components.

Definition of Nested Effects Models (NEMs)
Actions graph: adjacency matrix Γ
Effects graph: adjacency matrix Θ
Assumption: each observable is linked to exactly one action
Predicted effects: F^T. Each entry is the predicted effect of one action on one observable (0 = no effect, 1 = effect); the slide highlights the entry for the leftmost action and the bottom observable.
Definition: A nested effects model (NEM) is a model in which F = ΓΘ
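The definition F = ΓΘ can be tried out on a small example. The matrices below are invented for illustration; the sketch assumes that Γ[a, b] = 1 means "action a acts on action b", that Γ carries ones on its diagonal (each action affects its own observables), and that Θ has shape (actions, observables) with exactly one 1 per column.

```python
import numpy as np

# Actions graph: a0 -> a1 -> a2, with ones on the diagonal (assumption).
gamma = np.array([[1, 1, 0],
                  [0, 1, 1],
                  [0, 0, 1]])

# Effects graph: 4 observables, each linked to exactly one action.
theta = np.array([[1, 0, 0, 0],
                  [0, 1, 1, 0],
                  [0, 0, 0, 1]])
assert (theta.sum(axis=0) == 1).all()  # one action per observable

# Predicted effects: F[a, s] = 1 if perturbing action a affects observable s.
F = gamma @ theta
```

Because each column of Θ contains a single 1, the ordinary matrix product already yields a 0/1 matrix. Note that a0 does not affect the last observable here, since this Γ is not transitively closed.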

Definition of Nested Effects Models (NEMs): Why "nested"?
If the actions graph is transitively closed, then the predicted effects are nested in the sense that a → b implies effects(a) ⊇ effects(b).
The present formulation of a NEM does not rely on this assumption.
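The nesting property can be checked numerically. The sketch below assumes Γ[a, b] = 1 encodes a → b, Θ has shape (actions, observables), and the predicted effects are F = ΓΘ; the helper names and the toy matrices are illustrative.

```python
import numpy as np

def transitive_closure(gamma):
    """Transitive (and reflexive) closure of a 0/1 adjacency matrix."""
    n = gamma.shape[0]
    closure = ((gamma + np.eye(n, dtype=int)) > 0).astype(int)
    for _ in range(n):  # repeated squaring; n rounds are more than enough
        closure = ((closure @ closure) > 0).astype(int)
    return closure

def effects_are_nested(gamma, theta):
    """Check that a -> b implies effects(a) is a superset of effects(b)."""
    F = (gamma @ theta) > 0
    n = gamma.shape[0]
    return all(np.all(F[a] >= F[b])
               for a in range(n) for b in range(n)
               if a != b and gamma[a, b])

gamma = np.array([[1, 1, 0],
                  [0, 1, 1],
                  [0, 0, 1]])          # a0 -> a1 -> a2, not transitively closed
theta = np.array([[1, 0, 0, 0],
                  [0, 1, 1, 0],
                  [0, 0, 0, 1]])

nested_raw = effects_are_nested(gamma, theta)
nested_closed = effects_are_nested(transitive_closure(gamma), theta)
```

Without the closure, a0 → a1 holds but a0 misses the observable attached to a2, so the effects are not nested; after closing Γ they are.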

Likelihood of Nested Effects Models
Predicted effects: F^T. Measured effects: R, where R_{s,a} is the effect of action a on observable s.

Likelihood of Nested Effects Models
Predicted effects F vs. measured effects R (axes: observables s, actions a)

Likelihood of Nested Effects Models
Let N = 0 be the null matrix, i.e. the model predicting no effects at all. Then,
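The formula that followed on this slide is not preserved. As a hedged sketch of how such a comparison against the null model can work: assume each entry R_{s,a} is a log-likelihood ratio of "effect" versus "no effect" for observable s under action a, and that the entries are independent. Under those assumptions (made here, not taken from the slide), the log-likelihood of a model F relative to the null model N is the sum of R over all predicted effects.

```python
import numpy as np

def log_likelihood_ratio(F, R):
    """log P(Data | F) - log P(Data | N) under the stated assumptions.

    F: (actions, observables) 0/1 matrix of predicted effects.
    R: (observables, actions) matrix of per-entry log-likelihood ratios.
    Only entries where an effect is predicted contribute.
    """
    return float(np.sum(F.T * R))

F = np.array([[1, 0],
              [1, 1]])             # 2 actions, 2 observables
R = np.array([[2.0, -1.0],
              [0.5, 1.5]])         # hypothetical measured log-ratios
null_model = np.zeros_like(F)

score = log_likelihood_ratio(F, R)            # 2.0 + (-1.0) + 1.5
null_score = log_likelihood_ratio(null_model, R)
```

The same quantity can be written as the trace of the matrix product F R.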

Results on simulated Data
Panels: true graphs Γ, Θ; simulated measurements (R); ideal measurements (ΓΘ)
R/Bioconductor package: nem

Results on simulated Data
Panels: true graph; estimated graph; distribution of the likelihoods
12 edges, 2^12 = 4096 action graphs, ~4 seconds
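The count of 4096 candidate graphs is consistent with exhaustive enumeration over 12 possible directed edges. The sketch below assumes this corresponds to 4 actions (4 · 3 = 12 ordered pairs), which the slide does not state explicitly, and that each Γ carries ones on its diagonal; scoring of the candidates is left out.

```python
from itertools import product

import numpy as np

def all_action_graphs(n):
    """Yield every directed graph on n actions as a 0/1 adjacency matrix
    with unit diagonal (an assumption of this sketch)."""
    off_diag = [(i, j) for i in range(n) for j in range(n) if i != j]
    for bits in product((0, 1), repeat=len(off_diag)):
        gamma = np.eye(n, dtype=int)
        for (i, j), bit in zip(off_diag, bits):
            gamma[i, j] = bit
        yield gamma

n_actions = 4                                   # assumed: 4 * 3 = 12 possible edges
n_graphs = sum(1 for _ in all_action_graphs(n_actions))
```

The number of graphs grows as 2^(n(n−1)), so exhaustive search is only feasible for a handful of actions.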

Results on simulated Data
Rank of the true model's likelihood

Automatic Feature Selection I
Problem: Some (most) of the observables may not be assigned to any action.
Solution: Introduce a "null action", i.e. add a null row to the actions graph. Observations that are assigned to this action do not contribute to the likelihood.
Advantages: The variable selection is included in the estimation procedure; the additional time cost is approximately zero.

Results on simulated Data
30 genes assigned, 60 genes unassigned
Panels: true graph; estimated graph

Automatic Feature Selection II
50 observations assigned, 5000 not assigned
Panels: measurements matrix R; true graphs Γ, Θ
Penalize R: R → R − δ

Automatic Feature Selection II
Scoring function: define the Posterior per Observation Score (PpO)
Plot axis: # false negatives

Automatic Feature Selection II
True random Nested Effects Model with 50 responders and 1000 non-responding observables; noisy ratio matrix R
Reconstructions for different choices of the regularization parameter δ: model with 38 responders, model with 77 responders, model with 300 responders (non-responding observables were removed from the plot)

Application to Drosophila data

Results on the Boutros Drosophila Dataset
LPS stimulation (unperturbed system)
Feature selection with the help of the "Control" experiment
Handcraft model

Results on the Boutros Drosophila Dataset Automatic Feature Selection, without Control experiment: Estimated graph (120 genes selected)

Networks in R/Bioconductor
- graph: basic class definitions and functionality
- RBGL: interface to graph algorithms (e.g. shortest path, connectivity)
- Rgraphviz: different layout algorithms. Node plotting, line type, colour etc. can be controlled by the user.
- dynamicGraph: visualize interactive graphs with Tcl/Tk.
- GeneTS: estimate GGMs from microarray data.
- nem: "Nested Effects Models", implements the Signal-Effects Model

Tools for Network analysis
Some pathway databases:
- KEGG
- TRANSPATH
- Biocarta
- Reactome
- HumanCyc
- Signal Transduction Knowledge Environment
Visualization tools:
- GenMAPP
- GoMiner
- Bioconductor/Graphviz
- Cytoscape

Acknowledgements
Florian Markowetz (CRUK Cambridge)
Tim Beissbarth (Uni Göttingen)
Andreas Buness (EMBL)
Markus Ruschhaupt (DKFZ)
Holger Fröhlich (Uni Bonn)
Mark Fellmann (Roche)
Holger Sültmann (Uni Heidelberg)
Michael Boutros (DKFZ)