Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 Statistics Achim Tresch UoC / MPIPZ Cologne treschgroup.de/OmicsModule1415.html Computational Inference of Regulatory Networks from.

Similar presentations


Presentation on theme: "1 Statistics Achim Tresch UoC / MPIPZ Cologne treschgroup.de/OmicsModule1415.html Computational Inference of Regulatory Networks from."— Presentation transcript:

1 1 Statistics Achim Tresch UoC / MPIPZ Cologne treschgroup.de/OmicsModule1415.html tresch@mpipz.mpg.de Computational Inference of Regulatory Networks from Omics Data

2 Biological networks vs. Network Models Learning Networks from observational data: - Gaussian Graphical Models - Bayesian Networks Learning from interventional data: - Pruning - Nested Effects Model Topics

3 Which biological Network?

4 Transcription Networks

5 From KEGG database

6 qualitative semiquantitative quantitative What do the arrows mean? Do they have a biological interpretation? Best suited for high dimensional, noisy data Which Network Model?

7 Qulitative Network Models Possible Models include Correlation Graphs Gaussian Graphical Models Bayesian Networks However: Correct Reconstruction of the complete regulatory network is impossible due to Lack of data Measurement error Oversimple/wrong model assumptions

8 Correlation Graphs

9 An expression profile is a collection of expression vectors { X g = (X g,s ) s є samples, g є Genes } Correlation graph: Depict genes as vertices of a graph and draw an undirected edge (i, j) if some correlation measure (Pearson correlation, Spearman rank correlation, Kendall’s tau) between X i and X j is sufficiently different from zero. Advantage: This representation of the marginal dependence structure is easy to interpret and can be estimated accurately even if “p » N”, i.e., the number of features p (genes) is much larger than the number of observations N (expression measurements). Application: Stuart et al. (Science, 2003) build a graph from coexpression across multiple organisms. Correlation Graphs

10 It is impossible to distinguish direct from indirect dependence Three reasons why X, Y, and Z may be highly correlated: Possible remedies: search for correlations which cannot be explained by other variables. measure effects of gene perturbations Problems of correlation based approaches A strong correlation is not a strong evidence for regulatory dependence (lots of false positives) rather than a low correlation is a strong evidence for no regulatory edge.

11 In other words: Knowing Z, knowing Y is irrelevant for knowing X (and vice versa). Z explains any observed dependence between X and Y. Conditional Independence

12 Gaussian Graphical Models (GGM)

13 If we assume that the common expression distribution of all genes follows a multivariate Gaussian distribution (which of course is never the case), conditional independence can be assessed as follows:

14 Full conditional relationships can only be accurately estimated if the number of samples N is relatively large compared to the number of variables p. Graph from Basso et al., Nat Genetics 2005 Thus, if p » N, you can... use the Moore-Penrose pseudoinverse, bootstrap aggregation and shrinkage estimators to stabilize the result ( e.g. Schäfer and Strimmer, Bioinformatics 05 ) resort to a simpler model that does not rely on full conditional independence What if p » N?

15 Modified Gaussian Graphical Models (GGM) Correlation Graphs GGMs Wille / Bühlmann All methods failed to accurately reconstruct networks, even if they were of very moderate size (~20)

16 The sprinkler network J = Season S = Sprinkler W = Wet Lawn R = Rain pa(J) = ∅ pa(S) = {J} pa(R) = {J} pa(N) = {S,R} Bayesian Networks: Children depend on parents A directed acyclic graph (DAG) defines parent- child relations and conditional probabilities.

17 E.g. The common distribution of {J,S,R,N} can be coded by the graph topology and 3+4+4+4 =15 real numbers: (instead of 4∙2∙2∙2-1 = 31 real numbers for an arbitrary distribution) Bayesian Networks: The Sprinkler Network

18 Bayesian Networks Problems: Given a directed acyclic graph (DAG), learn the local probability distributions and score the DAG according to its likelihood („how good does this graph fit the data“?) → Parameter estimation, Bayesian Dirichlet metric (Cooper, Herskovits 1992) Find the topology(-ies) of the underlying DAG The latter point is the crucial problem, since there may be DAGs that are equally likely, and there are in general billions of DAGs that score comparably well.

19 Network Identification Model Selection: Find a model wíth maximal (or at least exceptionally high) posterior probability P(Γ|Data) and assume that this ist the true network topology Model Averaging: Draw a large number of random samples Γ from the distribution P(Γ |Data) and approximate P(edge present | Data) by the sum

20 Walk randomly through the space of all possible topologies x y z x y z x y z x y z delete an edgereverse an edgeadd an edge „MCMC Moves“ Γ Some Neighbours of Γ MCMC sampling of directed acyclic graphs

21 Examples of equivalent and non-equivalent graphs m0m0 m1m1 m2m2 m3m3 Each common distribution P(a,b,c), that can be modelled with BN m 0 can also be modelled with m 1 and m 2, and vice versa. However there exist distributions P(a,b,c), which can be modeled with BN m 3, but not with m 0. Causality in Bayesian Networks: Likelihood equivalence

22 Learning from interventional data (A. Fire, C. Mello, Nobel prize 2006)

23 Learning from interventional data

24 TP53 chosen as model-gene for siRNA mediated knock-down nuclear protein, DNA-binding, postulated to bind as tetramer, activates expression of downstream genes that inhibit growth and/or invasion, thus function as a tumor suppressor very well known in literature, easy to find interaction partners Transfection of siRNA into HeLa cell line using HiPerFect transfection-reagent (HeLa cell line is known to exhibit good transfection efficiency) Minimal concentration of siRNA used to reduce possible off- target effects Silencing efficiency regarding mRNA of TP53 was verified using qRT-PCR Proof-of-principle Experiment

25

26 increase decrease Proof of principle: The TP53 network found by siRNA experiment Literature network

27 Pruning of Gene interaction Graphs necessarily direct interactions optional, possibly indirect interactions G2G2 G1G1 G3G3 G4G4 G5G5 Interaction graph

28 Given a gene interaction graph, find edges that survive Occam‘s razor (14 th century): Is this edge “dispensible” or not? “non est ponenda pluritas sine necessitate” (pluralities ought not to be proposed without necessity. Or: Among those models which explain the data equally well, the most simple is to be preferred over the others) G2G2 G1G1 G3G3 Pruning of Gene interaction Graphs Need for algorithm to define and find a minimal consistent and biologically meaningful graph

29 “Trivial”. Remove all edges a→b for which there exists a bypass (a longer way from a to b). “Signs“. Let every edge of the observational graph hava a sign +1 or –1 according to the direction of the regulatory effect. Remove a→b if product of all signs along the path a→... →b equals the sign of the edge a→b “Weights“. Let every edge be weighted with a non- negative number. Edges with low weights are meant to represent edges for which there is strong evidence for a direct regulatory interaction. Remove a→b if sum of the weights along the path a→... →b is smaller than the weight of the edge a→b ax1x1 b Finding non-necessary edges

30 Unpruned vs. Pruned Network The methods do not remove enough edges, too many false positives. Too many interventions (and replicate measurements of them) needed for reliable estimation.

31 Motivation Information flow through a signalling cascade Receptor Protein Small signalling Molecule Transcription Factor Target Gene

32 Motivation Receptor Protein Small signalling Molecule Transcription Factor Target Gene Task: Reconstruct the arrows Information flow through a signalling cascade

33 Motivation Receptor Protein Small signalling Molecule Transcription Factor Target Gene (extracellular hormone concentration) (intracellular metabolite concentration) (protein concentration) (mRNA expression) Task: Reconstruct the arrowsTask: Reconstruct the arrows, without measuring all components Information flow through a signalling cascade

34 Motivation Task: Reconstruct the arrows, without measuring all components Task: Reconstruct the arrows, without measuring all components, from noisy observational data too ambitious… Information flow through a signalling cascade

35 Task: Reconstruct the wiring of a small subset of components, perform interventions on these components. Focus your question Motivation Information flow through a signalling cascade

36 Actions graph: Adjacency matrix Γ Effects graph: Adjacency matrix Θ Actions Predicted effects F t Observables 100 110 110 111 Assumption: Each observable is linked to exactly one action Definition of Nested Effects Models (NEMs) Predicted effect of the leftmost action on the bottom observable (0 = no effect, 1 = effect) Definition: A nested effects model (NEM) is a model if F = Γ Θ

37 Actions Observables Definition of Nested Effects Models (NEMs) 100 110 110 111 Why „nested“ ? █  █  █ Predicted effects If the actions graph is transitively closed, then the effects are nested in the sense that a → b implies effects(a)  effects(b) The present formulation of a NEM does not rely on this assumption

38 a s Actions Observables Effect of action a on observation s R s,a 100 110 110 111 Predicted effects = F t Measured effects = R Likelihood of Nested Effects Models

39 a s 100 110 110 111 Predicted effects FMeasured effects R Likelihood of Nested Effects Models 0.8-2.5-0.2 3.10.9-1.3 0.12.40.1 -0.70.41.7 Observables Actions

40 Let N=0 be the null matrix, i.e. the model predicting no effects at all. Then, Likelihood of Nested Effects Models

41 Results on simulated Data True graphs Γ,Θ simulatedm easure- ments (R) ideal measure- ments ( ΓΘ) R/Bioconductor package: nem

42 Results on simulated Data True graphEstimated graph Distribution of the likelihoods 12 edges, 2 12 =4096 action graphs, ~ 4seconds

43 Results on simulated Data Rank of the true model´s likelihood

44 Automatic Feature Selection I Problem: Some (most) of the observables may not be assigned to any action. Solution: Introduce a „null action“, i.e. add a null row to the actions graph. Observations that are assigned to this action do not contribute to the likelihood. Advantages: The variable selection is included into the estimation procedure, additional time cost is approximately zero.

45 Results on simulated Data 30 genes assigned, 60 genes unassigned True graph Estimated graph

46 Automatic Feature Selection II 50 observations assigned, 5000 not assigned Measurements matrix R True graphs Γ,Θ Penalize R: R → R - δ

47 Automatic Feature Selection II 0 2 4 # false Negatives Define the Posterior per Observation Score (PpO) Scoring function

48 Reconstructions for different choices of the regularization parameter δ Model with 38 responders Model with 77 respondersModel with 300 responders True random Nested Effects Model with 50 responders and 1000 non-responding observables Noisy ratio matrix R Automatic Feature Selection II (non-responding observables were removed from the plot)

49 Application to Drosophila data

50 Results on the Boutros Drosophila Dataset LPS stimulation (unperturbed system) Feature Selection with the help of the „Control“ experiment Handcraft Model

51 Results on the Boutros Drosophila Dataset Automatic Feature Selection, without Control experiment: Estimated graph (120 genes selected)

52 graph: basic class definitions and functionality RBGL: interface to graph algorithms (e.g. shortest path, connectivity) Rgraphviz: Different layout algorithms. Node plotting, line type, colour etc. can be controlled by the user. dynamicGraph: visualize interactive Graphs with TclTK. GeneTS: Estimate GGMs from Microarray Data. NEM: “Nested Effects Models”, implements the Signal- Effects Model Networks in R/Bioconductor

53 Some Pathway Databases: – KEGG (http://www.genome.jp/kegg)http://www.genome.jp/kegg – TRANSPATH (http://www.biobase.de)http://www.biobase.de – Biocarta (http://www.biocarta.com)http://www.biocarta.com – Reactome (http://www.reactome.org)http://www.reactome.org – HumanCyc (http://humancyc.org)http://humancyc.org – Signal Transduktion Knowledge Environment (http://stke.sciencemag.org)http://stke.sciencemag.org Visualization tools – GeneMAPP (www.genemapp.org)www.genemapp.org – GoMiner (http://discover.nci.nih.gov/gominer)http://discover.nci.nih.gov/gominer – Bioconductor/Graphviz (http://www.bioconductor.org)http://www.bioconductor.org – Cytoscape (http://www.cytoscape.org)http://www.cytoscape.org Tools for Network analysis

54 Acknowledgements Florian Markowetz (CRUK Cambridge) Tim Beissbarth (Uni Göttingen) Andreas Buness (EMBL) Markus Ruschhaupt (DKFZ) Holger Fröhlich (Uni Bonn)Mark Fellmann (Roche) Holger Sültmann (Uni Heidelberg) Micheal Boutros (DKFZ)


Download ppt "1 Statistics Achim Tresch UoC / MPIPZ Cologne treschgroup.de/OmicsModule1415.html Computational Inference of Regulatory Networks from."

Similar presentations


Ads by Google