1 Methods for evaluating inference algorithms June, 2005 Omer Berkman Tel Aviv University, Israel.

2 Introduction A central goal of molecular biology is to understand the regulatory mechanisms of gene transcription and protein synthesis. One major goal of functional genomics research is to take large sets of biological data and elucidate functional interactions between elements in a causal pathway or network. The invention of DNA microarrays, which measure the abundance of thousands of mRNA targets simultaneously, has been an important milestone. Many approaches to the reverse engineering of genetic regulatory networks from gene expression data have been explored.

3 Reminder of the problem When can genetic networks be identified from microarray data? According to Zak et al. (2002) – not too soon: Identifying the network architecture may only be possible when a rich microarray time course is coupled with a priori biological information. Interactions between thousands of genes have to be learned from short time series of typically only about a dozen measurements.

4 Is the inference possible? It is not clear whether an elicitation of biological structures from expression data is at all possible (whether the posterior probabilities over structures can be expected to be sufficiently informative). Most studies assess the inference results by comparing predicted regulatory interactions with those known from the biological literature. This approach is controversial due to the absence of known gold standards, which renders the estimation unreliable and difficult.

5 Biological literature – problem one Having detected an interaction, authors tend to delve into the literature to confirm their findings. They back up their results with interactions between proteins which are similar to the ones encoded by the genes studied in the performed experiment. First, sequence similarity does not necessarily imply similar functions. Second, there is an inherent arbitrariness in deciding what is similar. One may therefore suspect that some of the reported true interactions are spurious and do not really support the interactions detected in the reverse engineering procedure.

6 Biological literature – problem two The second and more serious drawback is the difficulty in estimating the false detection rate. This is because, on predicting a gene interaction that is not supported by the literature, it is impossible to decide, without further expensive interventions in the form of multiple gene knock-out experiments, whether the algorithm has discovered a new, previously unknown interaction, or whether it has flagged a spurious edge.

7 Simulation motivation Some of the predicted interactions are biologically reasonable, some appear to be unreasonable, but the validity of the vast majority is difficult to assess. Testing all of the predicted interactions experimentally would take decades. No suitable approach has been formulated for evaluating the effectiveness at recovering models of complex biological systems from limited data. To overcome this limitation, recent works used biologically reasonable simulated data.

8 Simulation approach benefits Enables evaluation of different algorithms. Enables better understanding of the inference algorithms. Enables setting parameters such as noise level, network topology and sampling interval. Enables learning with unlimited data. Gives indications about conditions that are not yet realistic in practice. Enables the a priori design of experiments and data-collection protocols that are amenable to functional network inference.

9 Simulation approach - illustration Build a known network. Create a simulator in which we know all the rules. Sample data from it. Present the sampled data to the inference algorithm. Evaluate the algorithm output. (Smith et al. 2002)

10 Smith et al. (2002) - introduction The first approach for evaluating the effectiveness of inference algorithms at recovering models from limited data. They created a data simulator, sampled the simulated data, and used the sampled data to evaluate the effectiveness of the algorithm they developed. The first application to modeling complex systems at multiple biological levels of organization beyond genetic regulation.

11 BrainSim – Smith et al. (2002) Design A simulator which models electrophysiological activity (0-400 Hz) and the expression levels of 100 genes (0-50 arbitrary units). 90% of the simulated variables are unregulated by other variables in the system and are included simply as distracters.

12 BrainSim – Initialization Activity begins at a random value. The initial level of a gene is called its ‘target’ value and is intended to correspond to its constitutive expression level. This value is selected as a random value in the range 0-10 for up-regulated genes, for down-regulated genes and 0-50 for the unregulated genes.

13 BrainSim – Dynamics rules 1. Degradation - for all genes, a constant amount, chosen to be 4, is added or subtracted to move them closer to their target levels. 2. Regulation - for all regulated genes, a proportion, chosen to be 0.2, of their regulator’s level is added or subtracted. For the unregulated genes, a random number, chosen in the range 0-5, is added or subtracted to simulate regulation by other unmeasured processes. 3. Noise - a random amount, chosen in the range 0-6, is added to or subtracted from each gene to simulate stochasticity in gene expression.
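The three dynamics rules can be sketched as a single update step. This is a minimal sketch, not the simulator of Smith et al.: the function and constant names are hypothetical, the degradation step is capped at the remaining distance to the target (an assumption, to avoid overshooting), the random amounts are drawn uniformly with a random sign (an assumption), and levels are clamped to the 0-50 range given on the design slide.

```python
import random

TARGET_STEP = 4      # degradation: constant pull toward the target level
REG_FRACTION = 0.2   # regulation: fraction of the regulator's level
UNREG_MAX = 5        # unregulated genes: random amount in 0-5
NOISE_MAX = 6        # noise: random amount in 0-6
LEVEL_MIN, LEVEL_MAX = 0.0, 50.0  # expression range (arbitrary units)


def clamp(x, lo=LEVEL_MIN, hi=LEVEL_MAX):
    """Keep a level inside the allowed expression range."""
    return max(lo, min(hi, x))


def step_gene(level, target, regulator_level=None, sign=+1, rng=random):
    """Advance one gene by one time step (hypothetical reading of the rules).

    regulator_level is None for an unregulated gene; sign is +1 for
    up-regulation and -1 for down-regulation.
    """
    # 1. Degradation: move by at most TARGET_STEP toward the target level.
    delta = target - level
    level += max(-TARGET_STEP, min(TARGET_STEP, delta))
    # 2. Regulation: add or subtract a proportion of the regulator's level,
    #    or a random amount for genes with no measured regulator.
    if regulator_level is not None:
        level += sign * REG_FRACTION * regulator_level
    else:
        level += rng.uniform(-UNREG_MAX, UNREG_MAX)
    # 3. Noise: random amount added to or subtracted from every gene.
    level += rng.uniform(-NOISE_MAX, NOISE_MAX)
    return clamp(level)
```

Calling `step_gene` repeatedly with `regulator_level=None` produces the kind of bounded random walk described for the unregulated genes.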

14 BrainSim – Simulated Data Generated data for Activity, Gene1, Gene10 and Gene12. Note the regulation effects and the time lags, as well as the random walk of the unregulated gene. Data were sampled every 5 time steps over times 45-150, causing loss of information. Values were discretized into quartiles, causing further information loss but making the data more robust.
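The two preprocessing steps, subsampling and quartile discretization, can be illustrated as follows. This is a sketch under assumptions: the helper names are hypothetical, the sampling window is treated as inclusive at both ends, and quartiles are assigned by rank, so tied values may land in different bins.

```python
def subsample(series, every=5, start=45, stop=150):
    """Keep one value every `every` time steps within [start, stop]."""
    last = min(stop, len(series) - 1)
    return [series[t] for t in range(start, last + 1, every)]


def quartile_discretize(values):
    """Map each value to a bin 0..3 according to the quartile of its rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * 4 // n  # ranks 0..n-1 fall into four equal groups
    return bins
```

With these defaults a series of 200 time steps is reduced to 22 samples, each of which then carries only two bits of information.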

15 NetworkInference – Smith et al. (2002) Design Takes a collection of observed data as input. Searches for Bayesian networks that are good at explaining the observed data without unnecessary complexity. Designed in a manner that is capable of searching for Dynamic Bayesian Networks from temporal data.

16 Bayesian networks definitions Representation of a joint probability distribution which consists of two components: G is a directed acyclic graph (DAG) whose vertices correspond to the random variables. θ describes a conditional distribution for each variable, given its parents in G.

17 Bayesian networks illustration The network structure implies that the joint distribution has the product form P(A,B,C,D,E) = P(A)P(B|A,E)P(C|B)P(D|A)P(E) A graph-based model that captures properties of conditional independence between variables. Encodes the Markov assumption: Each variable is independent of its non-descendants, given its parents in the graph.
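To make the product form concrete, the sketch below evaluates P(A,B,C,D,E) = P(A)P(B|A,E)P(C|B)P(D|A)P(E) for binary variables. Only the graph structure comes from the slide; every probability value and every name in the code is a made-up placeholder.

```python
from itertools import product

# Hypothetical parameters; each table stores P(variable = 1 | parents).
p_a1 = 0.3
p_e1 = 0.6
p_b1_given_ae = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.7, (1, 1): 0.9}
p_c1_given_b = {0: 0.2, 1: 0.8}
p_d1_given_a = {0: 0.4, 1: 0.9}


def bern(p_one, value):
    """Probability of `value` (0 or 1) when P(value = 1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(a, b, c, d, e):
    """Joint probability as the product of one local term per node."""
    return (bern(p_a1, a)
            * bern(p_b1_given_ae[(a, e)], b)
            * bern(p_c1_given_b[b], c)
            * bern(p_d1_given_a[a], d)
            * bern(p_e1, e))


# The 32 joint probabilities sum to one, as any valid distribution must.
total = sum(joint(*x) for x in product((0, 1), repeat=5))
```

The factorization needs 1 + 1 + 4 + 2 + 2 = 10 parameters here, against 31 for an unconstrained joint table over five binary variables.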

18 Learning network from data Find the parameters: From Bayes rule we have: And:
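The formulas of this slide are not present in the transcript. Assuming the usual Bayesian structure-learning setup, with D the data, M a network structure and θ its parameters (the notation P(D|M) matches the marginal likelihood referred to on the drawbacks slide), the two relations could plausibly be written as:

```latex
P(M \mid D) \;=\; \frac{P(D \mid M)\, P(M)}{P(D)} \;\propto\; P(D \mid M)\, P(M),
\qquad
P(D \mid M) \;=\; \int P(D \mid \theta, M)\, P(\theta \mid M)\, d\theta
```

The first relation is Bayes rule applied to structures; the second integrates the parameters out to obtain the marginal likelihood of a structure.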

19 Bayesian networks properties Can capture many types of relationships between variables. Owing to their probabilistic nature, BN algorithms are capable of handling noisy data. Effective handling of hundreds of variables.

20 BN model genetic networks BNs were first applied to this problem by Friedman et al. (2000), Pe’er et al. (2001) and Hartemink et al. (2001). At a qualitative level, the structure of a BN describes the relationships between variables in the form of conditional independence relations. At a quantitative level, relationships between the interacting agents are described by conditional probability distributions. The probabilistic nature of this approach is capable of handling noise inherent in both the biological processes and the microarray experiments.

21 Bayesian networks drawbacks Networks with the same skeleton but different edge directions can have the same marginal likelihood P(D|M). Thus it’s impossible to distinguish between them on the basis of the data. The acyclicity constraint rules out recurrent structures. However, feedback is an essential feature of biological systems.

22 Dynamic Bayesian networks illustration Left: recurrent network comprising two genes with feedback that interact with each other. An interaction between genes is not instantaneous; its effect happens with a time delay after its cause. Right: equivalent DBN obtained by unfolding the recurrent network in time. (Husmeier 2003)

23 NetworkInference – Search strategy Identifying the highest-scoring network has been shown to be NP-hard (Chickering 1996). NetworkInference uses simulated annealing (with an extension for re-annealing to avoid becoming trapped in local maxima). NetworkInference also allows the prior over network structures to be modified by specifying sets of links that are required to be present or required to be absent.

24 Simulated annealing overview The concept is based on the process of annealing: a melt is cooled slowly so that the system is approximately in thermodynamic equilibrium at all times. If the change in energy is negative, the new configuration is accepted. If the change in energy is positive, it is accepted with a certain probability. This process is repeated enough times to give good sampling statistics for the current temperature; the temperature is then decremented and the process repeated until a frozen state is reached. If the initial temperature of the system is too low, or cooling is not done slowly enough, the system may become trapped in a local minimum energy state.
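The loop described on this slide can be sketched generically. This is not the NetworkInference implementation: the acceptance probability exp(-ΔE/T) is the usual Metropolis choice (the slide only says "a certain probability"), the geometric cooling schedule and all parameter values are assumptions, and the toy energy function is purely illustrative.

```python
import math
import random


def simulated_annealing(energy, neighbour, state, t_start=10.0, t_end=1e-3,
                        cooling=0.95, sweeps=50, rng=random):
    """Minimise `energy`; returns the best state seen and its energy."""
    current, e_cur = state, energy(state)
    best, e_best = current, e_cur
    t = t_start
    while t > t_end:
        for _ in range(sweeps):  # sample repeatedly at the current temperature
            cand = neighbour(current, rng)
            e_cand = energy(cand)
            d_e = e_cand - e_cur
            # Downhill moves are always accepted; uphill moves are accepted
            # with probability exp(-dE / T) (assumed Metropolis criterion).
            if d_e <= 0 or rng.random() < math.exp(-d_e / t):
                current, e_cur = cand, e_cand
                if e_cur < e_best:
                    best, e_best = current, e_cur
        t *= cooling  # decrement the temperature (assumed geometric schedule)
    return best, e_best


# Toy usage: minimise (x - 3)^2 starting far from the minimum.
best_state, best_energy = simulated_annealing(
    energy=lambda x: (x - 3.0) ** 2,
    neighbour=lambda x, r: x + r.uniform(-1.0, 1.0),
    state=-10.0,
    rng=random.Random(1),
)
```

For network search, a state would be a graph, `neighbour` would add, delete or reverse an edge, and `energy` would be the negative network score.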

25 Simulated annealing illustration (figure; source: Computational Science Education Project)

26 NetworkInference – Results Sampling rate of 5: recovery 89±0.1%, correctness 98±0.1%. Sampling rate of 10: recovery 27±0.3%, correctness 30±0.4%.

27 Smith et al. (2002) - discussion It is possible to sort out noise and distracters from a specific regulatory network of interest. It is possible to evaluate, and thus develop, inference algorithms by using simulated data. Shows how sampling influences the inference. The simulator and the inference are based on distinct models. A very simple model was used for the simulator. The inference procedure was tested on an unrealistically large training set. The conclusions are not convincing enough.

28 Husmeier (2003) - abstract The objective of this study is to test the viability of the BN paradigm in a realistic simulation study. First, gene expression data are simulated from a realistic biological network. Then, interaction networks are inferred from these data using DBN and Bayesian learning with Markov Chain Monte Carlo.

29 Inference - MCMC One has to resort to heuristic optimization methods. Markov Chain Monte Carlo (Chib and Greenberg 1995): Given a network structure M_old, a new structure M_new is proposed, with proposal probability Q(M_new | M_old). (Husmeier 2003)
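The acceptance rule for a proposed structure does not appear in the transcript. A minimal sketch of the standard Metropolis-Hastings rule is given below, on the assumption that this is the rule meant; the function names and the `propose` interface are hypothetical, and all quantities are handled in log space for numerical stability.

```python
import math
import random


def accept_probability(log_post_new, log_post_old,
                       log_q_forward=0.0, log_q_backward=0.0):
    """Metropolis-Hastings acceptance probability for a move M_old -> M_new.

    log_post_*     : log[P(D|M) P(M)], the unnormalised log posterior
    log_q_forward  : log Q(M_new | M_old)
    log_q_backward : log Q(M_old | M_new)
    """
    log_ratio = (log_post_new - log_post_old) + (log_q_backward - log_q_forward)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


def mh_step(m_old, propose, log_post, rng=random):
    """One MCMC step; `propose` returns (m_new, log_q_forward, log_q_backward)."""
    m_new, log_q_fwd, log_q_bwd = propose(m_old, rng)
    a = accept_probability(log_post(m_new), log_post(m_old), log_q_fwd, log_q_bwd)
    return m_new if rng.random() < a else m_old
```

With a symmetric proposal the two Q terms cancel and the rule reduces to comparing posterior scores.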

30 DBN settings To avoid an explosion of the model complexity, parameters are tied such that the transition probabilities between time slices are the same. Edges within a time slice are not allowed. A limitation was set on the maximum number of edges converging on a single node. (Nachman 2004)

31 Simulation study Data are generated from a known Bayesian network. New networks are learned and compared with the true network.

32 Evaluating inference Sensitivity is the proportion of true edges that are recovered: TP / (TP + FN). Complementary specificity is the proportion of non-edges that are erroneously recovered as spurious edges: FP / (TN + FP). Plotting the resulting sensitivity against the corresponding complementary specificity gives the Receiver Operating Characteristic (ROC) curve.
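The two ratios can be computed for a sweep of thresholds over edge scores, giving one ROC point per threshold. This is an illustrative sketch: the function name, the edge representation as tuples and the example scores are all assumptions, and it presumes at least one true edge and one non-edge.

```python
def roc_points(edge_scores, true_edges, all_edges):
    """Return (complementary specificity, sensitivity) pairs, one per threshold."""
    positives = set(true_edges)
    negatives = set(all_edges) - positives
    points = []
    for thr in sorted(set(edge_scores.values()), reverse=True):
        # An edge is predicted when its score reaches the threshold.
        predicted = {e for e in all_edges if edge_scores.get(e, 0.0) >= thr}
        tp = len(predicted & positives)
        fp = len(predicted & negatives)
        fn = len(positives) - tp
        tn = len(negatives) - fp
        points.append((fp / (tn + fp), tp / (tp + fn)))
    return points


# Hypothetical example: four candidate edges, two of them true.
candidates = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]
truth = [("A", "B"), ("B", "C")]
scores = {("A", "B"): 0.9, ("A", "C"): 0.6, ("B", "C"): 0.4, ("C", "A"): 0.1}
points = roc_points(scores, truth, candidates)
```

Lowering the threshold moves along the curve from the lower left (few edges predicted) to the upper right (all edges predicted).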

33 Evaluation on synthetic data - structure The structure used is a sub-network of the yeast cell cycle, taken from Friedman et al. (2000). In a second series of simulations, 38 redundant nodes were added to this network as confounders, giving a total of 50 nodes.

34 Evaluation on synthetic data - distribution First conditional probability distribution: noisy regulation according to a binomial distribution. Excitation: P(on|on) = 0.9, P(on|off) = 0.1. Inhibition: P(on|on) = 0.1, P(on|off) = 0.9. Noisy XOR co-regulation: P(on|on,on) = P(on|off,off) = 0.1, P(on|off,on) = P(on|on,off) = 0.9. Second conditional probability distribution: stochastic interaction, where all parameters were chosen at random (binomial or trinomial).
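The noisy-regulation distributions can be written down as lookup tables and sampled directly. The probabilities below are those given on the slide; the table and function names are hypothetical, and the sampling routine is only a sketch of how a child node's state could be drawn given its parents.

```python
import random

# P(child = on | parent state(s)), values from the slide.
EXCITATION = {"on": 0.9, "off": 0.1}
INHIBITION = {"on": 0.1, "off": 0.9}
NOISY_XOR = {
    ("on", "on"): 0.1, ("off", "off"): 0.1,
    ("off", "on"): 0.9, ("on", "off"): 0.9,
}


def sample_child(cpd, parents, rng=random):
    """Draw the child's state given its parents' state(s)."""
    return "on" if rng.random() < cpd[parents] else "off"
```

For example, `sample_child(NOISY_XOR, ("on", "off"))` returns "on" about nine times out of ten.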

35 Evaluation on synthetic data – results (ROC curves; panels for Structure 1 and Structure 2 at T=100, T=30 and T=7) The solid line shows noisy regulation, the dashed line shows stochastic interaction.

36 And still, this is an upper bound: The model used for inference is identical to the simulator model. The true continuous signals are typically sampled at discrete time points, which loses information, especially if the sampling intervals are not matched to the relaxation times of the true biological processes. Gene expression ratios are typically discretized, which inevitably adds noise and causes further loss of information.

37 Evaluation on realistic simulated data The study applies the model regulatory network proposed by Zak et al. (2001); in contrast to that work, the system was augmented with 41 spurious, unconnected genes (giving a total of 50 genes), which were up- and down-regulated at random. (Husmeier 2003)

38 Realistic simulated data – two experiments The first experiment (closely following Zak’s study): Ligand was injected at time 1000 min. Then, 12 data points were collected over 4000 min at equidistant intervals. The second experiment (adopting a sampling strategy different from Zak’s): focuses on the time immediately after the external perturbation, when the system is in disequilibrium. The sampler therefore collected 12 data points over a shorter interval of only 500 min immediately after ligand injection, between times 1100 and 1600 min.

39 Inference results (figure panels: True, Learned) Solid arrows show true edges and dashed arrows represent spurious edges. Shown for the most restrictive prior (maximum fan-in = 2). The threshold θ was chosen so as to obtain the same number of edges between non-spurious nodes as in the true network.

40 Realistic simulated data – Results (ROC curves; panels for Experiment 1 and Experiment 2 at fan-in = 2, 3 and 4) Averaged over three MCMC simulations. The solid line shows the ROC curve obtained from gene expression data alone, while the dash-dotted line shows the ROC curve obtained when including sequence information.

41 Realistic simulated data – Discussion The inclusion of available prior knowledge improves the performance of the inference. Sampling over a long time interval, which covers the system in equilibrium, gives small areas under the ROC curves, and the low slope of the curves at the left-hand side implies that even the dominant true edges are obscured by a large proportion of spurious edges. Restricting the sampling to a short time interval, when the system is in a perturbed non-equilibrium state, gives significantly larger areas, and the larger slope of the curves implies that the dominant true edges are obscured by far fewer spurious edges.

42 Discussion The search for new genetic interactions is hard, but it is significantly more effective than a search from tabula rasa. The simulations quantify how much can be learned from the data in the described unfavourable situation. They demonstrate how the network inference performance varies with the training set size, the prior assumptions, the experimental sampling strategy and the inclusion of biological information.

43 References Friedman et al., “Learning the structure of dynamic probabilistic networks”, Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence. Friedman et al., “Using Bayesian Networks to Analyze Expression Data”, Journal of Computational Biology. Husmeier, D., “Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks”, Bioinformatics. Nachman et al., “Inferring quantitative models of regulatory networks from expression data”, Bioinformatics. Smith et al., “Evaluating functional network inference using simulations of complex biological systems”, Bioinformatics. Yu et al., “Advances to Bayesian network inference for generating causal networks from observational biological data”, Bioinformatics. Zak et al., “Local Identifiability: when can genetic networks be identified from microarray data?”, Proceedings of the Third International Conference on Systems Biology 2004.