Robust inference of biological Bayesian networks

Presentation transcript:

Robust inference of biological Bayesian networks Masoud Rostami and Kartik Mohanram Department of Electrical and Computer Engineering Rice University, Houston, TX Good morning everyone and thank you for attending my talk today. My name is Masoud Rostami. The title of my presentation is …

Outline Regulatory networks Inference techniques, Bayesian networks Quantization techniques Improving quantization by bootstrapping Results on SOS network Conclusions Here is a brief outline of the talk. I begin by introducing regulatory networks. Then we discuss techniques used to infer regulatory networks from microarray data; among them, Bayesian networks are one of the most widely used. It is well known that quantization influences the quality of the inferred network. We review common quantization techniques and then introduce a technique based on bootstrapping that improves quantization and enhances the quality of the inferred network. We show its efficiency by applying it to the SOS network, a network whose true structure is known. Then I'll conclude and discuss directions for future work.

Gene regulatory networks Cells are controlled by gene regulatory networks Microarray shows gene expression Relative expression of genes over a period of time Reverse engineering to find the underlying network May be used for drug discovery Pros Large amount of data in public repositories Cons Data-point scarcity High levels of noise Biochemical reactions in cells are controlled by gene regulatory networks. These networks respond to external perturbation by regulating protein production. Microarray technology is used to study the relative expression of genes over a period of time. This information is then used to infer the underlying network by reverse-engineering techniques. These networks may later be used for drug discovery. Lots of data is now available in public repositories, but the time samples of gene expression are still scarce. Besides, the data is inherently noisy. So, efficient techniques for network inference are of the utmost interest.

Network inference Several techniques to infer with different models Bayesian networks Dynamic Bayesian networks Neural networks Clustering Boolean networks Question of accuracy, stability, and overhead No consensus Bayesian networks have solid mathematical foundation Several techniques have been proposed for inference. The network may be modeled by a BN or its kin, the DBN. One may use neural networks, clustering, or Boolean networks. All of them make different abstractions of the data and have different trade-offs in accuracy, stability, and computational overhead. There is still no silver bullet for inference. Here, we focus on BNs because they have a solid mathematical foundation, many tools and algorithms have been developed for them, and they are widely studied in this field.

Bayesian networks Directed acyclic graph with annotated edges Structure Parameters Product of conditional probabilities NP-hard A fitness score is assigned to candidates Score: how likely the candidate generated the data A BN is a directed acyclic graph with annotated edges. In the context of gene regulatory networks, nodes are genes and … The BN converts the joint statistical distribution of all variables into a product of conditional probabilities. Finding the best network is NP-hard. So, a fitness score is assigned to candidate graphs, and a search algorithm tries to find a graph with the best score. The probability that the candidate graph generated the data is usually taken as the score.
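As a concrete illustration of that factorization (a toy sketch of ours, not from the talk), consider a two-gene network A → B with made-up conditional probability tables; the joint probability of any assignment is the product of each node's conditional probability given its parents:

```python
def joint_probability(assignment, parents, cpts):
    """Joint probability of a full assignment, computed as the product of
    each node's conditional probability given its parents (BN factorization)."""
    p = 1.0
    for node, value in assignment.items():
        parent_vals = tuple(assignment[q] for q in parents[node])
        p *= cpts[node][parent_vals][value]
    return p

# Toy two-gene network: A regulates B (all numbers are illustrative).
parents = {"A": [], "B": ["A"]}
cpts = {
    "A": {(): {0: 0.6, 1: 0.4}},
    "B": {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.2, 1: 0.8}},
}
# P(A=1, B=1) = P(A=1) * P(B=1 | A=1) = 0.4 * 0.8
```

Summing the joint probability over all assignments returns 1, a quick sanity check that the tables are consistent.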

Bayesian networks Heuristics to find the best score Simulated annealing Hill-climbing Evolutionary algorithms No notion of time steps It needs discrete data At most ternary Due to scarce data How to quantize data? The search algorithm for the highest-scoring graphs can be … Overall, a BN has no notion of time, and its inputs must be discrete values. Because the number of required time samples grows super-exponentially with the number of quantization levels, the data is usually binary or at most ternary. Now the question is: how should we quantize the data? That is the focus of this project.
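To illustrate the kind of search heuristic listed above, here is a minimal greedy hill-climbing sketch (our own illustration, not the talk's actual algorithm): starting from an empty graph, it repeatedly adds an edge that improves a caller-supplied score, rejecting any addition that would create a cycle. Real tools such as Banjo use far richer scores and move sets.

```python
from itertools import permutations

def is_acyclic(nodes, edges):
    """Kahn's algorithm: True if the directed graph contains no cycle."""
    indeg = {v: 0 for v in nodes}
    for _, v in edges:
        indeg[v] += 1
    queue = [v for v in nodes if indeg[v] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for a, b in edges:
            if a == u:
                indeg[b] -= 1
                if indeg[b] == 0:
                    queue.append(b)
    return seen == len(nodes)

def hill_climb(nodes, score):
    """Greedily add the first edge that improves the score, keeping the
    graph acyclic; stop when no single addition helps."""
    edges, best = set(), score(set())
    improved = True
    while improved:
        improved = False
        for u, v in permutations(nodes, 2):
            if (u, v) in edges or not is_acyclic(nodes, edges | {(u, v)}):
                continue
            s = score(edges | {(u, v)})
            if s > best:
                best, edges, improved = s, edges | {(u, v)}, True
                break
    return edges, best
```

With a score that rewards edges of a hypothetical target network, the search recovers that network exactly on small examples.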

Quantization Should be smoothed? (remove spikes) Mean? Median? (quantile quantization) More robust to outliers (max+min)/2? (interval quantization) … Can we extract as much information as possible? Should the data be smoothed prior to quantization to remove spikes? Should we use the mean as the quantization threshold? Why not the median? What about the midpoint between the maximum and minimum? As taught in statistics courses, the median is the most robust estimator against outliers, and we have found that it performs better. But can we extract as much information as possible from the data with these common quantization techniques?
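The trade-off between these thresholds is easy to see on a toy series with a single noise spike (the numbers below are illustrative, not from the paper): the mean and the interval midpoint are dragged toward the outlier, while the median barely moves.

```python
import statistics

def candidate_thresholds(series):
    """The three candidate quantization thresholds from the slide."""
    mean = statistics.mean(series)
    median = statistics.median(series)          # quantile quantization
    midpoint = (max(series) + min(series)) / 2  # interval quantization
    return mean, median, midpoint

def quantize(series, threshold):
    """Binary quantization: 1 if a point lies above the threshold."""
    return [1 if x > threshold else 0 for x in series]

# A toy expression series with a single noise spike (9.0).
series = [1.0, 1.2, 0.9, 1.1, 9.0, 1.0, 0.8]
mean, median, midpoint = candidate_thresholds(series)
# mean ~= 2.14 and midpoint = 4.9 are dragged up by the spike;
# the median stays at 1.0, so the up/down pattern survives quantization.
```

Quantizing with the mean maps every point except the spike to '0', erasing the waveform's structure; the median keeps it.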

An example Method of quantization impacts the inferred network Here you can see an example of real data extracted from a public repository. First, note that only a few time points are available. There are some time points in the figure where you cannot be sure how to quantize them: should you assign them ‘0’ or ‘1’? Due to the scarcity of data, your choice of quantization method has a huge impact on the inferred network. The data is also noisy, which may even make the process completely unstable. [1] GDS1303 [ACCN], GEO database

Time-series Each sample is dependent on its neighbor Gene expression samples are dependent Data does have some structure (it’s a waveform) Common quantization removes this information The other often-missed characteristic of microarray data is that they are time series, something that has been neglected in the literature so far. Gene expression samples taken over the course of hours are statistically correlated, just as the weather in Anaheim at 3 PM and 4 PM is correlated. However, all of those quantization techniques simply miss this information. Gene expression is a waveform and has some implied structure of ups and downs, and we should use it in our quantization. So, how can we preserve this information?

Better inference Artificial ways to increase samples Represent each sample n times Takes ‘0’ and ‘1’ according to the probability 10 times, p(‘1’) = 0.20 2 times ‘1’, 8 times ‘0’ Adds computational overhead How to quantify probability Use correlation information Noise model? So, here come our contributions. We looked into artificial ways to increase the number of samples available to us so that, in the end, we infer a more accurate network. We represent each sample n times in the quantized dataset. These n copies take ‘0’ and ‘1’ according to the probability of the sample being ‘0’ or ‘1’. For example, if we repeat a sample 10 times and its probability of being ‘1’ is 20%, then 2 of those 10 copies are assigned ‘1’ and the rest ‘0’. Increasing the number of samples increases the computational complexity, but we'll show later that it is worth it. So, how should we find this probability using the correlation between samples? Do we need a noise model?
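The replication step the notes walk through can be sketched as follows (the function name and the rounding rule are our own illustration):

```python
def replicate_sample(p_one, n=10):
    """Represent one data point as n binary copies whose fraction of
    '1's matches p_one, the probability that the point quantizes to '1'."""
    ones = round(p_one * n)          # e.g. p_one = 0.20, n = 10 -> 2 ones
    return [1] * ones + [0] * (n - ones)
```

With p_one = 0.20 and n = 10, this yields two ‘1’s and eight ‘0’s, matching the example in the notes.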

Time-series Bootstrapping Bootstrapping generates artificial data from the original Artificial data is used to assess the accuracy Time-series bootstrapping preserves data structure [1] B. Efron, R. Tibshirani, “An introduction to the bootstrap”, chapter 8 The first step in finding the probabilities is generating artificial data by bootstrapping. Bootstrapping is a statistical process that generates instances of artificial data from the original data. Time-series bootstrapping is an extension of regular bootstrapping that is applied to time-series waveforms: it generates artificial waveforms from the original waveform while preserving its underlying characteristics. The details of time-series bootstrapping can be found in statistics books; a good one is cited here.
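Efron and Tibshirani describe several time-series variants; the moving-block bootstrap is one of the simplest and illustrates the idea (the block length and the choice of variant here are our assumptions, not necessarily what the authors used):

```python
import random

def moving_block_bootstrap(series, block_len=3, rng=None):
    """Generate one artificial series by resampling overlapping blocks.
    Concatenating short contiguous blocks preserves the local correlation
    between neighbouring time points, unlike the ordinary bootstrap."""
    rng = rng or random.Random(0)
    n = len(series)
    blocks = [series[i:i + block_len] for i in range(n - block_len + 1)]
    out = []
    while len(out) < n:
        out.extend(rng.choice(blocks))  # pick a random contiguous block
    return out[:n]
```

Each artificial waveform is a patchwork of short runs from the original, so it keeps the up/down structure of the data while still varying from resample to resample.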

Probability of ‘0’ and ‘1’ Find the threshold for each bootstrapped sample Gives distribution of quantization threshold Go back and quantize with the new set The consensus gives probability Benefits: Correlation information between samples preserved No need for a noise model After obtaining the artificial samples, we find the quantization threshold for each of them. This gives us a distribution of quantization thresholds. With this distribution in hand, we go back and quantize the original data using each of the obtained threshold values, which gives us a set of quantized samples; the consensus over all of them gives the probability. For example, we generate 1000 artificial waveforms and thus 1000 quantization thresholds. If an instance in the original dataset is higher than 20% of these thresholds, it is assigned ‘1’ 20% of the time and ‘0’ 80% of the time. So, we managed to preserve the correlation information between samples while avoiding any noise model.
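Putting the steps together, a sketch of the whole pipeline (again with our own block-bootstrap and parameter choices, which may differ from the authors') computes one median threshold per artificial waveform and then, for every original time point, the fraction of thresholds it exceeds:

```python
import random
import statistics

def bootstrap_probabilities(series, n_boot=1000, block_len=3, seed=0):
    """For each time point, return the fraction of bootstrapped median
    thresholds it exceeds, i.e. the probability it quantizes to '1'."""
    rng = random.Random(seed)
    n = len(series)
    blocks = [series[i:i + block_len] for i in range(n - block_len + 1)]
    thresholds = []
    for _ in range(n_boot):
        boot = []
        while len(boot) < n:           # moving-block bootstrap resample
            boot.extend(rng.choice(blocks))
        thresholds.append(statistics.median(boot[:n]))
    return [sum(t < x for t in thresholds) / n_boot for x in series]
```

Points far above every bootstrapped threshold get probability 1, points below them all get 0, and ambiguous points near the median get intermediate probabilities, which then drive the sample-replication step.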

SOS network 8 genes, 50 time-samples, 4 experiments The true network is known Now we apply the method to the SOS network, which consists of the expression of 8 genes over 50 time samples, repeated in 4 experiments. It is one of the best datasets available, and the true network is known.

polB, experiment 1, SOS [figure: gene expression vs. time] Here you can see the waveform of gene expression of ‘polB’ from experiment 1. The original data is red, and as you can see there are a couple of instances that are very close to the median of the waveform. So, the choice of quantization will severely impact the accuracy of the inferred network.

SOS, experiment-3, quantile quantization Bootstrapped Normal If the conventional quantization is performed using just the median, the inferred network will look like the left graph. If the time-series bootstrap quantization is used, the inferred network will look like the right graph. The dashed lines are false positives and the solid ones are correctly recovered edges. The red edges are true edges but with the wrong direction.

Results Banjo (15 min search) Consensus over top 5 scoring networks

Conventional    True edges   False edges   True direction
Exp1            2            11
Exp2            3            7
Exp3            1
Exp4                         9
Average                      7.5           0.75

Bootstrapped    True edges   False edges   True direction
Exp1            3            10            2
Exp2                         9
Exp3            5            8
Exp4            4
Average         3.75         8.75          1.75

Banjo from Duke University is used for BN inference. The search lasts 15 minutes, and the inferred network is the consensus over the top 5 scoring networks. The results of BN inference are shown for the 4 experiments of the SOS network. By using bootstrapped quantization, the number of discovered true edges increases, and the number of discovered true directions increases by almost two-fold, while the number of false edges increases by 17%.

Conclusions Networks inferred from time-series gene expression Bayesian network is one of the most common Data needs quantization Time-series information is lost in conventional methods Information is retrieved by bootstrap quantization No noise model Correlation information used Better accuracy in inference We saw that biological networks are inferred from microarray gene expression data. BN inference is one of the most common approaches. The data needs quantization for BN inference, but common quantization techniques do not take into account the correlation between time samples. We have proposed a quantization method based on time-series bootstrapping. It requires no noise model and uses the correlation information between samples. More accurate networks can be inferred by using this method.