Experiments
We used our optimal frontier breadth-first search algorithm to learn an optimal Bayesian network over the 23-variable data set and compared it to the greedy search used previously [Yu]. Figures 2 and 3 show the learned networks.

Our Optimal Search Formulation
As suggested by Equation 2, learning an optimal Bayesian network consists of three phases, which we formulate as search problems.

Calculating Scores
Goal. Calculate MDL(X|U), the score of X using U as parents
Representation. AD-tree [Moore]
Search Strategy. Depth-first
AD Node. Records with U = u
Vary Node. Records with U = u, X = x
Successor. Instantiate a new X
Storage. Written to disk
[Figure: AD-tree showing an AD node N_u and a vary node N_{x,u}.]

Optimal Learning with Dynamic Programming
In a ChIP-Seq data set, we do not know the relationships among the variables, so we must learn them. Singh and Moore [2005] proposed a dynamic programming algorithm that learns an optimal Bayesian network, that is, one that minimizes the MDL score. The figure below shows the intuition behind the algorithm. Equation 2 expresses this recursively. Silander and Myllymaki [2006] refined the algorithm by reversing the process.

ChIP-Seq
We can measure the presence of a particular histone modification in cells using chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-Seq). The figure below shows the ChIP-Seq process.

The Epigenetic Code
The central dogma of molecular biology (roughly) states that DNA is transcribed into RNA, which is translated into proteins. Proteins perform many of the functions in the body. We have the same DNA in most of our cells, yet those cells perform quite different functions. One reason for this differentiation lies in the epigenetic code. When DNA forms chromosomes, it packs together very tightly into a structure called chromatin. The DNA coils around a group of eight proteins called histones. Figure 1 summarizes chromatin packaging.
The histone proteins include a tail domain that is susceptible to a large number of post-translational modifications, which affect the attraction between histones. The attraction between histones can increase, tightening the surrounding chromatin and suppressing expression. Chromatin can also loosen, increasing expression. The combination of modifications present determines the effect on the chromatin structure. Some histone modifications affect the likelihood of other modifications. The epigenetic code hypothesis [Jaenisch] proposes that the combination of histone modifications, together with other features such as the presence of transcription factor binding sites, serves as a type of message about regulation to present and future generations of cells.

Selected References
Jaenisch, R. and A. Bird (2003). "Epigenetic regulation of gene expression: how the genome integrates intrinsic and environmental signals." Nature Genetics 33.
Schwarz, G. (1978). "Estimating the Dimension of a Model." The Annals of Statistics 6(2).
Barski, A., S. Cuddapah, et al. (2007). "High-resolution profiling of histone methylations in the human genome." Cell 129(4): 823-837.
Singh, A. P. and A. W. Moore (2005). Finding optimal Bayesian networks by dynamic programming (Technical Report). Carnegie Mellon University, 05-106.
Silander, T. and P. Myllymaki (2006). "A simple approach for finding the globally optimal Bayesian network structure." Proceedings of the 22nd Annual Conference on Uncertainty in Artificial Intelligence (UAI-06), AUAI Press.
Yu, H., S. Zhu, et al. (2008). "Inferring causal relationships among different histone modifications and gene expression." Genome Research 18(8).
Yuan, C., B. Malone and X. Wu (2011). "Learning Optimal Bayesian Networks using A* Search." Proceedings of the 22nd International Joint Conference on Artificial Intelligence.

Seq-ing the Epigenetic Code with Exact Bayesian Network Structure Learning
Brandon M. Malone 1,2, Changhe Yuan 1, Eric Hansen 1 and Susan M. Bridges 1,2
1 Department of Computer Science & Engineering, Mississippi State University
2 Institute for Genomics, Biocomputing and Biotechnology, Mississippi State University

Abstract
The epigenetic code hypothesis [Jaenisch] proposes that patterns of post-translational modifications to the histone core proteins, the presence of transcription factor binding sites and other genomic features influence expression of the associated DNA. Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing (ChIP-Seq) is frequently used to characterize these features at a genome-wide scale. Previous studies [Yu] have used approximation techniques to learn relationships among them. In this work, we apply a novel exact Bayesian network learning algorithm to learn a network structure that identifies regulatory relationships among a set of epigenetic features in human CD4 cells [Barski]. Comparison with networks learned using greedy methods reveals that our network identifies more biologically relevant relationships. Because we apply an exact, optimal learning algorithm instead of an approximate, greedy one, the relationships we learn are unaffected by uncertainty stemming from the structure learning algorithm.

Bayesian Networks
Representation. Joint probability distribution over a set of variables
Structure. Directed acyclic graph storing conditional dependencies. Vertices correspond to variables. Edges indicate relationships among variables.
Parameters. Conditional probability tables quantifying the relationships
Scoring. Minimum Description Length (MDL) [Schwarz], Equation 1

Acknowledgments
This material is based on work supported by the National Science Foundation under Grants No. NSF EPS and NSF IIS

[Figure: the ChIP-Seq process [Illumina].]
Raw DNA.
The DNA is sheared into pieces around 200 bp in length.
The pieces are immunoprecipitated against an antibody to extract the desired pieces.
The remaining pieces of DNA are sequenced.
The sequenced DNA is mapped back to the genome.
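
The MDL scoring named under Bayesian Networks can be illustrated with a short sketch. Equation 1 is not reproduced in this transcript, so the code below assumes the commonly used form of the local score: the conditional entropy term N·H(X|U) plus a penalty of (log N)/2 per free parameter, with natural logarithms. The function name `mdl_local`, its arguments and the toy records are hypothetical and are not taken from the poster.

```python
import math
from collections import Counter


def mdl_local(records, x, parents, arity):
    """Assumed local MDL score of variable x given a parent list (lower is better).

    records: list of tuples of discrete values, one tuple per observation
    x:       column index of the child variable
    parents: list of column indices used as parents
    arity:   number of distinct states of each column
    """
    n = len(records)
    # Counts N_{u,x} (parent configuration together with child value) and N_u.
    joint = Counter((tuple(r[p] for p in parents), r[x]) for r in records)
    marginal = Counter(tuple(r[p] for p in parents) for r in records)
    # N * H(X | U) = -sum over (u, x) of N_{u,x} * log(N_{u,x} / N_u)
    entropy = -sum(c * math.log(c / marginal[u]) for (u, _), c in joint.items())
    # Free parameters: (states of x - 1) for every parent configuration.
    k = (arity[x] - 1) * math.prod(arity[p] for p in parents)
    return entropy + 0.5 * math.log(n) * k


# Toy data: column 1 is an exact copy of column 0.
records = [(0, 0), (0, 0), (1, 1), (1, 1)]
arity = [2, 2]
with_parent = mdl_local(records, 1, [0], arity)
without_parent = mdl_local(records, 1, [], arity)
print(with_parent < without_parent)
```

In this toy case the parent explains the child completely, so the entropy term vanishes and the larger parameter penalty is more than paid for.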
[Figure: dynamic programming intuition. Node labels: Pol II, H3K36me, H3K9ac, H3K27me3, H3K4me3, Expr.]
The optimal Bayesian network structure is a DAG, so it has a leaf variable with no children. Remove that leaf and its edges from the network. The remaining subnetwork is also a DAG, so it has a leaf. Recursively find optimal leaves until an empty subnetwork remains.

Frontier Breadth-first Branch and Bound Search
The order graph has a very regular structure. The successors of a node in layer l always appear in layer l+1. This observation allows us to keep only two layers in memory rather than all n. Furthermore, we can compute a bound on the best score that any solution passing through a particular node can achieve. If that bound is worse than the score of a known solution, we can safely discard the node. If optimality is not needed, we can discard many more nodes to reduce running time.

Data Set and Preprocessing
Raw Data. 30 human ChIP-Seq experiments [Barski]
Cellular Environment. CD4 cells (specialized white blood cells)
Normalization. Linear regression against an IgG control data set
Discretization. Clustered genes using MDL for each experiment
Processed Data Set. A numeric array of length 30 for each gene

Results and Discussion
We focused on the transcription factor binding site for CTCF, which is known to play a role in the regulation of many elements. We expect CTCF to be an ancestor of important regulatory elements. In our network, CTCF is a parent of the five most highly connected regulatory elements. The approximate algorithm identified four parents and three children of intermediate degree for CTCF.

Identifying Optimal Parent Sets
Goal. Calculate BestScore(U, X), which selects the best parents of X from U
Representation. Sorted arrays and bit arrays
Search Strategy. On demand
Successor. Use bit operators to find scores consistent with U\Y
Score. scores[firstBit(usable(X))]
Storage. Arrays and bit sets

Learning Optimal Subnetworks
Goal. Calculate Score(U), the score of the best subnetwork over the variables U
Representation. Order graph [Yuan]
Search Strategy. Breadth-first
Node. Score(U) for some U
Successor. Use X as a leaf of U ∪ {X}
Score. Score(U) + BestScore(U, X)
Storage. Hash table or written to disk

Expand(U)
  For each X not in U
    newScore = U.score + BestScore(U, X)
    succ = get({U + X})
    if newScore < succ.score
      put({U + X}, newScore)

Figure 1. Chromatin packaging and histones.
[Equations (1) and (2)]
Figure 2. Learned structure with our optimal algorithm.
Figure 3. Learned structure with a standard greedy algorithm.

Conclusions
We presented a frontier breadth-first search algorithm for learning optimal Bayesian networks that improves the memory complexity from O(2^n) to O(C(n, n/2)). Provably optimal solutions allow us to focus on interpreting the results. We learned the optimal structure of a network of epigenetic features; it included more biologically meaningful relationships than structures learned with greedy search.

[Table: sorted parent-set scores for one variable, with bit rows marking which scores use variable 1 and which remain usable.]
parents              {1,2}  {2}  {1}  {1,3}  {3}  {}  {2,3}
scores
uses[1]                X          X     X
usable                 X     X    X     X     X    X    X
usable & ~uses[1]            X                X    X    X
Calculate and sort all of the scores for a variable. Mark which scores use each other variable (n-1 such bit arrays per variable). Initially, a variable can use all scores, and the first one is optimal. When X is used as a leaf, find the usable parent scores with (usable & ~uses[X]). The first set bit is optimal.

[Figure: order graph over four variables, from the empty set ϕ through layers of subsets of size 1, 2 and 3 to {1,2,3,4}.]
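
The Expand step and the bit-array parent lookup described above can be combined in a small sketch. It is a minimal illustration under stated assumptions, not the authors' implementation: the names `build_parent_index`, `best_score` and `learn_optimal` and the toy local scores are hypothetical, the branch-and-bound pruning is omitted, and each node also carries its chosen parent sets for readability, which a memory-conscious version would avoid. Variable sets are encoded as integer bit masks.

```python
def build_parent_index(local_scores, n):
    """For each variable, sort its parent-set scores and build the uses[] bit rows.

    local_scores[x] maps a parent bit mask to a score (lower is better).
    """
    index = []
    for x in range(n):
        ordered = sorted(local_scores[x].items(), key=lambda kv: kv[1])
        uses = [0] * n  # uses[y]: bit i is set if the i-th best parent set contains y
        for pos, (parents, _) in enumerate(ordered):
            for y in range(n):
                if (parents >> y) & 1:
                    uses[y] |= 1 << pos
        index.append((ordered, uses))
    return index


def best_score(index, x, candidates, n):
    """Best (parents, score) of x when parents must come from the mask `candidates`."""
    ordered, uses = index[x]
    usable = (1 << len(ordered)) - 1  # initially every score is usable
    for y in range(n):
        if y != x and not (candidates >> y) & 1:
            usable &= ~uses[y]  # drop scores whose parent set needs y
    first = (usable & -usable).bit_length() - 1  # position of the first set bit
    return ordered[first]


def learn_optimal(local_scores, n):
    """Layer-by-layer search over the order graph, keeping only two layers."""
    index = build_parent_index(local_scores, n)
    layer = {0: (0.0, {})}  # subset mask -> (score, chosen parent mask per variable)
    for _ in range(n):
        successors = {}
        for u, (score, chosen) in layer.items():
            for x in range(n):
                if (u >> x) & 1:
                    continue  # x is already in U
                parents, s = best_score(index, x, u, n)
                succ = u | (1 << x)
                new_score = score + s
                if succ not in successors or new_score < successors[succ][0]:
                    successors[succ] = (new_score, {**chosen, x: parents})
        layer = successors  # the previous layer is no longer needed
    return layer[(1 << n) - 1]


# Hypothetical local scores for three variables (bit i of a mask stands for variable i).
toy_scores = [
    {0b000: 10.0, 0b010: 6.0, 0b100: 9.0, 0b110: 7.0},
    {0b000: 8.0, 0b001: 5.0, 0b100: 8.5, 0b101: 6.0},
    {0b000: 4.0, 0b001: 3.5, 0b010: 3.0, 0b011: 3.2},
]
total, parent_sets = learn_optimal(toy_scores, 3)
print(total, parent_sets)
```

With these toy scores the search returns a total score of 17.0, in which variable 1 has no parents and is the single parent of both variable 0 and variable 2.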