Inferring Cellular Networks Using Probabilistic Graphical Models Jianlin Cheng, PhD University of Missouri 2009.

Slides:



Advertisements
Similar presentations
Yinyin Yuan and Chang-Tsun Li Computer Science Department
Advertisements

. Context-Specific Bayesian Clustering for Gene Expression Data Yoseph Barash Nir Friedman School of Computer Science & Engineering Hebrew University.
Computational discovery of gene modules and regulatory networks Ziv Bar-Joseph et al (2003) Presented By: Dan Baluta.
METHODS FOR HAPLOTYPE RECONSTRUCTION
Combined analysis of ChIP- chip data and sequence data Harbison et al. CS 466 Saurabh Sinha.
CSE Fall. Summary Goal: infer models of transcriptional regulation with annotated molecular interaction graphs The attributes in the model.
D ISCOVERING REGULATORY AND SIGNALLING CIRCUITS IN MOLECULAR INTERACTION NETWORK Ideker Bioinformatics 2002 Presented by: Omrit Zemach April Seminar.
Bioinformatics Motif Detection Revised 27/10/06. Overview Introduction Multiple Alignments Multiple alignment based on HMM Motif Finding –Motif representation.
. Inferring Subnetworks from Perturbed Expression Profiles D. Pe’er A. Regev G. Elidan N. Friedman.
Introduction of Probabilistic Reasoning and Bayesian Networks
Rich Probabilistic Models for Gene Expression Eran Segal (Stanford) Ben Taskar (Stanford) Audrey Gasch (Berkeley) Nir Friedman (Hebrew University) Daphne.
From Sequence to Expression: A Probabilistic Framework Eran Segal (Stanford) Joint work with: Yoseph Barash (Hebrew U.) Itamar Simon (Whitehead Inst.)
Regulatory Motifs. Contents Biology of regulatory motifs Experimental discovery Computational discovery PSSM MEME.
04/02/2006RECOMB 2006 Detecting MicroRNA Targets by Linking Sequence, MicroRNA and Gene Expression Data Joint work with Quaid Morris (2) and Brendan Frey.
Seeing the forest for the trees : using the Gene Ontology to restructure hierarchical clustering Dikla Dotan-Cohen, Simon Kasif and Avraham A. Melkman.
Clustering short time series gene expression data Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph BIOINFORMATICS, vol
Genome-wide prediction and characterization of interactions between transcription factors in S. cerevisiae Speaker: Chunhui Cai.
Gene ontology & hypergeometric test Simon Rasmussen CBS - DTU.
Heuristic alignment algorithms and cost matrices
Transcription factor binding motifs (part I) 10/17/07.
Microarrays and Cancer Segal et al. CS 466 Saurabh Sinha.
Integrated analysis of regulatory and metabolic networks reveals novel regulatory mechanisms in Saccharomyces cerevisiae Speaker: Zhu YANG 6 th step, 2006.
6. Gene Regulatory Networks
Module Networks Discovering Regulatory Modules and their Condition Specific Regulators from Gene Expression Data Cohen Jony.
Regulatory element detection using correlation with expression (REDUCE) Literature search WANG Chao Sept 14, 2004.
Introduction to molecular networks Sushmita Roy BMI/CS 576 Nov 6 th, 2014.
Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca An Introduction to the Use of Bayesian Network to Analyze Gene Expression Data Cristina Manfredotti.
Systematic Analysis of Interactome: A New Trend in Bioinformatics KOCSEA Technical Symposium 2010 Young-Rae Cho, Ph.D. Assistant Professor Department of.
MATISSE - Modular Analysis for Topology of Interactions and Similarity SEts Igor Ulitsky and Ron Shamir Identification.
Genetic network inference: from co-expression clustering to reverse engineering Patrik D’haeseleer,Shoudan Liang and Roland Somogyi.
Gene Set Enrichment Analysis (GSEA)
A systems biology approach to the identification and analysis of transcriptional regulatory networks in osteocytes Angela K. Dean, Stephen E. Harris, Jianhua.
Using Bayesian Networks to Analyze Expression Data N. Friedman, M. Linial, I. Nachman, D. Hebrew University.
Learning Structure in Bayes Nets (Typically also learn CPTs here) Given the set of random variables (features), the space of all possible networks.
Networks and Interactions Boo Virk v1.0.
Outline Who regulates whom and when? Model Learning algorithm Evaluation Wet lab experiments Perspective: why does it work? Reg. ACGTGC.
Expectation Maximization and Gibbs Sampling – Algorithms for Computational Biology Lecture 1- Introduction Lecture 2- Hashing and BLAST Lecture 3-
Recitation on EM slides taken from:
Unraveling condition specific gene transcriptional regulatory networks in Saccharomyces cerevisiae Speaker: Chunhui Cai.
Gene expression analysis
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Identification of cell cycle-related regulatory motifs using a kernel canonical correlation analysis Presented by Rhee, Je-Keun Graduate Program in Bioinformatics.
Module networks Sushmita Roy BMI/CS 576 Nov 18 th & 20th, 2014.
Problem Limited number of experimental replications. Postgenomic data intrinsically noisy. Poor network reconstruction.
Chapter 11 Statistical Techniques. Data Warehouse and Data Mining Chapter 11 2 Chapter Objectives  Understand when linear regression is an appropriate.
Data Mining the Yeast Genome Expression and Sequence Data Alvis Brazma European Bioinformatics Institute.
Statistical Testing with Genes Saurabh Sinha CS 466.
Introduction to biological molecular networks
Cluster validation Integration ICES Bioinformatics.
Flat clustering approaches
Local Multiple Sequence Alignment Sequence Motifs
Integrated Genomic and Proteomic Analyses of a Systematically Perturbed Metabolic Network Science, Vol 292, Issue 5518, , 4 May 2001.
Module Networks BMI/CS 576 Mark Craven December 2007.
Discovering functional interaction patterns in Protein-Protein Interactions Networks   Authors: Mehmet E Turnalp Tolga Can Presented By: Sandeep Kumar.
Elucidating regulatory mechanisms downstream of a signaling pathway using informative experiments Discussion leader: Navneet Scribe: James Computational.
Nonlinear differential equation model for quantification of transcriptional regulation applied to microarray data of Saccharomyces cerevisiae Vu, T. T.,
Tutorial 8 Gene expression analysis 1. How to interpret an expression matrix Expression data DBs - GEO Clustering –Hierarchical clustering –K-means clustering.
Computational methods for inferring cellular networks II Stat 877 Apr 17 th, 2014 Sushmita Roy.
HW7: Evolutionarily conserved segments ENCODE region 009 (beta-globin locus) Multiple alignment of human, dog, and mouse 2 states: neutral (fast-evolving),
Identifying submodules of cellular regulatory networks Guido Sanguinetti Joint work with N.D. Lawrence and M. Rattray.
Inferring Regulatory Networks from Gene Expression Data BMI/CS 776 Mark Craven April 2002.
System Structures Identification
Learning Sequence Motif Models Using Expectation Maximization (EM)
1 Department of Engineering, 2 Department of Mathematics,
1 Department of Engineering, 2 Department of Mathematics,
1 Department of Engineering, 2 Department of Mathematics,
EXTENDING GENE ANNOTATION WITH GENE EXPRESSION
FUNCTIONAL ANNOTATION OF REGULATORY PATHWAYS
Network Inference Chris Holmes Oxford Centre for Gene Function, &,
Predicting Gene Expression from Sequence
Presentation transcript:

Inferring Cellular Networks Using Probabilistic Graphical Models Jianlin Cheng, PhD University of Missouri 2009

Bayesian Network Software T/bnsoft.html

Demo

Research in molecular biology is undergoing a revolution mRNA transcript quantities protein-protein protein-DNA interactions chromatin structure Protein quantities Protein localization Protein modification

Challenge Provide methodologies for transforming high- throughput heterogeneous data sets into biological insights about the underlying mechanisms Data is noisy Data integration Generate Hypothesis

Biological Networks – Gene Regulatory Networks

Signal Transduction Network

Protein Interaction Network

Model-Based Approaches VS Procedure Approaches Procedure: Binding sites – Gene expression. (a) cluster co-expressed genes to find common sites (b) group genes with similar binding sites and test if they are coexpressed Declarative: design a model that describes the relations between the two types of data. Learn parameter from data and make predictions

Probabilistic Models Stochasticity for measurement noise Learning Algorithms Select model that fits the actual observations Inference Make predictions Generate insights and hypothesis

Modeling Examples Hidden Markov Model for sequence analysis Probabilistic Graphical Model for cellular networks

Advantages Concise language for describing probability distributions over the observations Approaches to learning from data that are derived from basic well-understood principles Use of observations to fill in model details Provide principles for combining multiple local models into a joint global model Declarative nature provides an advantage to extend model to account for additional aspects of the system

Infer Gene Regulatory Network from Gene Expression Data

Model for gene expression and cis- regulatory elements Assumptions 1: genes can be partitioned into clusters of coexpressed genes, and the genes in each cluster have a typical expression level in each array. Assumption 2: arrays are partitioned into array clusters, which capture relevant biological context, and that the expression of a gene is roughly the same in the arrays that belong to the same array cluster

Random Variables X g,a, where g is an index over gene and a is an index over arrays GeneCluster g : denotes the cluster assignment of gene g ArrayCluster a denotes the cluster assignment of array a. Assumption: the expression of gene g in array a depends on the value of GeneClusterg and ArrayClustera

Regular Bayesian Networks

Conditional Distribution

Learning Models from Data Parameter estimation – maximum likelihood problem ( P(data| model)) Model selection: select among different model structures to find one that best reflects the dependencies in the domain. P(model | data)

The model just described can achieve high likelihood if the cluster and gene assignment partitions the original measurements into blocks with approximately uniform expression within each block

Expectation Maximization procedure that iterates between an E-step, which uses current parameters to find the probabilistic cluster assignment of genes and arrays, and an M-step, which re-estimates the distribution within each gene/array cluster combination on the basis of this assignment.

Co-Regulation A key regulation mechanism involves binding of transcription factors to promoter regions of genes. we aim to identify the transcription factor binding sites in the promoter region of genes that can explain observed co-expression.

Regulatory Model Rg,j as depending on the promoter sequence Seqg

Integration of Sequence and Expression Data The parameters of this conditional probability characterize the specific motif recognized by the transcription factor. This extension allows us to learn the characterization of the binding site while learning how its presence influences gene expression.

A crucial detail in building such a model is the representation of the conditional distributions associated with GeneClusterg. This distribution describes how the existence of binding sites in the promoter region determines (or predicts) what cluster the gene belongs to. The conditional probabilities explored so far involve fairly generic representation of decision trees (14) or additive votes (15).

Data integration a pair of interacting proteins are more likely to belong to the same coregulated cluster assume that active transcription factor binding sites should correspond to observations of transcription factor location data

Reconstruction of Regulatory Networks A key challenge in gene expression analysis is the reconstruction of regulatory networks. We then use tools for structure learning in Bayesian networks (22, 23) to determine the network architecture

Challenges The second and more difficult challenge is the biological interpretability of the results. Can we really distinguish regulation from coexpression? Do these methods discover direct or indirect regulation? How do unobserved posttranscriptional events affect the conclusions?

Module Networks The second and more difficult challenge is the biological interpretability of the results. Can we really distinguish regulation from coexpression? Do these methods discover direct or indirect regulation? How do unobserved posttranscriptional events affect the conclusions? (Nature Genetics)

A regulatory module is a set of genes that are regulated in convert by a shared regulation program. A regulation program specifies the behavior of the genes in the module as a function of the expression level of a small set of regulators

Procedure Inputs: a gene expression data set and a large precompiled set of candidate regulatory genes for the corresponding organism (independent of data set) containing both know and putative transcription factors and signal transduction molecules Goal: search for a partition of genes into modules and for a regulation program for each module

Output: a list of modules and associated regulation programs

Results: apply the method to Yeast gene expression data set consisting of 2355 genes and 173 arrays. Each inferred modules contained a functionally coherent set of genes (metabolic pathways, oxidative stress, cell cycle-related processes, etc) Many module has a match between predicted regulator and its known cis-regulatory binding motif.

The second and more difficult challenge is the biological interpretability of the results. Can we really distinguish regulation from coexpression? Do these methods discover direct or indirect regulation? How do unobserved posttranscriptional events affect the conclusions?

Experimental Verification The second and more difficult challenge is the biological interpretability of the results. Can we really distinguish regulation from coexpression? Do these methods discover direct or indirect regulation? How do unobserved posttranscriptional events affect the conclusions? The second and more difficult challenge is the biological interpretability of the results. Can we really distinguish regulation from coexpression? Do these methods discover direct or indirect regulation? How do unobserved posttranscriptional events affect the conclusions?

One Example

Row: genes Column: arrays

Evaluation of Module Content and Regulation Program We evaluate all 50 modules to test whether the proteins encoded by genes in the same module had related functions. We scored the functional/biological coherence of each module according to percentage of its genes covered by annotations. Most of modules had a coherence level above 50%.

Candidate regulators Compiled a set of 466 candidate regulators annotated in Yeast Genome and Proteome databases Use Yeast gene expression data set conisting of 173 microarrays that measure responsts to various stress conditions. We downloaded these data in log (base 2) ratio to control format from Stanford Microarry Database. Chose a subset of 2355 genes that have a significant change in gene expression under the measured stress conditions

Protein annotations: downloaded Gene Ontology and Munich Information center for Protein Sequence (MIPS) function and KEGG. Regulation program: context (upregulation, no change, down regulation). Regression tree (decision nodes and leaf nodes); the model semantics is that given a gene g in the module and an array a in a context, the probability of observing some expression value for a gene in array is governed by the normal distributino specified for the context.

Learning Module Networks In each iteration, the procedure searches for a regulation program for each module and then reassign each gene to the module whose program best predicts its behavior. Repeated until it converges. Search for the model with the highest score by using the EM algorithm.

M-Step: given a partition of genes into modules and learns the best regulation program (regression tree) fro each module. The regulation program is learned through a combinatorial search over the space of trees. The tree is grown from the root to its leaves. At any given node, the query that best partitions the gene expression into two distinct distribution is chosen.

E-step: given the inferred regulation programs, we determine the module whose associated regulation program best predicts each gene’s behavior. Select the module whose program gives the gene’s expresson profile the highest probability and re-assign the gene to this module. We initialize our modules to 50 clusters using Pcluster, a hierahical agglomerative clustering. We then applied the EM algorithm to this starting point, refining both the gene partitioin and the regulatory program.

Evaluating statistical significance of modules All of the statistical evaluations were done and visualized in GeneXPress. The tool can evaluate the output of any clustering program for enrichment of gene annotations and motifs

Annotation enrichment We associated each gene with the processes in which it participates. Resulted in 923 GO categories, 208 MIPS categories, and 87 KEGG pathways. For each module and for each annotation, we calculated the fration of genes in the module associated with that annotaiton and used the hypergeometric distribution to calculate a P value for this fraction.

Promoter Analysis We search for motifs (represented as Position- Specific Scoring Matrices) within 500 bp upstream of each gene. We downloaded TRANSFAC, containing 34 known function cis- regulatory motifs. We also use a motif finder to find 50 potentially novel motifs.

Motif Combination We searched for statistically significant occurences of motif pairs. We constructed a motif pair attribute, which assigns a “true” value for each gene if and on if both motifs of the pair are found in the upstream region of that gene. For each module and for each motif pair attribute, we calculated the fraction of genes in the module associated with that attribute and used the hypergeometric distribution to calculate a P value for this fraction.

Regulator Annotations We associate regulatros with annotaions and binding sites in he same way we assocte with these attributes to the modules. Because a regulator may regulate more than one module, its targets consist of the unioin of the genes in all modules predicted to be regulated by that regulator. We tested the targets of each regulator for enrichment of the same motifs and gene annotations as above using the hypergeometric P value.