Yiming Kang, Hien-haw Liow, Ezekiel Maier, & Michael Brent

Presentation transcript:

NetProphet 2.0: Mapping Transcription Factor Networks by Exploiting Scalable Data Resources
Yiming Kang, Hien-haw Liow, Ezekiel Maier, & Michael Brent
Washington University

Data generation is part of the algorithm
[Figure: two plots. Traditional computer science: analysis time versus data size, with data treated as free. Systems biology: total time / cost versus data size (number of TFs), covering both generating and analyzing the data.]
In traditional computer science, we are taught to think of data as free and to design data analysis algorithms whose running time increases slowly as the amount of data increases. In systems biology, on the other hand, we MUST think of data generation as part of the algorithm and design algorithms whose total time, including data generation, is predictable and grows slowly as the amount of data needed increases. It is NOT sufficient to simply download a handful of existing data sets and show that an analysis algorithm performs well on them without considering the cost of generating a new problem instance.

Good and bad data sources for TF network mapping
[Figure: data sources added by each class of method. Classic: gene expression profiling. NetProphet: TF knockdown/out. NetProphet 2: genome sequencing. Integrative: TF binding locations (ChIP).]
From that perspective, there is good data and bad data. Among the good data sources I count gene expression profiling, including profiling of cells in which a single transcription factor has been knocked down or knocked out, and genome sequencing. I classify TF binding locations determined by chromatin immunoprecipitation (ChIP) as bad data. Classic algorithms for TF network mapping (that is, for determining the direct targets of each TF) use only gene expression data. Another approach, called "integrative", uses all available data sources, including the expensive and unreliable ones, but that's not what we want to do. A few years ago, we published NetProphet, which improves on the accuracy of classic methods by exploiting expression profiles from cells in which a single TF has been knocked down or knocked out. Today, I'm going to describe NetProphet 2, which further increases accuracy by making use of data that can be extracted from genome sequences.

NP 2.0 is based on three ideas
NetProphet 2 is based on three ideas.

Combining approaches to analyzing gene expression data improves accuracy
NetProphet 1.0: LASSO regression of target levels on TF RNA levels; probability that the target is differentially expressed when the TF is knocked down/out
NP 2: Bayesian Additive Regression Trees
First, combining approaches to analyzing expression data improves accuracy. NetProphet 1.0 combined LASSO regression of target gene RNA levels on TF RNA levels with the probability that the target is differentially expressed when the TF is knocked down or out. NetProphet 2 adds another method for predicting target gene expression levels from TF RNA levels, called Bayesian additive regression trees, or BART.

TFs with similar DNA binding domains bind similar DNA sequences
Second, TFs whose DNA binding domains have similar amino acid sequences tend to bind similar DNA sequences, as shown in this figure for yeast TFs.

TF DBD sequence: Share evidence
By averaging TF-target scores, weighted by DBD similarity
To exploit this idea, we start with a score matrix in which rows correspond to TFs, columns to target genes, and entries to the likelihood that the TF directly regulates the target gene. We allow the row for each TF to borrow evidence from other TFs with similar DNA binding domains (DBDs) by replacing each row with a weighted sum of the other rows. The weights are calculated from the percent amino acid identity between the DBDs corresponding to the two rows, as shown here for TF1. Rows for more similar TFs have greater weight.
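The evidence sharing described on this slide can be illustrated with a minimal sketch. It is not the NetProphet 2 implementation. The function name `share_evidence` and the input layout are hypothetical, and two details are assumptions the talk does not settle: the weights are normalized so that each new row is a weighted average, and a TF's own row takes part with its self-identity weight.

```python
def share_evidence(scores, dbd_identity):
    """Replace each TF's row with a DBD-similarity-weighted average of rows.

    scores: list of rows, one per TF; each row lists TF-target scores.
    dbd_identity: square list of lists; dbd_identity[i][j] is the percent
        amino acid identity between the DBDs of TF i and TF j.
    Assumption: weights are normalized per row and include the TF itself.
    """
    n_tfs = len(scores)
    n_targets = len(scores[0])
    shared = []
    for i in range(n_tfs):
        weights = dbd_identity[i]
        total = sum(weights)
        if total == 0:
            # No usable similarity information: keep the original row
            # (an assumption made for this sketch).
            shared.append(list(scores[i]))
            continue
        shared.append([
            sum(weights[j] * scores[j][t] for j in range(n_tfs)) / total
            for t in range(n_targets)
        ])
    return shared


# Toy example: TF1 and TF2 have identical DBDs, TF3 is unrelated.
scores = [[0.9, 0.1], [0.5, 0.5], [0.0, 1.0]]
dbd_identity = [[100, 100, 0], [100, 100, 0], [0, 0, 100]]
print(share_evidence(scores, dbd_identity))
```

In this toy case the rows for TF1 and TF2 are pulled toward each other, while the row for TF3, which has no similar TF, is unchanged.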

TF binding specificities can be inferred from score matrices & promoter sequences
For each TF: identify motifs in promoters that distinguish high-scoring from low-scoring targets (FIRE); score all promoters for motif presence; combine this score matrix with others.
Finally, for each TF, we identify motifs in promoters that distinguish high-scoring from low-scoring target genes using an existing algorithm called FIRE. We then score all promoters in the genome for the presence of the inferred motif, creating a new score matrix that can be combined with the other score matrices.
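The promoter-scoring step can be sketched as follows. The talk does not say how motif presence is scored, so this example assumes one common choice: the motif is represented as a log-odds position weight matrix (PWM) and each promoter is given the score of its best-matching window. The motif discovery step (FIRE) is not reproduced here; the function names and the toy motif are hypothetical.

```python
import math


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into a log-odds PWM.

    counts: list of dicts, one per motif position, mapping A/C/G/T to counts.
    Assumes a uniform background base frequency.
    """
    pwm = []
    for column in counts:
        total = sum(column.get(base, 0) for base in "ACGT") + 4 * pseudocount
        pwm.append({
            base: math.log2(
                ((column.get(base, 0) + pseudocount) / total) / background
            )
            for base in "ACGT"
        })
    return pwm


def best_site_score(promoter, pwm):
    """Score of the best-matching window; -inf if no window can be scored."""
    width = len(pwm)
    best = float("-inf")
    for start in range(len(promoter) - width + 1):
        window = promoter[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases such as N
        best = max(best, sum(pwm[k][window[k]] for k in range(width)))
    return best


def motif_score_row(promoters, pwm):
    """One row of a motif score matrix: a score per target gene."""
    return {gene: best_site_score(seq, pwm) for gene, seq in promoters.items()}


# Toy motif that prefers the sequence ACG.
pwm = log_odds_pwm([{"A": 10}, {"C": 10}, {"G": 10}])
print(motif_score_row({"gene1": "TTACGTT", "gene2": "TTTTTTT"}, pwm))
```

Repeating this for every TF gives a TF-by-target matrix with the same shape as the other score matrices.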

Comparative evaluation by ChIP & binding potential on yeast and fly
We evaluated NP 2, along with a variety of other network mapping algorithms, on both yeast and fly, using both binding locations from ChIP and binding potential from known PWMs as the standards. One other algorithm, GENIE3, performed slightly better than NetProphet 2 on the fly data with the PWM standard. ARACNe came close on the fruit fly using the ChIP standard. But overall, NetProphet 2 stood out as the most accurate and consistent, especially when evaluated against the intersection of the ChIP and PWM standards.
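A minimal sketch of this kind of evaluation is shown below. The talk does not state the accuracy metric, so this example assumes a simple one: the fraction of the k highest-scoring predicted TF-target edges that appear in a standard. The function name and all edges are made up for illustration.

```python
def top_k_precision(edge_scores, standard, k):
    """Fraction of the k top-ranked edges that are present in the standard.

    edge_scores: dict mapping (tf, target) pairs to predicted scores.
    standard: set of (tf, target) pairs supported by the standard.
    """
    ranked = sorted(edge_scores, key=edge_scores.get, reverse=True)[:k]
    if not ranked:
        return 0.0
    hits = sum(1 for edge in ranked if edge in standard)
    return hits / len(ranked)


# Illustrative data only.
predicted = {
    ("TF1", "gene1"): 0.9,
    ("TF1", "gene2"): 0.8,
    ("TF2", "gene1"): 0.3,
    ("TF2", "gene3"): 0.1,
}
chip_standard = {("TF1", "gene1"), ("TF2", "gene1")}
pwm_standard = {("TF1", "gene1"), ("TF1", "gene2")}
both_standards = chip_standard & pwm_standard  # supported by ChIP and PWM

for name, standard in [("ChIP", chip_standard),
                       ("PWM", pwm_standard),
                       ("ChIP and PWM", both_standards)]:
    print(name, top_k_precision(predicted, standard, k=2))
```

The intersection standard is the strictest of the three, because an edge counts only when both the ChIP and the PWM evidence support it.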

Thanks to student co-authors Yiming Kang and Hien-Haw Liow

Expression data: Combining approaches improves accuracy
NetProphet 1.0. Described in Haynes (2013).
[Figure: TFs (TF1, TF2, TF3) linked to targets (target1, target2, target3). Panel labels: "Regression coefficient" (TF predicts target) and "Measure of differential expression" (TF affects target).]
We developed a method called NetProphet for mapping transcription factor networks by using expression profiles from TF-perturbation strains. NetProphet ranks all possible TF-target relations by a confidence score that combines a LASSO regression coefficient, for explaining the expression level of a target as a function of the expression of TFs, with the log odds that the target is differentially expressed when the TF is deleted. NetProphet requires only expression data, which are easy to generate using commodity methods.