NetProphet 2.0: Mapping Transcription Factor Networks by Exploiting Scalable Data Resources. Yiming Kang, Hien-Haw Liow, Ezekiel Maier, & Michael Brent, Washington University
Data generation is part of the algorithm. [Figure: two panels. Traditional Computer Science: axes Analysis Time vs. Data size; labels Free data, Algorithm, Analyze data. Systems Biology: axes Total time / cost vs. Data size (number of TFs); labels Algorithm, Generate, Analyze data.] In traditional computer science, we are taught to think of data as free and to design data analysis algorithms whose running time increases slowly as the amount of data increases. In systems biology, on the other hand, we MUST think of data generation as part of the algorithm and design algorithms whose total time, including data generation, is predictable and grows slowly as the amount of data needed increases. It is NOT sufficient to simply download a handful of existing data sets and show that an analysis algorithm performs well on them without considering the cost of generating a new problem instance.
Good and bad data sources for TF network mapping. [Slide diagram: Classic: gene expression profiling. NetProphet: TF knockdown/out. NetProphet 2: genome sequencing. Integrative: TF binding locations (ChIP).] From that perspective, there is good data and bad data. Among the good data sources I classify gene expression profiling, including profiling of cells in which a single transcription factor has been knocked down or knocked out, and genome sequencing. I classify TF binding locations determined by chromatin immunoprecipitation (ChIP) as bad data. Classic algorithms for TF network mapping, that is, determining the direct targets of each TF, use only gene expression data. Another approach, called “integrative”, uses all available data sources, including the expensive and unreliable ones, but that’s not what we want to do. A few years ago, we published NetProphet, which improves on the accuracy of classic methods by exploiting expression profiles from cells in which a single TF has been knocked down or knocked out. Today, I’m going to describe NetProphet 2, which further increases accuracy by making use of data that can be extracted from genome sequences.
NP 2.0 is based on three ideas. NetProphet 2 is based on three ideas.
Combining approaches to analyzing gene expression data improves accuracy. [Slide text: NetProphet 1.0: LASSO regression of target levels on TF RNA levels; probability that the target is differentially expressed when the TF is knocked down/out. NP 2: Bayesian Additive Regression Trees.] First, combining approaches to analyzing expression data improves accuracy. NetProphet 1.0 combined LASSO regression of target gene RNA levels on TF RNA levels with the probability that the target is differentially expressed when the TF is knocked down or out. NetProphet 2 adds another method for predicting target gene expression levels from TF RNA levels, called Bayesian additive regression trees, or BART.
TFs with similar DNA binding domains bind similar DNA sequences. Second, TFs whose DNA binding domains have similar amino acid sequences tend to bind similar DNA sequences, as shown in this figure for yeast TFs.
TF DBD sequence: share evidence by averaging TF-target scores, weighted by DBD similarity. To exploit this idea, we start with a score matrix in which rows correspond to TFs, columns to target genes, and entries to the likelihood that the TF directly regulates the target gene. We allow the row for each TF to borrow evidence from other TFs with similar DNA binding domains (DBDs) by replacing each row with a weighted sum of the other rows. The weights are calculated from the percent amino acid identity between the DNA binding domains corresponding to the two rows, as shown here for TF1. Rows for more similar TFs have greater weight.
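The evidence-sharing step described on this slide can be illustrated with a small sketch. It is not NetProphet 2's actual code: the function name `share_evidence`, the inclusion of each TF's own row (with 100% identity to itself), and the normalization of weights to sum to 1 are all assumptions made for illustration.

```python
import numpy as np

def share_evidence(scores, identity):
    """Replace each TF's row with a DBD-similarity-weighted average of rows.

    scores:   (n_tfs, n_targets) matrix of TF-target scores.
    identity: (n_tfs, n_tfs) percent amino acid identity between DBDs.
    Assumption: each TF also counts its own row, with identity 100.
    """
    scores = np.asarray(scores, dtype=float)
    weights = np.asarray(identity, dtype=float)
    # Normalize each row of weights so the result is a weighted average.
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Row i of the result mixes all score rows using TF i's weights.
    return weights @ scores

# Toy example: TF1 and TF2 have 50% DBD identity; TF3 is unrelated to both.
scores = [[1.0, 0.0],
          [0.0, 1.0],
          [0.5, 0.5]]
identity = [[100, 50, 0],
            [50, 100, 0],
            [0, 0, 100]]
shared = share_evidence(scores, identity)
```

In this toy input, TF1's row moves toward TF2's row, while TF3, which shares no DBD similarity with the others, keeps its original scores.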
TF binding specificities can be inferred from score matrices & promoter sequences. For each TF: (1) identify motifs in promoters that distinguish high-scoring from low-scoring targets (FIRE); (2) score all promoters for motif presence; (3) combine this score matrix with others. Finally, for each TF, we identify motifs in promoters that distinguish high-scoring from low-scoring target genes using an existing algorithm called FIRE. We then score all promoters in the genome for the presence of the inferred motif, creating a new score matrix that can be combined with the other score matrices.
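The promoter-scoring step can be sketched as follows. This covers only scoring a promoter for motif presence, not motif discovery with FIRE. The motif representation (a toy log-odds matrix), the best-window scoring rule, the forward-strand-only scan, and all names here are assumptions for illustration, not the method used in NetProphet 2.

```python
import numpy as np

BASES = "ACGT"

def best_motif_score(promoter, pwm):
    """Return the highest score of any window of the promoter.

    promoter: DNA string over A, C, G, T (forward strand only).
    pwm:      (width, 4) matrix of per-position scores, columns in BASES order.
    """
    width = pwm.shape[0]
    idx = [BASES.index(base) for base in promoter]
    best = -np.inf
    for start in range(len(idx) - width + 1):
        window = idx[start:start + width]
        score = sum(pwm[pos, base] for pos, base in enumerate(window))
        best = max(best, score)
    return best

# Hypothetical motif "TGA": +1 for the preferred base, -1 otherwise.
pwm = -np.ones((3, 4))
for pos, base in enumerate("TGA"):
    pwm[pos, BASES.index(base)] = 1.0

# Hypothetical promoters; one row of the new score matrix for this TF.
promoters = {"geneA": "CCTGACC", "geneB": "CCCCCCC"}
motif_scores = {gene: best_motif_score(seq, pwm) for gene, seq in promoters.items()}
```

Repeating this for every TF's inferred motif would give one row per TF, that is, a score matrix with the same shape as the others.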
Comparative evaluation by ChIP & binding potential on yeast and fly. We evaluated NP 2, along with a variety of other network mapping algorithms, on both yeast and fly, using both binding locations from ChIP and binding potential from known position weight matrices (PWMs) as the standards. One other algorithm, GENIE3, performed slightly better than NetProphet 2 on the fly data with the PWM standard. ARACNe came close on the fruit fly using the ChIP standard. But overall, NetProphet 2 stood out as the most accurate and consistent, especially when evaluated against the intersection of the ChIP and PWM standards.
Thanks to student co-authors Yiming Kang Hien-Haw Liow
Expression data: Combining approaches improves accuracy. [Slide diagram, NetProphet 1.0: regression coefficient (TF1, TF2, TF3 predict target); measure of differential expression (TF affects target1, target2, target3). Described in Haynes (2013).] We developed a method called NetProphet for mapping transcription factor networks by using expression profiles from TF-perturbation strains. NetProphet ranks all possible TF-target relations by a confidence score that combines a LASSO regression coefficient, which explains the expression level of a target as a function of the expression of TFs, with the log odds that the target is differentially expressed (DE) when the TF is deleted. NetProphet requires only expression data, which are easy to generate using commodity methods.
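To make the idea of ranking TF-target relations by a combined score concrete, here is a minimal sketch. The talk does not state how NetProphet combines the two quantities, so plain rank averaging is swapped in as a stand-in. The function names and toy numbers are hypothetical.

```python
import numpy as np

def rank_matrix(values):
    """Rank all entries of a matrix from 0 (lowest) to n-1 (highest)."""
    flat = np.asarray(values, dtype=float).ravel()
    ranks = np.empty(flat.size)
    ranks[np.argsort(flat)] = np.arange(flat.size)
    return ranks.reshape(np.shape(values))

def rank_edges(lasso_coef, de_log_odds):
    """Order (tf, target) index pairs from most to least confident.

    Assumption: confidence is the average of the rank of |LASSO coefficient|
    and the rank of the DE log odds. This is NOT NetProphet's formula.
    """
    combined = (rank_matrix(np.abs(lasso_coef)) + rank_matrix(de_log_odds)) / 2
    order = np.argsort(-combined, axis=None)
    tf_idx, target_idx = np.unravel_index(order, combined.shape)
    return list(zip(tf_idx.tolist(), target_idx.tolist()))

# Toy 2 TF x 2 target matrices (rows: TFs, columns: targets).
lasso_coef = [[0.8, 0.0],
              [-0.1, 0.3]]
de_log_odds = [[2.0, -1.0],
               [0.5, -3.0]]
ranked = rank_edges(lasso_coef, de_log_odds)
```

Here the pair (TF 0, target 0) ranks first because it has both the largest coefficient magnitude and the highest log odds of differential expression.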