1
Genome-wide Analysis of Gene Regulation
Berlin, 4th of May, 2005
Presentation by: David Rozado
2
Comparative analysis of methods for representing and searching for transcription factor binding sites
Robert Osada, Elena Zaslavsky and Mona Singh
Department of Computer Science & Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
3
Introduction
Identification of DNA binding sites for transcription factors
–An important step in unraveling the transcriptional regulatory network
Several approaches exist for searching for transcription factor binding sites
–Consensus sequences
–Position-specific scoring matrices
–Berg and von Hippel
–Centroid
These basic approaches can all be extended by incorporating:
–Pairwise nucleotide dependencies
–Per-position information content
The paper evaluates the effectiveness of the basic approaches and their extensions in finding binding sites for a transcription factor of interest
4
Datasets
68 regulatory proteins and their aligned DNA binding domains
Filters applied:
–Only proteins with at least four binding sites were considered
–Duplicate binding sites were removed in order to preserve the integrity of the leave-one-out cross-validation
–Each binding site was unambiguously located within the E. coli K-12 genome and extracted along with flanking regions on each side
This process left 35 transcription factors and 410 binding sites
–Average of 11.7 ± 8.5 sites per transcription factor
5
Notation
S: set of N DNA binding sites for a transcription factor
n_j(b): number of times base b appears in the j-th position in S
f_j(b): corresponding frequency
n(b): number of times base b appears overall in the N binding sites
f(b): overall frequency for base b
n_ij(b, d): number of times the ordered pair (b, d) occurs in positions i and j
f_ij(b, d): corresponding frequency
t_j: j-th base of the sequence t to be scored
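The notation above can be made concrete with a short sketch. The function names and the toy sites are hypothetical (they do not come from the paper), and the frequencies are plain ratios without pseudocounts.

```python
from collections import Counter

BASES = "ACGT"

def position_counts(sites):
    """n_j(b): one Counter per column j of the aligned sites."""
    length = len(sites[0])
    assert all(len(s) == length for s in sites), "sites must be aligned"
    return [Counter(s[j] for s in sites) for j in range(length)]

def position_frequencies(sites):
    """f_j(b) = n_j(b) / N for every column j and base b."""
    n = len(sites)
    return [{b: col[b] / n for b in BASES} for col in position_counts(sites)]

def overall_frequencies(sites):
    """f(b) = n(b) divided by the total number of bases in all sites."""
    total = Counter("".join(sites))
    size = sum(total[b] for b in BASES)
    return {b: total[b] / size for b in BASES}

def pair_counts(sites, i, j):
    """n_ij(b, d): counts of the ordered pair seen at positions i and j."""
    return Counter((s[i], s[j]) for s in sites)

# Hypothetical toy alignment, N = 4 sites of length 4.
sites = ["ACGT", "ACGA", "TCGT", "ACTT"]
n_j = position_counts(sites)
f_j = position_frequencies(sites)
f = overall_frequencies(sites)
n_01 = pair_counts(sites, 0, 1)
```

Here n_j[0]["A"] is the number of sites starting with A, and n_01[("A", "C")] is the number of sites with A at position 0 and C at position 1.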
6
Approaches for representing and searching for binding sites
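As an illustration of one of the listed approaches, the following is a minimal sketch of a standard log-odds position-specific scoring matrix. The slides do not give the exact formulas used in the paper, so the pseudocount of 1, the uniform background, and all function names are assumptions made for the example.

```python
import math

BASES = "ACGT"

def build_pssm(sites, background, pseudocount=1.0):
    """Per column j, store log2(f_j(b) / f(b)) with pseudocounted f_j(b)."""
    n = len(sites)
    pssm = []
    for j in range(len(sites[0])):
        column = {}
        for b in BASES:
            count = sum(1 for s in sites if s[j] == b)  # n_j(b)
            freq = (count + pseudocount) / (n + len(BASES) * pseudocount)
            column[b] = math.log2(freq / background[b])
        pssm.append(column)
    return pssm

def score(pssm, t):
    """Sum the log-odds entries selected by each base t_j of the window t."""
    return sum(column[t[j]] for j, column in enumerate(pssm))

def scan(pssm, sequence):
    """Score every window of the sequence; return (offset, score), best first."""
    width = len(pssm)
    hits = [(k, score(pssm, sequence[k:k + width]))
            for k in range(len(sequence) - width + 1)]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# Hypothetical toy data: four aligned sites and a uniform background (assumption).
sites = ["ACGT", "ACGA", "TCGT", "ACTT"]
background = {b: 0.25 for b in BASES}
pssm = build_pssm(sites, background)
hits = scan(pssm, "GGACGTGG")
```

In this toy sequence the best-scoring window starts at offset 2, where the sequence contains ACGT.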
7
Extension I - Pairwise correlations
A method for incorporating pairwise correlations should only take into account those pairs that act together in determining DNA–protein binding specificity
Such precise information is not always readily available
As an approximation, the focus is on pairwise correlations between bases that are nearby in sequence
The notion of scope is introduced to delimit which pairs are considered correlated
–A scope of one restricts correlated positions to adjacent pairs
–A scope of two considers both adjacent pairs and pairs separated by an intermediate base
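The notion of scope can be sketched as follows: a pair of positions (i, j) is considered only when j - i is at most the scope. The function names are hypothetical, and the pair-frequency table is just one way of collecting f_ij(b, d) for the pairs that fall within scope.

```python
from collections import Counter

def pairs_in_scope(length, scope):
    """Position pairs (i, j) with i < j and j - i <= scope."""
    return [(i, j)
            for i in range(length)
            for j in range(i + 1, min(i + scope, length - 1) + 1)]

def pair_frequencies(sites, scope):
    """f_ij(b, d) for every in-scope pair of positions, as nested dicts."""
    n = len(sites)
    table = {}
    for i, j in pairs_in_scope(len(sites[0]), scope):
        counts = Counter((s[i], s[j]) for s in sites)
        table[(i, j)] = {pair: c / n for pair, c in counts.items()}
    return table

# Hypothetical toy alignment.
sites = ["ACGT", "ACGA", "TCGT", "ACTT"]
adjacent = pairs_in_scope(4, 1)   # scope one: adjacent pairs only
wider = pairs_in_scope(4, 2)      # scope two: also pairs one base apart
f_pairs = pair_frequencies(sites, 1)
```

With sites of length 4, scope one yields three pairs and scope two yields five.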
8
Extension II - Information content
Information content (IC) is a concept based on the information-theoretic notion of entropy
In the current application, the entropy of a position expresses the number of bits necessary to describe the position in a binding site
The information content of a position is calculated by subtracting its entropy from the maximum possible entropy
The higher the information content, the more conserved (and presumably more important) the position
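The calculation described above can be sketched in a few lines. The sketch assumes a four-letter alphabet with a maximum entropy of log2(4) = 2 bits. The weighting function is only one plausible way to use IC as a per-position weight, not necessarily the scheme used in the paper, and all names are hypothetical.

```python
import math

def information_content(column_freqs, alphabet_size=4):
    """IC of one position: maximum entropy minus the position's entropy, in bits."""
    entropy = -sum(f * math.log2(f) for f in column_freqs.values() if f > 0)
    return math.log2(alphabet_size) - entropy

def ic_weighted_score(position_scores, ic_per_position):
    """Weight each positional score by the IC of its position (assumed scheme)."""
    return sum(w * s for w, s in zip(ic_per_position, position_scores))

# Hypothetical frequencies f_j(b) for three positions.
columns = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},      # fully conserved
    {"A": 0.5, "C": 0.5, "G": 0.0, "T": 0.0},      # two equally likely bases
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # no preference
]
ic = [information_content(col) for col in columns]
weighted = ic_weighted_score([1, 1, 1], ic)
```

A fully conserved position carries 2 bits, a position split evenly between two bases carries 1 bit, and a uniform position carries none, so it contributes nothing to the weighted score.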
9
Cross-validation testing and analysis
A common use of any of the methods described above would be to scan non-coding regions of a genome in order to find binding sites for a particular transcription factor
Such a framework is not easily applicable when we wish to evaluate and compare different methods
–The E. coli genome contains many as yet uncharacterized binding sites
–Predicted windows may correspond to true binding sites even if they are not annotated as such in the original dataset
Instead, a testing framework with sets of positive and negative examples is used
10
Cross-validation testing and analysis II
Leave-one-out cross-validation studies are conducted to evaluate a particular method
Suppose s belongs to a set S of known binding sites, each of length l, for a particular transcription factor TF
The method under consideration uses all the sites except s to build the binding site representation for TF, and scores s as well as a set of negative examples
The negative examples consist of all binding sites in the dataset except those known to be bound by TF
It is still possible that transcription factor TF can bind some of the negative examples
Nevertheless, s should be among the top-scoring sites in the overall pool
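The leave-one-out procedure can be sketched as a loop that holds out each site in turn, builds a model from the rest, and counts how many negative examples score at least as well as the held-out site. The match-counting scorer below is a toy stand-in chosen for brevity, not one of the paper's methods, and the sites and negatives are hypothetical.

```python
def build_match_model(training_sites):
    """Toy model: simply keep the training sites (stand-in, not the paper's)."""
    return list(training_sites)

def match_score(model, t):
    """Average number of positions at which t matches a training site."""
    matches = sum(sum(a == b for a, b in zip(s, t)) for s in model)
    return matches / len(model)

def leave_one_out_ranks(sites, negatives, build, score):
    """For each site s: train without s, then count negatives scoring >= s."""
    ranks = []
    for k, held_out in enumerate(sites):
        training = sites[:k] + sites[k + 1:]
        model = build(training)
        site_score = score(model, held_out)
        rank = sum(1 for neg in negatives if score(model, neg) >= site_score)
        ranks.append(rank)
    return ranks

# Hypothetical data: sites for one TF, negatives standing in for other TFs' sites.
sites = ["ACGT", "ACGA", "TCGT", "ACTT"]
negatives = ["GGGG", "TTTT", "ACGG"]
ranks = leave_one_out_ranks(sites, negatives, build_match_model, match_score)
```

A rank of 0 means no negative example scored as well as the held-out site. Any other representation can be plugged in through the build and score arguments.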
11
Comparing Methods
For each site s of a transcription factor under consideration, a rank in cross-validation testing is computed by counting how many negative examples score as well as or better than s
–A lower rank indicates better performance
To compare how well two methods perform, a Wilcoxon matched-pairs signed-ranks test is used
The number of times one method outperforms the other is compared with how often such an event would happen merely by chance, under the assumption that both methods perform equally well
A ROC curve is created for each individual leave-one-out test, and the average over all sites for that transcription factor is then computed
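The paired comparison can be illustrated by computing the two signed-rank sums that a Wilcoxon matched-pairs test is based on. This is a minimal sketch: it drops tied pairs, averages ranks over tied absolute differences, and stops short of computing a p-value. The per-site ranks are invented for the example.

```python
def signed_rank_sums(ranks_a, ranks_b):
    """Return (W_plus, W_minus) for paired per-site ranks of two methods."""
    diffs = [a - b for a, b in zip(ranks_a, ranks_b) if a != b]
    abs_sorted = sorted(abs(d) for d in diffs)

    def average_rank(value):
        # 1-based rank of |difference|, averaged over ties.
        first = abs_sorted.index(value) + 1
        last = first + abs_sorted.count(value) - 1
        return (first + last) / 2

    w_plus = sum(average_rank(abs(d)) for d in diffs if d > 0)
    w_minus = sum(average_rank(abs(d)) for d in diffs if d < 0)
    return w_plus, w_minus

# Hypothetical cross-validation ranks of the same five sites under two methods.
ranks_method_a = [0, 3, 5, 2, 7]
ranks_method_b = [1, 1, 2, 2, 3]
w_plus, w_minus = signed_rank_sums(ranks_method_a, ranks_method_b)
```

Because a lower rank is better, a large W_plus relative to W_minus suggests that method B tends to rank sites better than method A. For an actual p-value, a statistics library such as SciPy (scipy.stats.wilcoxon) would normally be used.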
12
Comparison of basic methods
13
ROC curves comparing performance when pairs are considered for Centroid
14
ROC curves comparing performance when pairs are considered for PSSM
15
ROC curves comparing Centroid-P with scope 2 using regular sites and sites with columns shuffled
16
Performance of methods based on averaged ranks per transcription factor
17
Conclusions
Using per-position information content to weight positional scores improves the performance of all methods
–Sometimes dramatically
Methods based on nucleotide matches, such as consensus sequences and Centroid, show statistically significant improvements when incorporating pairwise nucleotide dependencies
–Probabilistic methods, such as log-odds PSSMs, do not show statistically significant improvements when incorporating pairwise dependencies
The difference in performance between methods decreases substantially once information content and pairwise correlations have been incorporated
18
Making connections between novel transcription factors and their DNA motifs
Kai Tan,1 Lee Ann McCue,2 and Gary D. Stormo1,3
1Department of Genetics, Washington University School of Medicine, Saint Louis, Missouri 63110, USA; 2The Wadsworth Center, New York State Department of Health, Albany, New York 12201-0509, USA
19
Introduction
A computational method to connect novel transcription factors and DNA motifs in E. coli
The method takes advantage of three types of information to assign a DNA binding motif to a given TF
1. A distance constraint between a TF and its closest binding site in the genome (D_min information)
2. The phylogenetic correlation between TFs and their regulated genes (PC information)
3. A binding specificity constraint for TFs having structurally similar DNA-binding domains (FMC information)
The different types of information are combined to calculate the probability of a given TF–DNA-motif pair being a true pair
20
Distance constraint
Besides auto-regulation, it has been noticed in many cases that TFs and the genes they regulate are near each other in the genome
–This gives a distance constraint between the TF and its closest binding site in the genome
D_min_self is the distance between a TF gene and its own closest binding site in the genome
D_min_cross is the distance between a TF gene and the closest binding site for a different TF
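The two distances can be sketched as a minimum over binding-site coordinates. The coordinates below are invented, and the optional circular-genome handling is an assumption added for illustration (the slides do not say how distance is measured).

```python
def d_min(tf_position, site_positions, genome_length=None):
    """Smallest distance from a TF gene to any of the given binding sites.

    If genome_length is given, distance is measured around a circular
    chromosome (an assumption; omit it for plain linear distance).
    """
    best = None
    for position in site_positions:
        distance = abs(tf_position - position)
        if genome_length is not None:
            distance = min(distance, genome_length - distance)
        if best is None or distance < best:
            best = distance
    return best

# Hypothetical coordinates (base pairs).
tf_gene = 1000
own_motif_sites = [400, 1500, 9000]   # sites of the TF's own motif
other_motif_sites = [5200, 7600]      # sites of a different TF's motif

d_min_self = d_min(tf_gene, own_motif_sites)
d_min_cross = d_min(tf_gene, other_motif_sites)
```

In this toy case D_min_self (500) is much smaller than D_min_cross (4200), which is the pattern the distance constraint relies on.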
21
The phylogenetic correlation
TFs and their regulated genes tend to evolve concurrently
TFs and DNA motifs are connected through the correlation between their occurrences in a comparative analysis of multiple species
Two types of phylogenetic correlation (PC) distributions
–PC for true TF–DNA-motif pairs
–PC for false TF–DNA-motif pairs
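One way to picture a phylogenetic correlation is as the correlation between two presence/absence profiles across species, one for the TF and one for the motif. The use of the Pearson coefficient here is an assumption made for illustration (the slides do not specify the measure), and the profiles are invented.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / math.sqrt(var_x * var_y)

# Hypothetical profiles over five species: 1 = present, 0 = absent.
tf_profile = [1, 1, 0, 0, 1]
motif_profile = [1, 1, 0, 0, 1]      # co-occurs with the TF
unrelated_motif = [0, 0, 1, 1, 0]    # occurs only where the TF is absent

pc_true_like = pearson(tf_profile, motif_profile)
pc_false_like = pearson(tf_profile, unrelated_motif)
```

A TF and motif that are always gained and lost together give a correlation of 1, while perfectly complementary profiles give -1.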
22
Binding specificity constraint
TFs that are similar to one another are expected to bind sites that are more similar to each other than the sites bound by dissimilar TFs
Distribution of average similarity scores for motifs from the same family and from different families
23
Conclusions
Hypothesis: information concerning the connection of a TF to its DNA motif is carried in the genome sequences
TFs and their binding sites are often in similar genomic locations (D_min information)
TFs tend to evolve concurrently with their regulated genes (PC information)
TFs from the same structural family tend to have similar DNA motifs (FMC information)
24
Functional determinants of transcription factors in Escherichia coli: protein families and binding sites
M. Madan Babu and Sarah A. Teichmann
MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK
25
Introduction
DNA-binding transcription factors regulate expression of genes near to where they bind
These factors can be activators or repressors of transcription, or both
A fundamental question: what determines whether a transcription factor acts as an activator or a repressor?
–Protein–protein contacts
–Position of the DNA-binding domain in the protein primary sequence
–Altered DNA structure
–Position of its binding site on the DNA relative to the transcription start site
This work suggests that, in general, in E. coli, a transcription factor's protein family is not indicative of its regulatory function, but the position of its binding site on the DNA is
26
Domain Architectures for different TFs
28
Conclusions
Activators, repressors and dual regulators in E. coli belong to many of the same protein families and share some domain architectures
A transcription factor's regulatory role is not determined by protein structure or evolutionary relationships
–Transcription factors have evolved by duplication of an ancestral transcription factor, followed by a change in function through a shift in binding sites
A transcription factor's regulatory role is determined to a large extent simply by the position of the transcription factor binding site
–Activators have essentially only upstream binding sites
–More than two thirds of repressors have at least one downstream binding site