McPromoter – an ancient tool to predict transcription start sites

Uwe Ohler, Institute for Genome Sciences and Policy, Duke University (BDGP/Univ Erlangen)

An extremely simplified view of eukaryotic transcription
Specific information about the functional context of genes: proximal promoter and enhancers. Binding sites of specific transcription factors confer activation at the right developmental stage or tissue.
General information: the core promoter. This is the region around the transcription start site (TSS) where RNA polymerase II (pol-II) interacts with general transcription factors. It is potentially far away from the translation start site.

Probabilistic modeling of promoters
Goal: find TSSs and proximal promoters ab initio.
Alternative to cDNA alignments; independent of, and in addition to, gene prediction.
Probabilistic modeling makes it possible to deal with uncertainty.
Models for classes of related sequences:
Models represent our knowledge about sequences in the form of parameters.
Parameters are automatically estimated from a representative set of sequences.
A model gives the probability that a sequence belongs to a class, here: promoter or non-promoter (coding, non-coding).

McPromoter system structure

Non-promoter classes: Stationary Markov chains
Probability of a sequence: P(s_1 ... s_n) = product over i of P(s_i | s_1 ... s_{i-1}).
Approximation: restrict the context to the last N symbols (N-th order chain), so each factor becomes P(s_i | s_{i-N} ... s_{i-1}).
Markov chain as a tree: every node corresponds to a context and contains a probability distribution.
Typical order: 6 (4,096 overall parameters).
Variations on Markov chains:
Variable order: leaves on different levels.
Interpolated: combination of parameter values from different levels.
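The N-th order approximation can be sketched in a few lines. This is a minimal illustration, not the McPromoter implementation: the class name MarkovChain, the add-one smoothing, and the log_odds helper are assumptions made for the example, and sequences are assumed to contain only A, C, G, and T.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


class MarkovChain:
    """Fixed-order Markov chain over DNA with add-one smoothing (an assumption)."""

    def __init__(self, order):
        self.order = order
        # counts[context][symbol] = how often symbol followed context in training
        self.counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))

    def train(self, sequences):
        for seq in sequences:
            for i in range(self.order, len(seq)):
                self.counts[seq[i - self.order:i]][seq[i]] += 1

    def prob(self, context, symbol):
        row = self.counts.get(context)
        if row is None:
            # Context never seen in training: fall back to a uniform distribution.
            return 1.0 / len(ALPHABET)
        return (row[symbol] + 1) / (sum(row.values()) + len(ALPHABET))

    def log_prob(self, seq):
        """Log-probability of seq, conditioned on its first `order` symbols."""
        return sum(
            math.log(self.prob(seq[i - self.order:i], seq[i]))
            for i in range(self.order, len(seq))
        )


def log_odds(seq, promoter_model, background_model):
    """Positive values favour the promoter class, negative the background."""
    return promoter_model.log_prob(seq) - background_model.log_prob(seq)


# An order-6 chain over four nucleotides has 4**6 = 4096 distinct contexts.
n_contexts = len(ALPHABET) ** 6
```

Comparing the log-probabilities of two such chains, one trained on promoters and one on non-promoter sequence, gives a simple class decision of the kind described on the modeling slide.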

Promoter model
Simple approach: Markov chain model.
Better: take structure into account with a generalized hidden Markov model. Each state contains a submodel for a specific promoter part, including an explicit length distribution. Interpolated Markov chains serve as submodels.
Ohler et al., Bioinformatics 1999, PSB 2000
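To make the idea of states with explicit length distributions concrete, here is a hedged sketch of a Viterbi-style segmentation over a left-to-right chain of states. It is not the published model: the State class and best_segmentation function are hypothetical names, and a zero-order emission table stands in for the interpolated Markov chain submodels to keep the example short. All emission probabilities are assumed to be strictly positive.

```python
import math
from dataclasses import dataclass


@dataclass
class State:
    name: str
    emissions: dict  # nucleotide -> probability (zero-order stand-in submodel)
    lengths: dict    # segment length -> probability (explicit length distribution)

    def log_emit(self, segment):
        return sum(math.log(self.emissions[c]) for c in segment)


def best_segmentation(seq, states):
    """Cut seq into one consecutive segment per state, maximising the log score.

    Returns (log score, [(state name, segment), ...]), or (-inf, []) when no
    combination of allowed lengths covers the whole sequence.
    """
    n, neg_inf = len(seq), float("-inf")
    # best[k][j]: best log score for the first k states covering seq[:j]
    best = [[neg_inf] * (n + 1) for _ in range(len(states) + 1)]
    back = [[0] * (n + 1) for _ in range(len(states) + 1)]
    best[0][0] = 0.0
    for k, state in enumerate(states, start=1):
        for j in range(n + 1):
            for length, p in state.lengths.items():
                i = j - length
                if i < 0 or p <= 0 or best[k - 1][i] == neg_inf:
                    continue
                score = best[k - 1][i] + math.log(p) + state.log_emit(seq[i:j])
                if score > best[k][j]:
                    best[k][j], back[k][j] = score, length
    if best[len(states)][n] == neg_inf:
        return neg_inf, []
    # Trace back the chosen segment lengths from the last state to the first.
    segments, j = [], n
    for k in range(len(states), 0, -1):
        length = back[k][j]
        segments.append((states[k - 1].name, seq[j - length:j]))
        j -= length
    segments.reverse()
    return best[len(states)][n], segments


# Two illustrative states with made-up parameters.
at_rich = State("AT-rich", {"A": 0.4, "T": 0.4, "C": 0.1, "G": 0.1}, {2: 0.5, 3: 0.5})
gc_rich = State("GC-rich", {"A": 0.1, "T": 0.1, "C": 0.4, "G": 0.4}, {2: 0.5, 3: 0.5})
score, segments = best_segmentation("TATGC", [at_rich, gc_rich])
```

The length term math.log(p) is what distinguishes this from a plain hidden Markov model, where segment lengths are implicitly geometric.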

Example: stat6 promoter

Evaluation of ENCODE regions
Similar problem to alternative splicing: alternative transcription start sites.
Traditionally, the window to count false positives has been very large (e.g., -2,000/+2,000), and close predictions within a large window are merged.
Evaluate on a per-gene basis, i.e. count a true positive if a prediction hits at least one of the annotated TSSs.
Second problem: False negatives. After GASP, counting only those predictions internal to the annotated transcripts is the de facto standard.
435 genes / 1,022 different TSSs.
Another problem: Circularity? (use of Eponine)
Reese et al., Genome Res (2000)
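The per-gene counting scheme can be sketched as follows. This is one possible reading, not the evaluation code used for the talk: the function names, the greedy merging rule (keep the first prediction of each cluster), and the single-strand coordinates are all assumptions.

```python
def merge_predictions(positions, distance):
    """Keep a prediction only if it lies more than `distance` nt after the last kept one."""
    merged = []
    for pos in sorted(positions):
        if not merged or pos - merged[-1] > distance:
            merged.append(pos)
    return merged


def evaluate_per_gene(predictions, genes, upstream, downstream):
    """Count genes hit by at least one prediction.

    genes maps a gene name to its list of annotated TSS positions
    (forward strand assumed). A prediction hits a gene if it falls in
    [tss - upstream, tss + downstream] for any of that gene's TSSs.
    Returns (set of genes hit, predictions that hit no window).
    """
    hit_genes, unmatched = set(), []
    for pred in predictions:
        matched = False
        for name, tss_list in genes.items():
            if any(tss - upstream <= pred <= tss + downstream for tss in tss_list):
                hit_genes.add(name)
                matched = True
        if not matched:
            unmatched.append(pred)
    return hit_genes, unmatched


# Made-up coordinates for illustration only.
genes = {"gene1": [1000, 6000], "gene2": [50000]}
predictions = merge_predictions([100, 150, 5000, 20000], distance=2000)
hit_genes, unmatched = evaluate_per_gene(predictions, genes, upstream=2000, downstream=2000)
```

Unmatched predictions are not automatically false positives: under the convention described above, only those falling inside annotated transcripts would be counted as such, and the rest would remain unknown.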

Results in the ENCODE region
Standard parameters, no repeat masking, merging predictions within 2,000 nt: 695 predictions.
Positive region -2,000/+2,000: 204 TP / 197 genes (sn 47%); FP (sp 73%); unknown
More stringent: -500/ TP (sn 39%) 101 FP (sp 63%)
Does it make sense to move towards a more detailed evaluation?
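The slide does not define its measures. A sketch under two assumptions: sensitivity is true positives divided by the 435 annotated genes, and specificity is the fraction of counted predictions that are correct, TP / (TP + FP), as is common in gene-finding evaluations. The first assumption reproduces the reported 47% from the 204 true positives.

```python
def sensitivity(true_positives, annotated_genes):
    """Fraction of annotated genes recovered (assumed definition)."""
    return true_positives / annotated_genes


def specificity(true_positives, false_positives):
    """Fraction of counted predictions that are correct (assumed definition)."""
    return true_positives / (true_positives + false_positives)


# 204 true positives against 435 annotated genes, as given on the slides.
sn_percent = round(100 * sensitivity(204, 435))
print(sn_percent)  # 47
```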

Thanks to...
Berkeley Drosophila Genome Project: Gerry Rubin, Martin Reese, Suzi Lewis
Erlangen – Institute for Computer Science: Heinrich Niemann, Stefan Harbeck, Georg Stemmer
