McPromoter – an ancient tool to predict transcription start sites

Uwe Ohler, uwe.ohler@duke.edu
Institute for Genome Sciences and Policy, Duke University (BDGP/Univ Erlangen)

An extremely simplified view of eukaryotic transcription
Specific information about functional context of genes: proximal promoter/enhancers
  Binding sites of specific transcription factors confer activation at the right developmental stage or tissue
General information: the core promoter
  Region around the transcription start site (TSS) where RNA polymerase II (pol-II) interacts with general transcription factors
  Potentially far away from the translation start site

Probabilistic modeling of promoters
Goal: find TSS / proximal promoters ab initio
  Alternative to cDNA alignments
  Independent of and in addition to gene prediction
Probabilistic modeling lets us deal with uncertainty
Models for classes of related sequences
  Models represent our knowledge about sequences in the form of parameters
  Parameters are automatically estimated using a representative set of sequences
  Model gives the probability that a sequence belongs to a class, here: promoter or non-promoter (coding, non-coding)
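The last point, turning per-class model probabilities into a class decision, can be sketched with Bayes' rule. This is a minimal illustration, not McPromoter's actual decision rule: the function name, the log-likelihood values, and the uniform priors are all assumptions made for the example.

```python
import math

def class_posteriors(log_likelihoods, priors):
    """Turn per-class log P(seq | class) into posteriors P(class | seq)."""
    # Joint log-score per class: log P(seq | class) + log P(class).
    joint = {c: log_likelihoods[c] + math.log(priors[c]) for c in log_likelihoods}
    # Subtract the maximum before exponentiating (log-sum-exp) for stability.
    top = max(joint.values())
    norm = top + math.log(sum(math.exp(v - top) for v in joint.values()))
    return {c: math.exp(v - norm) for c, v in joint.items()}

# Hypothetical log-likelihoods of one sequence window under three class models.
scores = {"promoter": -410.2, "coding": -415.0, "non-coding": -413.1}
# Uniform priors are an assumption; real class frequencies are far from uniform.
priors = {"promoter": 1 / 3, "coding": 1 / 3, "non-coding": 1 / 3}

post = class_posteriors(scores, priors)
best = max(post, key=post.get)
print(best, round(post[best], 3))
```

Working in log space matters here because likelihoods of windows a few hundred bases long underflow ordinary floating point.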

McPromoter system structure

Non-promoter classes: Stationary Markov chains
Probability of a sequence
  Approximation: Restrict context to the last N symbols (N-th order chain)
Markov chain as tree
  Every node corresponds to a context
  Contains probability distribution
  Typical order: 6 (4,096 overall parameters)
Variations on Markov chains
  Variable Order: Leaves on different levels
  Interpolated: Combination of parameter values from different levels
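As a rough sketch of the N-th order approximation, the probability of a sequence s is taken as the product over positions i of P(s_i | s_{i-N} ... s_{i-1}). The code below estimates these conditional probabilities from counts and scores a sequence in log space. The pseudocount smoothing, the function names, and the fixed interpolation weights are assumptions for illustration; practical interpolated Markov chains choose their weights per context rather than globally. For DNA, order 6 gives 4^6 = 4,096 contexts, which appears to be where the figure on the slide comes from.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov_chain(sequences, order, pseudocount=1.0):
    """Estimate P(symbol | previous `order` symbols) from training sequences."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1.0

    def prob(context, symbol):
        ctx = counts.get(context, {})
        # Add-one style smoothing so unseen contexts still get a distribution.
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        return (ctx.get(symbol, 0.0) + pseudocount) / total

    return prob

def log_probability(seq, prob, order):
    """Sum of log P(s_i | s_{i-N}..s_{i-1}) over positions with a full context."""
    return sum(math.log(prob(seq[i - order:i], seq[i]))
               for i in range(order, len(seq)))

def interpolated_prob(models, weights, history, symbol):
    """Mix estimates from orders 0..N; models[n] is the order-n chain.

    Fixed weights summing to 1 are a simplification for this sketch.
    """
    total = 0.0
    for n, (prob, weight) in enumerate(zip(models, weights)):
        context = history[len(history) - n:]  # last n symbols ("" for n = 0)
        total += weight * prob(context, symbol)
    return total

# Toy training data, purely illustrative.
train = ["ACGTACGTACGT", "ACGTTTACGT"]
p2 = train_markov_chain(train, order=2)
print(log_probability("ACGTACGT", p2, order=2))

models = [train_markov_chain(train, order=n) for n in range(3)]
print(interpolated_prob(models, [0.2, 0.3, 0.5], "AC", "G"))
```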

Promoter model
Simple approach: Markov chain model
Better: Take structure into account
Generalized hidden Markov model
  Each state contains a submodel for a specific promoter part, including an explicit length distribution
  Interpolated Markov chains as submodels
Ohler et al., Bioinformatics 1999, PSB 2000
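The idea of states with explicit length distributions can be sketched as a dynamic program over a left-to-right chain of states, where each state explains one contiguous segment. This is an assumption-laden toy, not the published model: the state names, length distributions, and base frequencies are invented, and a simple order-0 base distribution stands in for the interpolated Markov chain submodels.

```python
import math

def segment_log_prob(segment, base_probs):
    """Order-0 submodel: independent base probabilities (a stand-in only)."""
    return sum(math.log(base_probs[b]) for b in segment)

def best_segmentation(seq, states):
    """Best split of seq into one segment per state, in the given order.

    states: list of (name, length_probs, base_probs), where length_probs maps
    each allowed segment length to a nonzero probability.
    Returns (best log score, [(name, start, end), ...]).
    """
    n, k_max = len(seq), len(states)
    neg_inf = float("-inf")
    # best[k][j]: best log score explaining seq[:j] with the first k states.
    best = [[neg_inf] * (n + 1) for _ in range(k_max + 1)]
    back = [[None] * (n + 1) for _ in range(k_max + 1)]
    best[0][0] = 0.0
    for k, (_, length_probs, base_probs) in enumerate(states, start=1):
        for j in range(n + 1):
            for length, p_len in length_probs.items():
                i = j - length
                if i < 0 or best[k - 1][i] == neg_inf:
                    continue
                score = (best[k - 1][i] + math.log(p_len)
                         + segment_log_prob(seq[i:j], base_probs))
                if score > best[k][j]:
                    best[k][j], back[k][j] = score, i
    if best[k_max][n] == neg_inf:
        return neg_inf, []
    # Trace segment boundaries back from the last state to the first.
    segments, j = [], n
    for k in range(k_max, 0, -1):
        i = back[k][j]
        segments.append((states[k - 1][0], i, j))
        j = i
    return best[k_max][n], segments[::-1]

uniform = {b: 0.25 for b in "ACGT"}
at_rich = {"A": 0.45, "T": 0.45, "C": 0.05, "G": 0.05}
# Hypothetical three-state promoter layout for illustration.
states = [
    ("upstream", {3: 0.5, 4: 0.5}, uniform),
    ("TATA", {4: 1.0}, at_rich),
    ("downstream", {3: 0.5, 4: 0.5}, uniform),
]
score, segments = best_segmentation("GCGCTATAGCG", states)
print(segments)
```

Because segment lengths are scored explicitly, the model can prefer a 4-base AT-rich element at the right offset instead of relying on the geometric length distribution implied by ordinary HMM self-transitions.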

Example: stat6 promoter http://genes.mit.edu/McPromoter.html

Evaluation of ENCODE regions
Similar problem to alternative splicing: alternative transcription start sites
  Traditionally, the window to count false positives has been very large (e.g., -2,000/+2,000), and close predictions within a large window are merged
  Evaluate on a per-gene basis, i.e., count a true positive if a prediction hits at least one of the annotated TSSs
Second problem: False negatives
  After GASP, counting as false positives only those predictions internal to the annotated transcripts is the de facto standard
435 genes / 1,022 different TSSs
Another problem: Circularity? (use of Eponine)
Reese et al., Genome Res (2000)

Results in the ENCODE region
Standard parameters, NO repeat masking, merging predictions within 2,000 nt: 695 predictions
Positive region -2,000/+2,000: 204 TP / 197 genes (sn 47%); 77 FP (sp 73%); 414 unknown
More stringent, -500/+500: 169 TP (sn 39%); 101 FP (sp 63%)
Does it make sense to move towards a more detailed evaluation?
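The reported percentages can be reproduced from the counts under one reading of the slide, which is an assumption here: sensitivity as TP divided by the 435 annotated genes, and "specificity" in the gene-finding sense of TP / (TP + FP), i.e. the fraction of scored predictions that are correct.

```python
N_GENES = 435  # annotated genes in the ENCODE regions, from the evaluation slide

def sensitivity(tp, n_genes=N_GENES):
    """Assumed definition: true positives per annotated gene."""
    return tp / n_genes

def specificity(tp, fp):
    """Assumed definition: TP / (TP + FP), as used in gene-finding evaluations."""
    return tp / (tp + fp)

for label, tp, fp in [("-2,000/+2,000", 204, 77), ("-500/+500", 169, 101)]:
    print(label, f"sn {sensitivity(tp):.0%}", f"sp {specificity(tp, fp):.0%}")

# The three categories for the wide window add up to the total prediction count.
assert 204 + 77 + 414 == 695
```

Under these definitions the wide window gives sn 47% and sp 73%, and the stringent window gives sn 39% and sp 63%, matching the slide.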

Thanks to...
Berkeley Drosophila Genome Project: Gerry Rubin, Martin Reese, Suzi Lewis
Erlangen, Institute for Computer Science: Heinrich Niemann, Stefan Harbeck, Georg Stemmer