Promoter prediction assessment by Vladimir B Bajic ENCODE Workshop 2005 at Sanger Institute.

Predictors
- ENCODE participants (3): McPromoter1, McPromoter2, Fprom
- Additional predictors beyond ENCODE participants (4, out of competition): DBTSS (reference experimental dataset of capped flcDNA), FirstEF, Dragon Gene Start Finder, Dragon Promoter Finder

Goals
- How good are promoter predictors?
- Does performance change on this dataset?
- Implications for future developments

Data (1)
- Category “Known genes with CDS” (category = 2)
- 1061 annotated transcripts, giving 994 unique transcript start sites (TSSs)
- 319 unique TSSs in the ENCODE ‘training’ set (13 regions)
- 675 unique TSSs in the ENCODE test set
- Length of ENCODE regions: 29,998,060 bp
- Length of ‘training’ regions: 8,538,447 bp
- Length of testing regions: 21,459,613 bp

Data (2)
Columns: programs, predictions, unique
- McPromoter1: 694
- McPromoter2: 727
- Fprom (combined with gene annotation from Fgenesh)
- DBTSS (exp)
- FEF: 1266
- DPF: 2168
- DGSF: 628

Method for counting TP and FP
- All hits to ‘orange’ count as FPs
- Only one hit within A, B, or C counts as a TP for a unique TSS position (3 hits within C count as only 1 TP)
- Only the minimum distance from all TSSs counts
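The counting rules above can be sketched in a few lines of Python. This is a minimal illustration and not the code used in the assessment. It assumes that regions A, B, and C together form a window of `max_dist` nt around each annotated TSS (1000 nt is the limit quoted in the Comments slide) and that ‘orange’ is everything outside such windows. Strand is ignored, and the function and parameter names are hypothetical.

```python
from bisect import bisect_left

def count_tp_fp(predictions, tss_positions, max_dist=1000):
    """Return (tp, fp, fn) for predicted positions against annotated TSSs.

    Assumption: a prediction is a hit if it lies within max_dist nt of
    its nearest annotated TSS; everything else is a false positive.
    """
    tss = sorted(set(tss_positions))
    hit_tss = set()  # annotated TSSs matched by at least one prediction
    fp = 0
    for p in predictions:
        # Only the minimum distance over all TSSs counts, so look at the
        # annotated TSSs immediately left and right of the prediction.
        i = bisect_left(tss, p)
        neighbours = tss[max(i - 1, 0):i + 1]
        nearest = min(neighbours, key=lambda t: abs(t - p), default=None)
        if nearest is not None and abs(nearest - p) <= max_dist:
            # Repeated hits to the same TSS collapse into a single TP.
            hit_tss.add(nearest)
        else:
            fp += 1
    tp = len(hit_tss)
    fn = len(tss) - tp  # annotated TSSs that no prediction reached
    return tp, fp, fn
```

With two annotated TSSs at 0 and 10000 and predictions at 100, 150, 900, and 5000, the three predictions near position 0 yield one TP, the prediction at 5000 is an FP, and the TSS at 10000 is missed.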

Results
- Different measures of success
- Test ENCODE regions
- Also: comparison with other participants (test + all regions)

Se, ppv, AE (average positional error)
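For reference, the three measures named on this slide can be computed as follows. Se and ppv use their standard definitions, TP/(TP+FN) and TP/(TP+FP). AE is assumed here to be the mean distance between each matched prediction and its nearest annotated TSS, since the slide gives no formula. The function name and the input numbers are hypothetical.

```python
def se_ppv_ae(tp, fp, fn, distances):
    """Return (Se, ppv, AE) from TP/FP/FN counts and positional errors.

    distances: absolute distance in nt between each true-positive
    prediction and its annotated TSS (assumed definition of AE).
    """
    se = tp / (tp + fn) if tp + fn else 0.0    # sensitivity
    ppv = tp / (tp + fp) if tp + fp else 0.0   # positive predictive value
    ae = sum(distances) / len(distances) if distances else float("nan")
    return se, ppv, ae

# Toy numbers for illustration only, not taken from the assessment.
se, ppv, ae = se_ppv_ae(tp=3, fp=1, fn=1, distances=[50, 100, 150])
```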

DIP1, DIP2, CC, ASM

Comments
- Compared to the previous whole human genome analysis, we now use a stricter distance constraint: max allowed distance 1000 nt (vs. previous 2000 nt)
- Previously: Se [0.4 – 0.8], ppv [0.25 – 0.67]
- Now, for experimental DBTSS data: positional error ~100 nt, Se 0.61, ppv 0.93
- Computational promoter prediction (CPP) (using single genome, no transcripts):
  - positional error nt (2-3 fold larger than DBTSS) (positive surprise)
  - Se [ ] (negative surprise but expected); reason: poor G+C content of some of the test regions
  - ppv >80% (in some cases >90%) (positive surprise)
- Having in mind the type of information used for ab initio promoter finding, we see no dramatic difference in 5’ end prediction between methods of class 1 and class 3 and CPP (positive surprise); however, Se and ppv are better with methods of class 1 and class 3 for obvious reasons.

Future developments
- Combine TSS predictors with gene finding programs or transcript info (positive effects of this are visible in Fprom, and , since in these cases the TSS search space is effectively restricted)
- This, however, requires retuning of TSS predictors and some change in their design philosophy
- Expected performance should be similar to or better than that of class 1 and class 3 systems, as TSS finding systems should be more specialized for the 5’ end type of signals
- More emphasis should be given to the positional accuracy of TSS predictors
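One way to picture the restricted search space is a filter that keeps only those TSS candidates lying close to a 5’ end proposed by a gene finder. The sketch below illustrates that idea only and does not describe how Fprom or any other program works. The function name, the symmetric window, and the 1000 nt default are all assumptions, and strand is ignored.

```python
def restrict_tss_candidates(candidates, gene_5prime_ends, window=1000):
    """Keep TSS candidates within `window` nt of any predicted gene 5' end.

    candidates: positions proposed by a stand-alone TSS predictor.
    gene_5prime_ends: 5' end positions proposed by a gene finder
    (hypothetical input; any source of transcript starts would do).
    """
    kept = []
    for pos in candidates:
        # A candidate survives if at least one gene start is close enough.
        if any(abs(pos - end) <= window for end in gene_5prime_ends):
            kept.append(pos)
    return kept
```

A filter of this kind would discard candidates far from any predicted gene, which would tend to raise ppv, but it can also discard real TSSs wherever the gene finder misses a gene.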

Thank you for your time. You may wake up now.