EAnnot: A genome annotation tool using experimental evidence Aniko Sabo & Li Ding Genome Sequencing Center Washington University, St. Louis.

Challenge: manual annotation of human chromosomes 2 and 4. An overwhelming amount of expression sequence data for annotators to review.

Why was EAnnot created?
EAnnot = Electronic Annotation. Created to aid manual annotation by removing the most time-consuming and repetitive tasks:
- Initial creation of gene models
- Evidence attachment
- Evaluating CDS translation
- Locus information addition

How does EAnnot work?
INPUT: Genomic sequence (clones, contigs, chromosomes); mRNA, EST, and protein alignments
STEP 1: Gene boundaries created based on strand assignment, sequence overlap, and clone linking
STEP 2: mRNAs and ESTs clustered, gene models created, exon/intron boundaries fine-tuned using splice table
STEP 3: Gene models evaluated and corrected based on protein data
STEP 4 / OUTPUT: Annotated gene models

STEP 1: Gene boundaries created based on strand assignment, sequence overlap, and clone linking
[Figure: gene boundaries drawn around alignments on the same strand whose sequences overlap; clone linking joins ESTs that do not overlap but are paired end reads.]
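The grouping described in STEP 1 can be sketched as an interval merge. This is only an illustration under stated assumptions, not EAnnot's actual implementation: the `Alignment` record, the `clone` field, and the `gene_boundaries` function are hypothetical names, and alignments are assumed to be already oriented to the transcript strand.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Alignment:
    """Hypothetical record for one mRNA/EST alignment on the genomic sequence."""
    name: str
    strand: str                  # "+" or "-"
    start: int
    end: int
    clone: Optional[str] = None  # assumed ID shared by paired end reads of one clone


def gene_boundaries(alignments: List[Alignment]) -> List[Tuple[str, int, int]]:
    """Return (strand, start, end) regions: overlapping same-strand alignments
    are merged, and reads from the same clone are linked even without overlap."""
    regions = []
    for strand in ("+", "-"):
        hits = [a for a in alignments if a.strand == strand]
        # One span per alignment ...
        spans = [(a.start, a.end) for a in hits]
        # ... plus one span per clone covering all of its reads (clone linking).
        by_clone = {}
        for a in hits:
            if a.clone is not None:
                by_clone.setdefault(a.clone, []).append(a)
        for reads in by_clone.values():
            spans.append((min(r.start for r in reads), max(r.end for r in reads)))
        # Classic sweep: merge spans that overlap the region being built.
        merged = []
        for start, end in sorted(spans):
            if merged and start <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], end)
            else:
                merged.append([start, end])
        regions.extend((strand, s, e) for s, e in merged)
    return regions
```

Treating each clone as one extra span keeps the merge a single pass, at the cost of also absorbing any same-strand alignment that falls between two linked reads.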

STEP 2: mRNA and EST clustering, gene models created
[Figure: multiple EST and mRNA alignments -> gene models]
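Fine-tuning an exon/intron boundary with a splice table might look like the sketch below. The slides do not give the table contents or the search procedure, so everything here is an assumption: `SPLICE_TABLE` holds illustrative ranks for well-known donor/acceptor dinucleotide pairs, and `refine_intron` and its `window` parameter are hypothetical.

```python
# Illustrative ranks only (higher = preferred); a real, organism-specific
# splice table would be derived from confirmed introns.
SPLICE_TABLE = {
    ("GT", "AG"): 3,  # canonical pair
    ("GC", "AG"): 2,
    ("AT", "AC"): 1,
}


def refine_intron(seq, start, end, window=3):
    """Given an approximate intron seq[start:end] (0-based, end exclusive),
    search +/- window bases around both ends for the best-ranked
    donor/acceptor pair. Returns the input unchanged if none is found."""
    best_key, best_coords = None, (start, end)
    for ds in range(-window, window + 1):
        for de in range(-window, window + 1):
            s, e = start + ds, end + de
            if s < 0 or e > len(seq) or e - s < 4:
                continue
            rank = SPLICE_TABLE.get((seq[s:s + 2], seq[e - 2:e]), 0)
            if rank == 0:
                continue
            # Prefer the higher rank, then the smaller total shift.
            key = (rank, -(abs(ds) + abs(de)))
            if best_key is None or key > best_key:
                best_key, best_coords = key, (s, e)
    return best_coords
```

A production tool would also need to keep the cDNA alignment consistent when it shifts a boundary; this sketch only looks at the genomic dinucleotides.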

STEP 3: Gene models evaluated, corrected based on protein data
Gene model translation is compared with the matching protein from GenBank. If there is a discrepancy, EAnnot tries to adjust the gene model to resolve frame shifts, insertions, and deletions.
[Figure labels: DNA, Translation, Frame shift, 3' STOP (*)]
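The kind of check STEP 3 describes can be illustrated as follows. This is a minimal sketch that only detects problems and does not attempt EAnnot's corrections; `evaluate_cds` and its remark strings are hypothetical, and the protein comparison is a plain residue-by-residue match rather than a real alignment. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


def evaluate_cds(cds, protein):
    """Return a list of remarks describing problems with a gene model's CDS."""
    remarks = []
    if len(cds) % 3:
        remarks.append("CDS length is not a multiple of 3 (possible frame shift)")
    pep = translate(cds.upper())
    if "*" in pep[:-1]:
        remarks.append("internal stop codon")
    if not pep.endswith("*"):
        remarks.append("no stop codon at 3' end")
    body = pep[:-1] if pep.endswith("*") else pep
    if body != protein:
        # First differing residue, or the shorter length if one is a prefix.
        pos = next((i for i, (a, b) in enumerate(zip(body, protein)) if a != b),
                   min(len(body), len(protein)))
        remarks.append(f"translation differs from protein at residue {pos + 1}")
    return remarks
```

An empty list means the model translates cleanly and matches the protein; anything else is the sort of note that could be passed on to an annotator.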

STEP 4: OUTPUT: gene models
[Figure: expression sequence data and the resulting gene models]

STEP 4: Gene models annotated
- Supporting evidence: protein, EST, mRNA
- Locus information

Unresolved problems with the CDS are placed in the remark field for the annotators

PolyA signal and site annotation
- Uses spliced and non-spliced ESTs and mRNAs with a polyA tail
- The presence of a polyA site/signal in non-spliced ESTs is additional evidence for putative genes
[Figure labels: PolyA signal, PolyA site]
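A polyA signal search can be sketched as a scan for the canonical AATAAA hexamer (or the common ATTAAA variant) a short distance upstream of the polyA site. The slides do not state EAnnot's criteria, so the 10 to 30 nt window, the function name `find_polya_signal`, and its parameters are assumptions for illustration.

```python
POLYA_SIGNALS = ("AATAAA", "ATTAAA")  # canonical hexamer and a common variant


def find_polya_signal(genomic, site, min_dist=10, max_dist=30):
    """Look for a polyA signal upstream of a polyA site.

    genomic: forward-strand sequence of the transcript's strand
    site:    0-based index of the polyA site (where the polyA tail begins)
    Returns (position, hexamer) for the signal closest to the site whose start
    lies min_dist..max_dist bases upstream, or None if there is none.
    """
    lo = max(0, site - max_dist)
    hi = site - min_dist
    found = None
    for pos in range(lo, hi + 1):
        hexamer = genomic[pos:pos + 6]
        if hexamer in POLYA_SIGNALS:
            found = (pos, hexamer)  # later hits are closer to the site
    return found
```

For a non-spliced EST ending in a polyA tail, a hit from this scan is the sort of supporting signal that would count as additional evidence for a putative gene.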

EAnnot performance evaluation
Human chromosome 6 annotation (Sanger)
- Manual annotation: 1557 genes, 3271 transcripts
- EAnnot annotation: 1724 genes, 5266 transcripts
- Gene level: 87% of manually annotated genes overlap EAnnot genes; 20% of EAnnot genes do not overlap manually annotated genes
- Splice site level: sensitivity 86%, specificity 86%
EAnnot can be a good stand-alone annotation tool
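The slide does not define how sensitivity and specificity were computed. A minimal sketch, assuming the convention common in gene prediction evaluation (sensitivity = TP / reference sites, specificity = TP / predicted sites), could look like this; `splice_site_accuracy` and the toy coordinates in the usage example are hypothetical and not taken from the chromosome 6 data.

```python
def splice_site_accuracy(reference, predicted):
    """Compare two collections of splice sites, each site given as a hashable
    key such as (chromosome, position, "donor" | "acceptor").

    Returns (sensitivity, specificity) under the gene-prediction convention:
      sensitivity = shared sites / reference sites
      specificity = shared sites / predicted sites
    """
    reference, predicted = set(reference), set(predicted)
    shared = len(reference & predicted)
    sensitivity = shared / len(reference) if reference else 0.0
    specificity = shared / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Toy example: 4 manually annotated sites, 5 predicted sites, 3 in common.
manual = {("chr6", 100, "donor"), ("chr6", 250, "acceptor"),
          ("chr6", 400, "donor"), ("chr6", 520, "acceptor")}
auto = {("chr6", 100, "donor"), ("chr6", 250, "acceptor"),
        ("chr6", 400, "donor"), ("chr6", 530, "acceptor"),
        ("chr6", 700, "donor")}
sn, sp = splice_site_accuracy(manual, auto)
```

Here the manual annotation plays the role of the reference set, as in the chromosome 6 comparison above.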

Comparison with chr6 manual annotation: EAnnot gene models are the same as the manually annotated ones

Comparison with chr6 manual annotation: EAnnot split the gene model. Manual annotation used a rat mRNA; in EAnnot the rat mRNA did not pass the threshold.

Comparison with chr6 manual annotation: EAnnot miss; the supporting EST did not pass the threshold

Comparison with chr6 manual annotation: EAnnot created an additional splice form

Using EAnnot in annotation of non-human genomes: Example Histoplasma capsulatum
Issues and strategies:
- Issue: Organism-specific expression data not abundant in GenBank. Strategy: Use all available data; gene stitching, merging data.
- Issue: Average homology low. Strategy: Lower identity and gap thresholds.
- Issue: Genes different from vertebrate genes (large exons, small introns). Strategy: Lower gene and intron size parameters.
- Issue: Splice variants. Strategy: Splice variants based on organism-specific expression data.
- Issue: Splice consensus preference. Strategy: Organism-specific splice table.

[Figure: protein based models and a Histoplasma EST based model combined into a merged model]
Merging depends on the type and quality of the underlying data

Manual annotation:
- EAnnot saves time by creating gene models and attaching information (supporting evidence, CDS evaluation, locus)
- Increases accuracy and consistency
EAnnot can be used as a stand-alone gene prediction tool
Future: other formats in addition to AceDB

GSC annotation group: Aniko Sabo, Li Ding, Rekha Meyer, Tamberlyn Bieri, Phil Ozersky, Nicolas Berkowicz, LaDeana Hillier, Kym Pepin, John Spieth

Annotates pseudogenes based on RefSeq LocusLink information and FISH banding patterns