Gene Structure Annotation. Philippe Lamesch. International Arabidopsis conference, July 23, 2008, Montreal.

TAIR: An overview
- Gene function: Debbie Alexander
- Gene structure: Philippe Lamesch
- Metabolic pathways: Kate Dreher

TAIR: An overview (workflow diagram labels)
- ESTs, cDNAs
- User submissions
- Internal TAIR projects
- Computational pipeline
- Manual annotation
- New release
- TAIR web

Outline
- Overview of TAIR8
  - Data availability
  - Assembly updates
  - Transposable elements
- Plans for TAIR9
  - Gene confidence
  - Utilising comparative, proteomic and transcriptome data

TAIR8 Release
- 33,282 total genes
- 1291 new genes
- 50 obsolete genes
- 41 merges, 33 splits
- 23% (7380) of TAIR7 genes updated
- Source of updates:
  - Submissions from community (reviewed by TAIR)
  - Manual annotation in-house
  - Computational pipeline (PASA)

Genome Annotation Portal ion_data.jsp

Sequences and information: TAIR FTP
ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR8_genome_release/
- Sequences
- GFF/XML/NCBI.tbl
- Updates
- Conversion files
- Associations
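A minimal sketch of how the GFF files from the release directory might be summarised once downloaded. It assumes the standard nine-column, tab-separated GFF layout; the function name `count_feature_types` and the example records are illustrative and are not copied from the TAIR8 release.

```python
from collections import Counter


def count_feature_types(gff_lines):
    """Count features per type (column 3) in GFF-style text lines."""
    counts = Counter()
    for line in gff_lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/directive lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # A well-formed GFF record has exactly nine columns.
        if len(fields) != 9:
            continue
        counts[fields[2]] += 1
    return counts


# Illustrative records only (assumed layout, not taken from the release).
example = [
    "##gff-version 3",
    "Chr1\tTAIR8\tgene\t3631\t5899\t.\t+\t.\tID=AT1G01010",
    "Chr1\tTAIR8\tmRNA\t3631\t5899\t.\t+\t.\tID=AT1G01010.1;Parent=AT1G01010",
    "Chr1\tTAIR8\texon\t3631\t3913\t.\t+\t.\tParent=AT1G01010.1",
]
print(count_feature_types(example))
```

The same function can be fed an open file handle, since it only iterates over lines.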

Browse the genome: Seqviewer
- Data types

Browse the genome: GBrowse
- Data types
- >50 tracks

Changes made for TAIR8
- Assembly updates
  - Remove sequence contamination
  - Single base pair errors
- Addition of transposable elements

Assembly updates
- Genome assembly unchanged since TIGR5 (prior to TAIR8)
- Remove sequence contamination
  - Detection: Vector = NCBI VecScreen, Webcutter 2.0; E. coli = Megablast vs. E. coli (nr); Rice = Community
  - Found: Vector/E. coli = 12 regions; Rice = 2 regions
  - Equivalent number of Ns substituted
  - 8 genes set to obsolete, 2 modified
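Substituting an equivalent number of Ns keeps every downstream coordinate unchanged. The sketch below shows that idea under simple assumptions: regions are 1-based and inclusive, and the function name `mask_regions` is hypothetical rather than part of any TAIR tool.

```python
def mask_regions(sequence, regions):
    """Replace each 1-based inclusive (start, end) region with N's of equal length."""
    seq = list(sequence)
    for start, end in regions:
        if not (1 <= start <= end <= len(seq)):
            raise ValueError(f"region {start}-{end} is outside the sequence")
        # Same number of N's as bases removed, so the length is preserved.
        seq[start - 1:end] = "N" * (end - start + 1)
    return "".join(seq)


masked = mask_regions("ACGTACGTAC", [(3, 5)])
print(masked)
```

Because the sequence length does not change, existing gene coordinates outside the masked regions remain valid without any remapping.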

Assembly updates: single base pair errors
- Solexa read data (Columbia) supplied by Joe Ecker's lab (Salk Institute)
- 1425 bases changed (called 2 or greater; consensus base called >= 75% of the time)
- No minority read support / no Ler support
- Confirmed base changes where they overlap current annotation
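One possible reading of these criteria, sketched in code. The interpretation is an assumption: the consensus base must be called at least twice, must account for at least 75% of the reads at that position, and the current reference base must have no read support at all. The Ler check is omitted, and `consensus_change` with its parameters is a hypothetical name, not the pipeline actually used.

```python
from collections import Counter


def consensus_change(reference_base, read_bases, min_calls=2, min_fraction=0.75):
    """Return the consensus base if the reads support changing the reference, else None."""
    if not read_bases:
        return None
    counts = Counter(read_bases)
    base, n = counts.most_common(1)[0]
    # Nothing to change if the reads agree with the reference.
    if base == reference_base:
        return None
    # Assumed thresholds: at least min_calls calls and min_fraction of all reads.
    if n < min_calls or n / len(read_bases) < min_fraction:
        return None
    # Assumed meaning of "no minority read support": no read shows the reference base.
    if counts[reference_base] > 0:
        return None
    return base


print(consensus_change("A", "GGGC"))
```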

Assembly updates: single base pair errors
- 1425 bases changed
- 157 gene model protein sequences updated
- 518 had either protein/CDS, mRNA or genomic sequence updated

Assembly updates - GBrowse Gaps

Transposable Elements (TE) & TE-genes
- 31,060 elements, 339 families, 17 superfamilies
- Hadi Quesneville, Institut Jacques Monod (Buisine et al., Genomics, 2008)
- Combines evidence from multiple homology-based predictions

Transposable Element (figure labels: HELITRON4 family DNA transposon; Unknown pseudogenes; Overlapping TEs; Protein alignments)
In TAIR7:
- pseudogenes and transposable elements all part of pseudogene class
- no defined transposable element type
- not all TE-genes have TE descriptions

Identifying TE-genes
- Categorization as TE-gene
  - By % overlap with TE (100, >70, >50, below 50)
  - Similarity to set of known TE-proteins
- Manual review
  - Additional checks (description, GO terms, publications, transcript evidence)
- 3900 AGI genes were reclassified (720 previously classed as protein coding)
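The overlap-based categorization can be sketched as follows. This is an illustration under stated assumptions, not the procedure TAIR ran: coordinates are 1-based inclusive, overlapping TEs are merged so no base is counted twice, and a gene with exactly 50% overlap falls in the lowest bin. The names `overlap_fraction` and `overlap_bin` are hypothetical.

```python
def overlap_fraction(gene, tes):
    """Fraction of the gene's bases covered by at least one TE interval."""
    g_start, g_end = gene
    # Clip each TE to the gene and sort so overlapping TEs can be merged.
    clipped = sorted(
        (max(s, g_start), min(e, g_end))
        for s, e in tes
        if s <= g_end and e >= g_start
    )
    covered = 0
    cur_end = g_start - 1  # last gene base already counted
    for s, e in clipped:
        if e <= cur_end:
            continue
        covered += e - max(s, cur_end + 1) + 1
        cur_end = e
    return covered / (g_end - g_start + 1)


def overlap_bin(fraction):
    """Map an overlap fraction to the bins named on the slide."""
    if fraction >= 1.0:
        return "100"
    if fraction > 0.70:
        return ">70"
    if fraction > 0.50:
        return ">50"
    return "below 50"


frac = overlap_fraction((101, 200), [(50, 150), (140, 180)])
print(frac, overlap_bin(frac))
```

A bin assigned this way would only be a first filter; the slide lists similarity to known TE-proteins and manual review as further steps.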

Transposons & TAIR TE given ID AT2TE ,189 TEs, 3900 TE-genes

Plans for TAIR9

Gene confidence score
Why assign a confidence score?
- Differentiates well supported, partially supported and non-supported models
- Allows TAIR users to target particular categories
  - For further experimentation
  - For use as a reference set
  - For computational analysis
- Allows TAIR to target partially supported genes
- Provides a measure with which to monitor improvement

Gene confidence outline
- Categories of evidence
  - Transcript (cDNA/EST)
  - Protein
  - Conservation
  - Proteomic data
  - Transcriptome data (MPSS etc.)
- Rankings within category
- Assign confidence score/rank to model + exons

Transcript exon rankings: internal exons
- Splice sites confirmed by transcript
- Transcript only overlaps exon
- Intermediates
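A rough sketch of how an internal exon might be placed into these categories from transcript alignment blocks. The scoring is an assumption made for illustration: a block whose start and end both match the exon confirms both splice sites, a single matching boundary is treated as an intermediate, and any other overlap counts as overlap only. Coordinates are 1-based inclusive, each block is judged on its own, and `rank_internal_exon` is a hypothetical name.

```python
LABELS = {
    0: "no transcript support",
    1: "transcript only overlaps exon",
    2: "one splice site confirmed (intermediate)",
    3: "both splice sites confirmed",
}


def rank_internal_exon(exon, transcript_blocks):
    """Return the best support label any single aligned block gives the exon."""
    start, end = exon
    best = 0
    for b_start, b_end in transcript_blocks:
        # Ignore blocks that do not overlap the exon at all.
        if b_end < start or b_start > end:
            continue
        # Count how many of the two exon boundaries this block reproduces.
        matches = (b_start == start) + (b_end == end)
        best = max(best, 1 + matches)
    return LABELS[best]


print(rank_internal_exon((100, 200), [(120, 180), (100, 200)]))
```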

Transcript Model rankings Intermediates

Gene confidence outline
- Provide evidence ranks on web pages/GFF

  Category                         Rank
  Transcript (cDNA/EST)            7
  Protein                          2
  Conservation                     2
  Proteomic data                   0
  Transcriptome data (MPSS etc.)   0

- Include overall rank (incorporating all evidence)
- Associate general description to each overall rank, e.g. Confirmed, partially confirmed; or Platinum, Gold, Silver etc.
- Exon ranks included in GFF file
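As an illustration of how per-category ranks could travel in a GFF file, the sketch below serialises them into a GFF3-style attribute column (semicolon-separated `key=value` pairs). The attribute names, the feature ID `MODEL1` and the function `ranks_to_gff_attributes` are all hypothetical; the slide does not specify the format TAIR planned to use, nor how the overall rank would be computed, so no overall rank is derived here.

```python
def ranks_to_gff_attributes(feature_id, ranks):
    """Build a GFF3-style attribute string carrying one rank per evidence category."""
    pairs = [f"ID={feature_id}"]
    for category, rank in ranks.items():
        # Lower-case keys, since GFF3 reserves capitalised attribute names.
        key = category.lower().replace(" ", "_") + "_rank"
        pairs.append(f"{key}={rank}")
    return ";".join(pairs)


# Example ranks as shown on the slide; the feature ID is made up.
example_ranks = {
    "Transcript": 7,
    "Protein": 2,
    "Conservation": 2,
    "Proteomic": 0,
    "Transcriptome": 0,
}
attributes = ranks_to_gff_attributes("MODEL1", example_ranks)
print(attributes)
```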

Improving genome annotation: a collective approach
- Gene confidence score
- Possible misannotated genes

Improving genome annotation: a collective approach
- Alternative gene models: Gnomon, Aceview, Eugene, Hanada et al.
- Gene structure updates
- Alternative splice variants
- Possible misannotated genes

Improving genome annotation: a collective approach
- Update TSS
- PlantPromoter elements (Yamamoto et al.)
- Possible misannotated genes

Improving genome annotation: a collective approach
- Update gene on translational level
- Proteomics data (Baerenfaller et al.)
- Incorrect start codon
- Possible misannotated genes

Improving genome annotation: a collective approach
- Identify missing exons/genes
- Cross-species sequence conservation
- VISTA plots (Dubchak Lab)
- Possible misannotated genes

A collective approach
- Gene confidence: identify weakly supported genes
- Utilise alternative gene predictions, comparative alignments, transcriptome and proteomic data
- Combined manual and computational approach