VectorBase genome annotation

Presentation transcript:

VectorBase genome annotation
VectorBase-EBI, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, UK
VectorBase SWG 2006

Overview of current annotation system
Pipeline stages:
- Assembled genome
- Sequencing centre gene predictions
- VectorBase gene predictions
- Merge into canonical set
- Protein analysis
- Display on genome browser
- Release to GenBank/EMBL/DDBJ
Notes:
- Assembled genome: changes to the assembly require significant changes to the annotation process (raw computes, transfer of existing annotations/features).
- Sequencing centre gene set: tendency to overpredict.
- VectorBase gene set: tendency to underpredict.

Merging gene sets
Inputs: Gene set #1 and Gene set #2
- Reduce to single predictions per locus
- Compare exon/intron structures
- Classify each locus as: Identical structures, Compatible structures, Different structures, Merge/Split structures, Complex, or No Map
- Add isoform predictions based on EST/peptide data
Output: Canonical gene set
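The exon/intron comparison step can be sketched in Python. This is a minimal illustration, not the VectorBase implementation. The function names, the tuple representation of exons, and the category definitions are assumptions made for the example: "compatible" is taken to mean the same introns with different terminal exon boundaries. The merge/split and complex cases, which involve more than one model per locus, are not covered.

```python
from typing import List, Set, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive; same strand assumed


def introns(exons: List[Exon]) -> Set[Tuple[int, int]]:
    """Return the set of introns implied by consecutive exons."""
    ordered = sorted(exons)
    return {(cur[1] + 1, nxt[0] - 1) for cur, nxt in zip(ordered, ordered[1:])}


def overlaps(a: List[Exon], b: List[Exon]) -> bool:
    """True if any exon of model a overlaps any exon of model b."""
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)


def classify(a: List[Exon], b: List[Exon]) -> str:
    """Classify a pair of single-transcript gene models (hypothetical rules)."""
    if not overlaps(a, b):
        return "no map"
    if sorted(a) == sorted(b):
        return "identical"
    # Assumption: same introns but different outer boundaries = compatible.
    if introns(a) == introns(b):
        return "compatible"
    return "different"


model = [(100, 200), (300, 400)]
print(classify(model, [(100, 200), (300, 400)]))
print(classify(model, [(90, 200), (300, 450)]))
print(classify(model, [(100, 220), (300, 400)]))
print(classify(model, [(1000, 1100)]))
```

A real merge would also need strand, multiple isoforms per locus, and one-to-many overlaps to detect merged or split models.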

Data types used for gene prediction/validation
- Protein sequences
  - ‘Self’ (i.e. the species to be predicted)
  - Taxonomic splits of UniProtKB
- Transcript sequences
  - mRNAs
  - ESTs
- Evidence of expression
  - Microarray
  - SAGE tags
  - Ditags
  - MPSS
  - Proteomics data
- Sequence statistics
  - Coding potential
  - Splice site prediction
Note on MPSS (Massively Parallel Signature Sequencing): short DNA fragments are immobilized on millions of beads, and these oligomers are then sequenced in parallel (i.e. fast). It is claimed to be able to count 1 million mRNAs at a time.

VectorBase gene prediction pipeline
Blessed predictions:
- Manual annotations (Apollo)
- Community submissions (Genewise, Exonerate, Apollo)
Canonical predictions:
- Similarity predictions (Genewise)
- Species-specific predictions (Genewise)
- Protein family HMMs (Genewise)
- ncRNA predictions (Rfam)
- Transcript based predictions (Exonerate)
- Ab initio gene predictions (SNAP)

VectorBase curation database pipeline for manual/community annotation
Components shown on the slide:
- Annotation sources: Community annotation (community representatives); Manual annotation (Harvard); Community annotation (in collaboration with Harvard)
- Editing tool: Apollo
- Exchange formats: Chado-XML, GFF3
- Databases: Curation warehouse db, Chado, Ensembl gene build db
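The slide lists GFF3 among the formats in this pipeline. As a rough illustration of what a GFF3 record carries, the sketch below parses one nine-column feature line into a dictionary. The function name, the returned keys, and the example record (sequence name, source, IDs) are hypothetical and are not taken from VectorBase data or code.

```python
from typing import Dict, List, Optional
from urllib.parse import unquote


def parse_gff3_line(line: str) -> Optional[Dict[str, object]]:
    """Parse one GFF3 feature line; return None for blank or comment lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    cols = line.split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols
    attributes: Dict[str, List[str]] = {}
    if attrs != ".":
        for pair in attrs.split(";"):
            if pair:
                key, _, value = pair.partition("=")
                # Multiple values are comma-separated; decode %XX escapes after splitting.
                attributes[unquote(key)] = [unquote(v) for v in value.split(",")]
    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "score": None if score == "." else float(score),
        "strand": strand,
        "phase": None if phase == "." else int(phase),
        "attributes": attributes,
    }


# Hypothetical record for illustration only.
example = "chr1\tcommunity\tgene\t1000\t2500\t.\t+\t.\tID=gene0001;Name=example%20gene"
record = parse_gff3_line(example)
print(record["type"], record["start"], record["end"], record["attributes"])
```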

Overview of current re-annotation system
- Build types: Full gene build; Partial gene build
- Inputs: New gene build; Blessed genes; Species-specific gene prediction; Current gene set
- Steps: Compare, then Merge
- Output: Updated gene set

Comparing new gene builds with the old one
- Use of manual annotation for validation of automated gene build improvements
- Simple statistics (CDS length, intron size, CDS matching TEs)
- BRC annotation metrics
- Supporting evidence for a gene prediction (citation, expression, orthology)
- Attachment of Standard Operating Procedures (SOPs)
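Two of the simple statistics named here, CDS length and intron size, can be computed as sketched below. The data layout (a dictionary mapping gene names to lists of CDS segments), the function names, and the toy coordinates are assumptions for illustration. Gaps between consecutive CDS segments are treated as introns, and the TE-overlap metric is not covered.

```python
from statistics import median
from typing import Dict, List, Optional, Tuple

Segment = Tuple[int, int]  # (start, end), 1-based inclusive


def cds_length(segments: List[Segment]) -> int:
    """Total coding length summed over all CDS segments of one gene."""
    return sum(end - start + 1 for start, end in segments)


def intron_sizes(segments: List[Segment]) -> List[int]:
    """Sizes of the gaps between consecutive CDS segments."""
    ordered = sorted(segments)
    return [nxt[0] - cur[1] - 1 for cur, nxt in zip(ordered, ordered[1:])]


def summarise(build: Dict[str, List[Segment]]) -> Dict[str, Optional[float]]:
    """Summary statistics for one gene build (hypothetical layout)."""
    lengths = [cds_length(s) for s in build.values()]
    gaps = [n for s in build.values() for n in intron_sizes(s)]
    return {
        "genes": len(build),
        "median_cds_length": median(lengths) if lengths else None,
        "median_intron_size": median(gaps) if gaps else None,
    }


# Toy gene builds; the coordinates are invented for the example.
old_build = {"geneA": [(1, 90), (191, 300)], "geneB": [(500, 799)]}
new_build = {"geneA": [(1, 90), (191, 300)], "geneB": [(500, 649), (700, 799)]}

for name, build in (("old", old_build), ("new", new_build)):
    print(name, summarise(build))
```

Comparing the two summaries side by side gives a quick check on whether a new build has shifted the overall shape of the gene set.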

Gene build schedules
- Full gene build: 4 months
- Partial gene build: 1 month
Triggers for re-annotation:
- Temporal: it’s been a while (6-monthly, yearly)
- Data:
  - New EST data for species: new transcript data = better predictions
  - New genomes: new genomes within a clade enable comparative approaches
  - Re-annotated genomes: e.g. new ESTs in Aedes influence prediction changes which affect Anopheles entries; GPCRs in nematodes

VectorBase annotation capacity with increased number of genomes
Gene builds per year per genome:

            2 full      1 full      1 full      1 full
            2 partial   3 partial   2 partial   1 partial
2 genomes   Yes         Yes         Yes         Yes
3 genomes   Yes         Yes         Yes         Yes
4 genomes   No          Yes         Yes         Yes
5 genomes   No          Yes         Yes         Yes
6 genomes   No          No          Yes         Yes
7 genomes   No          No          No          Yes
8 genomes   No          No          No          No
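The Yes/No pattern in this table can be reproduced with a simple workload calculation. The sketch below rests on two assumptions that the slides do not state explicitly: that the "4 months" and "1 month" figures on the schedule slide are the durations of a full and a partial gene build, and that total capacity is 36 build-months per year. The 36 is a hypothetical value chosen because it is consistent with every cell of the table.

```python
FULL_MONTHS = 4       # assumed duration of a full gene build
PARTIAL_MONTHS = 1    # assumed duration of a partial gene build
CAPACITY_MONTHS = 36  # hypothetical yearly capacity; not given in the slides

# (full builds, partial builds) per genome per year, one pair per table column
SCHEDULES = [(2, 2), (1, 3), (1, 2), (1, 1)]


def feasible(genomes: int, full: int, partial: int,
             capacity: int = CAPACITY_MONTHS) -> bool:
    """True if the yearly workload for all genomes fits within capacity."""
    workload = genomes * (full * FULL_MONTHS + partial * PARTIAL_MONTHS)
    return workload <= capacity


# Rebuild the table: number of genomes -> one Yes/No per schedule column.
table = {
    n: ["Yes" if feasible(n, f, p) else "No" for f, p in SCHEDULES]
    for n in range(2, 9)
}

for n, row in table.items():
    print(f"{n} genomes  " + "  ".join(f"{cell:<3}" for cell in row))
```

Under these assumptions, any capacity from 36 up to (but not including) 40 build-months gives the same table, so the exact figure should not be read as the authors' number.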

Re-annotation questions
- Triggers for re-annotation
  - Strict temporal triggers: always do a full gene build every year?
  - Data triggers: how much new data is enough? Knock-on effects of related species (re)annotation?
- Encouraging community submissions
  - How can we get more community annotation input?
  - Outreach at conferences (Roadshow)