1
VectorBase genome annotation
VectorBase-EBI, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton UK VectorBase SWG 2006
2
Overview of current annotation system
Pipeline stages:
- Assembled genome
- Sequencing centre gene predictions and VectorBase gene predictions
- Merge into canonical set
- Protein analysis
- Display on genome browser
- Release to GenBank/EMBL/DDBJ

Notes:
- Assembled genome: changes to the assembly require significant changes to the annotation process (raw computes, transfer of existing annotations/features).
- Sequencing centre gene set: tendency to overpredict.
- VectorBase gene set: tendency to underpredict.
3
Merging gene sets
Inputs: Gene set #1 and Gene set #2
- Reduce to single predictions per locus
- Compare exon/intron structures, giving one of:
  - Identical structures
  - Compatible structures
  - Different structures
  - Merge/Split structures
  - Complex
  - No Map
- Add isoform predictions based on EST/peptide data
Output: Canonical gene set
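The structure comparison step can be illustrated with a minimal Python sketch. The slides name the outcome categories but do not define them, so the definitions here are assumptions: "identical" means the same exon coordinates, "compatible" means the same intron chain with different transcript ends, "no map" means no exon overlap, and everything else is "different". The merge/split and complex categories involve more than one gene per set and are not covered. The function names are hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int]  # (start, end), 1-based inclusive; same strand assumed


def introns(exons: List[Exon]) -> List[Exon]:
    """Return the introns implied by a list of exons."""
    ordered = sorted(exons)
    return [(a_end + 1, b_start - 1)
            for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])]


def overlaps(a: List[Exon], b: List[Exon]) -> bool:
    """True if any exon of a overlaps any exon of b."""
    return any(s1 <= e2 and s2 <= e1 for s1, e1 in a for s2, e2 in b)


def classify(a: List[Exon], b: List[Exon]) -> str:
    """Classify two single-transcript predictions (assumed definitions)."""
    if not overlaps(a, b):
        return "no map"
    if sorted(a) == sorted(b):
        return "identical"
    intron_chain_a, intron_chain_b = introns(a), introns(b)
    # Assumption: "compatible" = same intron chain, only the ends differ.
    if intron_chain_a and intron_chain_a == intron_chain_b:
        return "compatible"
    return "different"


if __name__ == "__main__":
    set1 = [(100, 200), (300, 400)]
    set2 = [(90, 200), (300, 450)]
    print(classify(set1, set2))
```

In this sketch, identical and compatible pairs would collapse to one canonical prediction, while "different" pairs would need a rule for picking a structure or be passed to manual review.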
4
Data types used for gene prediction/validation
- Protein sequences
  - ‘Self’ (i.e. the species to be predicted)
  - Taxonomic splits of UniProtKB
- Transcript sequences
  - mRNAs
  - ESTs
- Evidence of expression
  - Microarray
  - SAGE tags
  - Ditags
  - MPSS
  - Proteomics data
- Sequence statistics
  - Coding potential
  - Splice site prediction

Note on MPSS (Massively Parallel Signature Sequencing): short DNA fragments are immobilized on millions of beads, and these oligomers are then sequenced in parallel (i.e. fast). It is claimed to count 1 million mRNAs at a time.
5
VectorBase gene prediction pipeline
Prediction sources and the tools used (from the slide diagram):
- Blessed predictions
  - Manual annotations (Apollo)
  - Community submissions (Genewise, Exonerate, Apollo)
- Similarity predictions (Genewise)
- Species-specific predictions (Genewise)
- Protein family HMMs (Genewise)
- ncRNA predictions (Rfam)
- Transcript based predictions (Exonerate)
- Ab initio gene predictions (SNAP)
Output: Canonical predictions
6
VectorBase curation database pipeline for manual/community annotation
Diagram components:
- Community annotation (community representatives; in collaboration with Harvard): Apollo, Chado-XML
- Manual annotation (Harvard): Apollo, Chado-XML
- Curation warehouse db (Chado)
- GFF3
- Ensembl gene build db
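The diagram shows GFF3 next to the Ensembl gene build db, which suggests curated gene models are handed over in that format. As an illustration only, the sketch below writes one gene model as GFF3 lines using the standard nine tab-separated columns. The function names, the `example_source` value, the `-RA` transcript suffix and the coordinates are hypothetical and do not come from the slides.

```python
from typing import Dict, List, Tuple


def gff3_line(seqid: str, source: str, ftype: str, start: int, end: int,
              strand: str, attrs: Dict[str, str]) -> str:
    """Format one GFF3 feature line (9 tab-separated columns)."""
    attr_text = ";".join(f"{key}={value}" for key, value in attrs.items())
    # Score and phase are "." because this sketch emits no scored or CDS features.
    return "\t".join([seqid, source, ftype, str(start), str(end),
                      ".", strand, ".", attr_text])


def gene_to_gff3(seqid: str, gene_id: str, strand: str,
                 exons: List[Tuple[int, int]],
                 source: str = "example_source") -> List[str]:
    """Return GFF3 lines for a gene with a single transcript."""
    start = min(s for s, _ in exons)
    end = max(e for _, e in exons)
    mrna_id = f"{gene_id}-RA"  # hypothetical transcript naming scheme
    lines = ["##gff-version 3"]
    lines.append(gff3_line(seqid, source, "gene", start, end, strand,
                           {"ID": gene_id}))
    lines.append(gff3_line(seqid, source, "mRNA", start, end, strand,
                           {"ID": mrna_id, "Parent": gene_id}))
    for exon_start, exon_end in sorted(exons):
        lines.append(gff3_line(seqid, source, "exon", exon_start, exon_end,
                               strand, {"Parent": mrna_id}))
    return lines


if __name__ == "__main__":
    print("\n".join(gene_to_gff3("chr1", "GENE0001", "+",
                                 [(100, 200), (300, 400)])))
```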
7
Overview of current re-annotation system
Diagram components:
- Full gene build
- Partial gene build
- New gene build
- Blessed genes
- Species-specific gene prediction
- Current gene set
- Compare
- Merge
- Updated gene set
8
Comparing new gene builds with the old one
- Use of manual annotation for validation of automated gene build improvements
- Simple statistics (CDS length, intron size, CDS matching TEs)
- BRC annotation metrics
- Supporting evidence for a gene prediction (citation, expression, orthology)
- Attachment of Standard Operating Procedures (SOPs)
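The "simple statistics" item can be sketched in a few lines of Python. This is a minimal illustration: the slides do not say which summary values are used, so the median is an arbitrary choice. Each gene is assumed to be a list of coding segments with 1-based inclusive coordinates, and all names are hypothetical.

```python
from statistics import median
from typing import Dict, List, Optional, Tuple

Segment = Tuple[int, int]  # (start, end), 1-based inclusive


def cds_length(cds: List[Segment]) -> int:
    """Total coding length in bases."""
    return sum(end - start + 1 for start, end in cds)


def intron_sizes(cds: List[Segment]) -> List[int]:
    """Sizes of the gaps between consecutive coding segments."""
    ordered = sorted(cds)
    return [b_start - a_end - 1
            for (_, a_end), (b_start, _) in zip(ordered, ordered[1:])]


def summarise(gene_set: Dict[str, List[Segment]]) -> Dict[str, Optional[float]]:
    """Summary statistics for one gene build."""
    lengths = [cds_length(cds) for cds in gene_set.values()]
    introns = [size for cds in gene_set.values() for size in intron_sizes(cds)]
    return {
        "genes": len(gene_set),
        "median_cds_length": median(lengths) if lengths else None,
        "median_intron_size": median(introns) if introns else None,
    }


if __name__ == "__main__":
    old_build = {"g1": [(1, 90), (191, 300)], "g2": [(1, 300)]}
    new_build = {"g1": [(1, 90), (191, 300)], "g2": [(1, 150), (201, 350)]}
    print("old:", summarise(old_build))
    print("new:", summarise(new_build))
```

Running the same summary on the old and new builds gives a quick side-by-side check before looking at individual loci.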
9
Gene build schedules
Diagram labels:
- Full gene build: 4 months
- Partial gene build: 1 month
- Triggers for re-annotation
  - Temporal
  - Data: new EST data for species, new genomes, re-annotated genomes

Notes:
- Temporal: it's been a while (six-monthly, yearly).
- Data:
  - New transcript data = better predictions
  - New genomes within the clade, comparative approaches
  - Re-annotated genomes, e.g. new ESTs in Aedes influence prediction changes which affect Anopheles entries; GPCRs in nematodes
10
VectorBase annotation capacity with increased number of genomes
Gene builds per year per genome

             2 full     1 full      1 full      1 full
             partial    3 partial   2 partial   1 partial
2 genomes    Yes        Yes         Yes         Yes
3 genomes    Yes        Yes         Yes         Yes
4 genomes    No         Yes         Yes         Yes
5 genomes    No         Yes         Yes         Yes
6 genomes    No         No          Yes         Yes
7 genomes    No         No          No          Yes
8 genomes    No         No          No          No
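The arithmetic behind such a capacity table can be sketched as follows. Two things here are assumptions, not statements from the slides. First, the "4 months" and "1 month" labels on the schedule slide are read as the effort for a full and a partial gene build. Second, the annual capacity of 36 months of annotation effort is a hypothetical value, chosen because it reproduces the Yes/No pattern in the three columns where both build counts are legible.

```python
FULL_BUILD_MONTHS = 4     # assumed reading of "4 months" on the schedule slide
PARTIAL_BUILD_MONTHS = 1  # assumed reading of "1 month" on the schedule slide
CAPACITY_MONTHS = 36      # hypothetical annual capacity; not stated in the slides


def months_needed(genomes: int, full_per_year: int, partial_per_year: int) -> int:
    """Total effort per year for a given number of genomes and schedule."""
    per_genome = (full_per_year * FULL_BUILD_MONTHS
                  + partial_per_year * PARTIAL_BUILD_MONTHS)
    return genomes * per_genome


def feasible(genomes: int, full_per_year: int, partial_per_year: int) -> bool:
    """True if the schedule fits within the assumed annual capacity."""
    return months_needed(genomes, full_per_year, partial_per_year) <= CAPACITY_MONTHS


if __name__ == "__main__":
    schedules = [(1, 3), (1, 2), (1, 1)]  # (full, partial) builds per genome per year
    for genomes in range(2, 9):
        answers = ["Yes" if feasible(genomes, f, p) else "No" for f, p in schedules]
        print(f"{genomes} genomes", *answers)
```

The first column of the table is left out because its partial-build count is not recoverable from the slide text.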
11
Re-annotation questions
- Triggers for re-annotation
  - Strict temporal triggers: always do a full gene build every year?
  - Data triggers: how much new data is enough? Knock-on effects of related species (re)annotation?
- Encouraging community submissions
  - How can we get more community annotation input?
  - Outreach at conferences (Roadshow)