VectorBase BRC VectorBase annotation metrics Daniel Lawson VectorBase-EBI, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton UK
Topics
Annotation metrics
– Numbers (Gene numbers & xrefs)
– Data types (Availability & Integration)
Annotation SOPs
– Genome specific
– Gene specific
– Gene build profile & prediction confidence
                    AaegL1.1          AgamP3.3          Yeast     Worm      Fly       Human
Gene
  Gene count        16,691            13,765            7,098     21,105    14,752    31,206
  Protein-coding    15,419 (92.4 %)   13,277 (96.5 %)   6,680     20,060    14,086    23,245
  other             1,272 (7.6 %)     488 (3.5 %)       418       1,…       …         …,961
Transcript
  Transcript count  18,061            14,…
  Protein-coding    16,789 (93.0 %)   13,639 (96.5 %)   -         -         -         -
  other             1,272 (7.0 %)     488 (3.5 %)       -         -         -         -
Manual effort
  Manually reviewed 0 (0.0 %)         261 (1.9 %)       6,680     20,060    14,086    6,995
  Community input   0 (0.0 %)         667 (4.9 %)       4,684     7,228     9,945     16,887
Orthologs
  Combined          11,487 (74.5 %)   9,782 (73.7 %)    -         -         -         -
  A. aegypti        n/a               8,907 (67.1 %)    2,202     4,416     7,991     6,590
  A. gambiae        9,923 (54.9 %)    n/a               2,228     4,444     7,702     6,612
  C. elegans        4,923 (29.5 %)    4,442 (33.4 %)    2,185     n/a       4,598     6,121
  D. melanogaster   9,078 (50.3 %)    7,649 (57.6 %)    2,228     4,543     n/a       6,654
  H. sapiens        5,510 (33.0 %)    5,046 (38.0 %)    2,326     4,473     5,109     n/a
  S. cerevisiae     2,520 (15.1 %)    2,350 (17.7 %)    n/a       2,349     2,470     3,265
Functional annotation
  GO terms          9,335 (51.7 %)    7,601 (55.7 %)    4,176     11,334    10,226    17,000
  EC numbers        2,950 (16.3 %)    2,230 (16.4 %)    4,103 *   5,240 *   4,009 *   13,245 *
  InterPro          11,536 (74.8 %)   9,869 (72.4 %)    4,611     14,730    10,475    18,199
Expression evidence
  Combined          12,350 (80.0 %)   7,557 (55.4 %)    -         -         -         -
  cDNA/EST          9,270 (60.1 %)    7,557 (55.4 %)    -         -         -         -
  microarray        9,143 (59.2 %)†   0 (0.0 %)‡        -         -         -         -
  MPSS              3,984 (25.8 %)†   n/a               -         -         -         -
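The percentages in the Gene block of the table appear to be each category's share of the total gene count for that assembly. The following is a minimal sketch that reproduces them from the counts shown; the helper name `share` is illustrative and not part of any VectorBase tool.

```python
# Counts copied from the Gene block of the metrics table
# (AaegL1.1 and AgamP3.3 columns only).
GENE_COUNTS = {
    "AaegL1.1": {"total": 16691, "protein_coding": 15419, "other": 1272},
    "AgamP3.3": {"total": 13765, "protein_coding": 13277, "other": 488},
}


def share(part, total):
    """Return part/total as a percentage rounded to one decimal place."""
    return round(100.0 * part / total, 1)


for assembly, counts in GENE_COUNTS.items():
    coding = share(counts["protein_coding"], counts["total"])
    other = share(counts["other"], counts["total"])
    print(f"{assembly}: protein-coding {coding} %, other {other} %")
```

Other blocks of the table may use a different denominator (for example transcripts or protein-coding genes), so the same helper would need the matching total for each row.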
Considerations
Importance of calculating all metrics using similar methodology from the same data set
Metrics calculated from Ensembl using BioMart & raw SQL queries
GO terms: many ways of calculating (InterPro2GO, projection from Drosophila orthologs)
No VectorBase capability to automatically assign EC numbers
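To illustrate the "raw SQL" route, here is a hedged sketch of counting genes per biotype with one GROUP BY query. It runs against a toy in-memory SQLite table that stands in for an Ensembl-style `gene` table with a `biotype` column; the rows are invented for illustration, and the actual queries used for the slides are not given in the source.

```python
import sqlite3

# Toy stand-in for an Ensembl-style core database: a `gene` table with a
# `biotype` column. The rows below are made up for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, biotype TEXT)")
conn.executemany(
    "INSERT INTO gene (biotype) VALUES (?)",
    [("protein_coding",)] * 8 + [("tRNA",)] * 1 + [("rRNA",)] * 1,
)


def biotype_counts(connection):
    """Count genes per biotype with a single GROUP BY query."""
    rows = connection.execute(
        "SELECT biotype, COUNT(*) FROM gene GROUP BY biotype"
    ).fetchall()
    return dict(rows)


counts = biotype_counts(conn)
total = sum(counts.values())
protein_coding = counts.get("protein_coding", 0)
other = total - protein_coding  # everything that is not protein-coding
print(total, protein_coding, other)
```

Running the same query unchanged against each species' database is one way to keep the methodology comparable across genomes, which is the first consideration listed above.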
                    AaegL1.1                                      AgamP3.3
Data type           Available  Integration                        Available  Integration
Sequence            Yes        Download, search, visualization    Yes        Download, search, visualization
Polymorphisms       No         n/a                                Yes        Search, visualization
Genetic maps        Yes        Not integrated                     Yes        Visualization
Syntenic alignment  Yes        Visualization                      Yes        Visualization
cDNAs & ESTs        Yes        Download, search, visualization    Yes        Download, search, visualization
SAGE tags           No         n/a                                No         n/a
Microarrays         Yes        Visualization                      Yes        Visualization
MPSS                Yes        Not integrated                     No         n/a
Proteomics          No         n/a                                No         n/a
Structures          No         n/a                                No         n/a
Interactome data    No         n/a                                No         n/a
Pathways            No         n/a                                No         n/a
Orthology profiles  Yes        Visualization                      Yes        Visualization
Essentiality data   No         n/a                                No         n/a
VectorBase gene prediction pipeline (SOP)
[Diagram] Components: Blessed predictions; Community submissions; Manual annotations; Species-specific predictions; Similarity predictions; Transcript based predictions; Ab initio gene predictions; Protein family HMMs; ncRNA predictions; Canonical Gene set
[Diagram] SOP labels: VB:SOP001; VB:SOP002 & SOP003; VB:SOP005; VB:SOP004; VB:SOP009; VB:SOP008; VB:SOP007; VB:SOP010
Assignment of SOPs to VectorBase genes: AgamP3.3
SOP                                                 No. genes
VB:SOP001  Confirmed                                674
VB:SOP002  Protein-based with transcript support    3765
VB:SOP003  Protein-based                            4830
VB:SOP004  Transcript-based                         2857
VB:SOP005  Supported ab initio                      585
VB:SOP006  ab initio                                0
VB:SOP007  Manual annotation                        928
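As a consistency check, the per-SOP counts can be summed. The sketch below uses only the numbers from the table; the total of 13,639 equals the AgamP3.3 protein-coding transcript count in the metrics table, which suggests (but does not prove) that the assignments were counted per transcript. The variable names are illustrative.

```python
# Per-SOP counts for AgamP3.3, copied from the table.
SOP_GENES = {
    "VB:SOP001": 674,   # Confirmed
    "VB:SOP002": 3765,  # Protein-based with transcript support
    "VB:SOP003": 4830,  # Protein-based
    "VB:SOP004": 2857,  # Transcript-based
    "VB:SOP005": 585,   # Supported ab initio
    "VB:SOP006": 0,     # ab initio
    "VB:SOP007": 928,   # Manual annotation
}

total = sum(SOP_GENES.values())
print("total:", total)

# Share of each SOP in the total, as a percentage with one decimal place.
for sop, n in SOP_GENES.items():
    print(sop, n, f"{100.0 * n / total:.1f} %")
```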
Display of Metrics & SOPs
Metrics
– VectorBase wiki
– Species page containing the three tables, available from the VectorBase species homepage
– Expansion of documents relating to genomic resources (citations, links to primary data where possible)
– Single collated table for BRC as separate download
SOPs
– VectorBase wiki
– ‘Documents’ section of main site
Manual annotation progress
                    Protein-coding gene No.  VectorBase manual  Community submission
Anopheles gambiae
  AgamP3.3          13,…                     … (2.0 %)          667 (5.0 %)
  current                                    2,474 (18.6 %)     667* (5.0 %)
Aedes aegypti
  AaegL1.1          15,419                   0 (0.0 %)
  current                                    0 (0.0 %)          341 (2.2 %)
Merging gene sets
Reduce to single predictions per locus
Compare exon/intron structures: Gene set #1, Gene set #2
– Identical structures
– Compatible structures
– Different structures
– Merge/Split structures
– Complex
– No Map
Add isoform predictions based on EST/Peptide data
Canonical gene set
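The exon/intron comparison step can be sketched as a pairwise classifier. This is an illustration under stated assumptions, not the VectorBase implementation: the slides do not define the categories, so here "compatible" is assumed to mean an identical intron chain with differing outer exon boundaries, and strand and chromosome are ignored. Merge/Split and Complex cases need one-to-many grouping across a locus and are not covered.

```python
def introns(exons):
    """Return the intron chain implied by a sorted list of (start, end) exons."""
    return [
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(exons, exons[1:])
    ]


def overlaps(a, b):
    """True if the genomic spans of two sorted exon lists overlap."""
    return a[0][0] <= b[-1][1] and b[0][0] <= a[-1][1]


def classify(model_a, model_b):
    """Classify two gene models (lists of inclusive (start, end) exons).

    Category definitions are assumptions made for this sketch.
    """
    a, b = sorted(model_a), sorted(model_b)
    if not overlaps(a, b):
        return "no map"
    if a == b:
        return "identical"
    if introns(a) == introns(b):
        # Same splice sites; only the outer exon boundaries differ.
        return "compatible"
    return "different"


set1 = [(100, 200), (300, 400)]
print(classify(set1, [(100, 200), (300, 400)]))
print(classify(set1, [(90, 200), (300, 450)]))
print(classify(set1, [(100, 200), (320, 400)]))
print(classify(set1, [(1000, 1100)]))
```

Comparing intron chains rather than whole exons keeps predictions that differ only in their terminal exon extent from being treated as conflicting models, which is one plausible reading of "compatible".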