European Bioinformatics Institute: The Gene Ontology Annotation (GOA) Database and enhancement of GO annotations through InterPro2GO. Nicky Mulder
Contents: Introduction to GOA; Manual GOA annotation; Electronic annotation (InterPro2GO); GOA data flow; Uses of GOA; Future plans
What is GO annotation? An annotation is a statement that a gene product has a particular molecular function, is involved in a particular biological process, or is located within a certain cellular component (GO term ID), as determined by a particular method (evidence code) and as described in a particular reference.
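The definition above can be pictured as a small record with one field per element of the statement. This is only an illustrative sketch: the class name `GoAnnotation`, its field names and the sample values are hypothetical and are not taken from GOA's own schema.

```python
from dataclasses import dataclass


# Hypothetical record type; field names are illustrative, not GOA's schema.
@dataclass(frozen=True)
class GoAnnotation:
    gene_product: str   # identifier of the annotated gene product
    go_id: str          # GO term ID (function, process or component)
    evidence_code: str  # how the association was determined, e.g. "IEA"
    reference: str      # where the association is described


def describe(a: GoAnnotation) -> str:
    """Render the annotation in the wording of the definition."""
    return (f"{a.gene_product} is annotated to {a.go_id}, "
            f"as determined by {a.evidence_code}, "
            f"as described in {a.reference}")


# Placeholder values only; none of these are real identifiers.
example = GoAnnotation("PROTEIN_X", "GO:0000000", "IEA", "REF:0000000")
print(describe(example))
```

Keeping the record immutable (`frozen=True`) reflects that an annotation is a statement tied to its evidence and reference; changing any one of them yields a different annotation.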
Gene Ontology Annotation (GOA) Database: GOA’s priority is to annotate the human, mouse and rat proteomes; it is the largest open-source contributor of annotations to GO; it provides 10 million annotations for more than 111,000 species; it shares and integrates GO annotation.
How do we annotate GO terms? Manual annotation and electronic annotation. All annotations must be attributed to a source and must indicate what evidence was found to support the GO term-gene/protein association.
Manual annotation: high quality; specific gene or gene product associations made using peer-reviewed papers and evidence codes. BUT: time-consuming; requires trained biologists.
Manual GO annotation: read papers, find GO term, annotate to protein (PubMed ID, evidence code). Components shown: Oracle RDBMS; GOA association file; GO and EBI FTP sites.
Protein2GO tool Online
Information captured by GOA: Source | GO ID | Term | Evid | Ref DB | Ref ID | With DB | With ID | Qualifier
How successful is manual GOA? Source | No. of annotations | No. of distinct proteins: Proteome Inc; UniProt; IntAct; MGI; SGD; FlyBase; RGD; HGNC; GeneDB; TAIR/TIGR; ZFIN; Roslin Institute 146; AgBase; Reactome 1512; WormBase; TIGR 13979; Gramene; GDB; TOTAL MANUAL (July), taxa
Electronic annotation: large-scale assignment of GO terms to UniProtKB entries using existing information within database entries and manual mappings; these annotations get the IEA evidence code. Curated mapping, e.g. EC: > GO:alcohol dehydrogenase activity ; GO: Sources mapped to GO: UniProt keyword, HAMAP, InterPro, EC. Curated or electronic rule-based mappings give high-quality electronic protein-to-GO associations.
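As a minimal sketch of how such a mapping might be applied, the code below looks up each cross-reference of a protein in a mapping table and emits one IEA annotation per mapped GO term. The function name, the table layout and every identifier in it are placeholders assumed for illustration; this is not GOA's pipeline.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping table: external concept -> GO term IDs.
# Keys and GO IDs are placeholders, not real mapping entries.
MAPPING: Dict[str, List[str]] = {
    "EC:x.x.x.x": ["GO:0000001"],
    "IPR_EXAMPLE": ["GO:0000002", "GO:0000003"],
}


def electronic_annotations(
    protein_xrefs: Dict[str, List[str]],
    mapping: Dict[str, List[str]],
) -> List[Tuple[str, str, str, str]]:
    """Return sorted, de-duplicated (protein, GO ID, evidence, source) tuples."""
    result = set()
    for protein, xrefs in protein_xrefs.items():
        for xref in xrefs:
            # Cross-references without a mapping contribute nothing.
            for go_id in mapping.get(xref, []):
                result.add((protein, go_id, "IEA", xref))
    return sorted(result)


# Placeholder proteins and cross-references.
proteins = {"P1": ["EC:x.x.x.x", "IPR_EXAMPLE"], "P2": ["UNMAPPED"]}
for row in electronic_annotations(proteins, MAPPING):
    print(row)
```

Recording the cross-reference that triggered each annotation keeps the source attribution that all annotations are required to carry.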
Mappings of external concepts to GO
InterPro2GO mapping: InterPro is a resource that integrates protein signature databases, e.g. Pfam, Prints, Prosite, ProDom, SMART, TIGRFAMs etc. It provides a means of classifying proteins into families and identifying domains. Each InterPro entry groups proteins that belong to the same family and potentially have the same function.
InterPro2GO mapping: done manually, but using tools. Look at InterPro and protein annotation. For all Swiss-Prot proteins that are true matches to the entry: get statistics on DE lines, keywords and comments; check how conserved the common annotation is; find the appropriate GO term at the most specific level that applies to all proteins (not necessarily domains).
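The "how conserved is the common annotation" step can be sketched as a frequency count over the proteins matching an entry. This is an assumed illustration, not the curators' tool: the function names and the keyword data are invented for the example.

```python
from collections import Counter
from typing import Dict, List


def keyword_conservation(protein_keywords: Dict[str, List[str]]) -> Dict[str, float]:
    """Fraction of proteins carrying each keyword (1.0 means all of them)."""
    total = len(protein_keywords)
    if total == 0:
        return {}
    counts: Counter = Counter()
    for keywords in protein_keywords.values():
        counts.update(set(keywords))  # count each keyword once per protein
    return {kw: n / total for kw, n in counts.items()}


def fully_conserved(protein_keywords: Dict[str, List[str]]) -> List[str]:
    """Keywords shared by every protein: candidates for a GO mapping."""
    stats = keyword_conservation(protein_keywords)
    return sorted(kw for kw, frac in stats.items() if frac == 1.0)


# Invented example: three proteins matching one hypothetical entry.
matches = {
    "P1": ["Kinase", "Transferase"],
    "P2": ["Kinase"],
    "P3": ["Kinase", "Membrane"],
}
print(fully_conserved(matches))
```

Only an annotation shared by every true match supports a GO term that "applies to all proteins"; partially conserved keywords suggest the term would be too specific for the entry.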
Tools used: “SQUID”. Statistics options: keyword, description, gene name, organism, comments, etc.
SQUID statistics output
InterPro2GO mapping in entry
InterProScan output with GO terms
InterPro2GO sanity checks: run weekly. Reports: obsolete GO terms; obsolete (deleted) IPRs; secondary IPRs.
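A check of this kind can be sketched as a pass over the mapping table against three reference sets. The function, the report keys and all identifiers below are hypothetical; how the real weekly checks obtain the lists of obsolete terms and deleted or secondary entries is not described here.

```python
from typing import Dict, List, Set


def sanity_report(
    mapping: Dict[str, List[str]],
    obsolete_go: Set[str],
    deleted_iprs: Set[str],
    secondary_iprs: Set[str],
) -> Dict[str, list]:
    """Collect mapping rows that point at stale GO terms or stale entries."""
    report: Dict[str, list] = {
        "obsolete_go_terms": [],
        "deleted_iprs": [],
        "secondary_iprs": [],
    }
    for ipr, go_ids in sorted(mapping.items()):
        if ipr in deleted_iprs:
            report["deleted_iprs"].append(ipr)
        if ipr in secondary_iprs:
            report["secondary_iprs"].append(ipr)
        for go_id in go_ids:
            if go_id in obsolete_go:
                report["obsolete_go_terms"].append((ipr, go_id))
    return report


# Placeholder identifiers only.
mapping = {"IPR_A": ["GO:1", "GO:2"], "IPR_B": ["GO:3"], "IPR_C": ["GO:1"]}
print(sanity_report(mapping, {"GO:2"}, {"IPR_B"}, {"IPR_C"}))
```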
Quality of GO mapping
Manually checked 44 proteins, 107 predictions: 97 correct (90%), of which 40 exact and 57 same lineage; 10 new lineage (unknown); 0 incorrect.
BioCreAtIvE test set: 635 GO annotations through InterPro2GO (Camon et al., 2005, BMC Bioinformatics)
Precision                        No.    %
Exact term                       151   24%
Same lineage < granularity       273   43%
Same lineage > granularity        24    4%
New lineage                      187   29%
Minimal correct                  424   67%
Potentially incorrect            211   33%
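The percentages in the BioCreAtIvE table can be rechecked from the counts. In the sketch below only the counts come from the slide; reading "Minimal correct" as the sum of the first two rows and "Potentially incorrect" as the sum of the last two is an assumption, though it reproduces the reported totals. For the manual check, 97 of 107 works out to about 90.7%, which the slide reports as 90%.

```python
# Counts copied from the BioCreAtIvE table; everything else is derived.
counts = {
    "Exact term": 151,
    "Same lineage < granularity": 273,
    "Same lineage > granularity": 24,
    "New lineage": 187,
}
total = sum(counts.values())


def pct(n: int, whole: int) -> int:
    """Percentage rounded to the nearest integer."""
    return round(100 * n / whole)


percentages = {name: pct(n, total) for name, n in counts.items()}

# Assumption: the two summary rows are sums of the category rows.
minimal_correct = counts["Exact term"] + counts["Same lineage < granularity"]
potentially_incorrect = counts["Same lineage > granularity"] + counts["New lineage"]

# Manual check: 40 exact + 57 same lineage, plus 10 new lineage and 0 incorrect.
manual_correct = 40 + 57
manual_total = manual_correct + 10 + 0

print(total, percentages)
print(minimal_correct, pct(minimal_correct, total))
print(potentially_incorrect, pct(potentially_incorrect, total))
print(manual_correct, manual_total)
```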
InterPro2GO mapping statistics
Total no. IPRs mapped to GO                        7126
% of IPRs mapped to at least 1 GO term             54%
No. IPRs mapped to molecular function              5741
No. IPRs mapped to biological process              5543
No. IPRs mapped to cellular component              3426
No. GO terms mapped                                2811
No. UniProt proteins mapped through interpro2go    (61%)
% UniProt covered by InterPro                      77.6%
How successful is IEA-GOA in general? Provides large coverage and high quality; however, these annotations often use high-level GO terms and provide little detail. IEA method | No. of annotations | No. of distinct proteins: InterPro2GO; HAMAP2GO; SP Keyword2GO; EC2GO; TOTAL (Jun 2006). Manual ones:
Total GO statistics
Total no. GO annotations
% GO associations manual                       3.16%
% GO associations electronic                   96.84%
% GO associations interpro2GO                  59%
Total no. proteins annotated to GO
% UniProt GO annotated in total                68.2%
% UniProt GO annotated manually                2.2%
% UniProt GO annotated electronically          66%
% UniProt GO annotated through interpro2go     61%
GOA data flow: gene association files
Gene Association file format
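A gene association file can be read with a few lines of Python. The sketch below assumes the tab-separated, 15-column GAF 1.0 layout, with comment lines starting with `!`. The lowercase column names are my own spelling of that layout, the function name is hypothetical, and the sample record is invented rather than taken from a GOA file.

```python
from typing import Dict, Iterable, Iterator

# Assumed column order (GAF 1.0); names are lowercased for use as dict keys.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type",
    "taxon", "date", "assigned_by",
]


def parse_gaf(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per annotation line, skipping comments and blanks."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue
        fields = line.split("\t")
        # Pad short lines so every column name gets a value.
        fields += [""] * (len(GAF_COLUMNS) - len(fields))
        yield dict(zip(GAF_COLUMNS, fields))


# Invented sample record; the identifiers are placeholders.
sample = [
    "!comment line\n",
    "\t".join([
        "UniProt", "P00000", "GENE1", "", "GO:0000000", "PMID:0000000",
        "IEA", "InterPro:IPR000000", "F", "Example protein", "",
        "protein", "taxon:9913", "20060101", "UniProt",
    ]) + "\n",
]
for record in parse_gaf(sample):
    print(record["db_object_id"], record["go_id"], record["evidence_code"])
```

Several of these columns line up with the fields listed under "Information captured by GOA": GO ID, evidence, reference, "with" and qualifier.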
Example GOA cow file
Output from the GOA database: redundant and non-redundant (based on IPI) gene association files; GOA Cow (new); GA slim for UniProt + GO slims. Data also available in SRS, UniProt, QuickGO, MODs, Ensembl etc.
GA files for non-redundant species: a non-redundant complete protein set is identified for each proteome (>25% GO coverage). Includes UniProt, IPI and MOD-specific IDs, e.g. mouse (MGI), rat (RGD), zebrafish (ZFIN) etc. Xref files are available with identifiers from UniProt, IPI, RefSeq, Ensembl, UniGene etc. FTP sites: ftp://ftp.ebi.ac.uk/pub/databases/GO/goa and ftp://ftp.ebi.ac.uk/pub/databases/integr8
Uses of GOA data: access protein functional information; look at relationships between proteins, e.g. IntAct; connect biological information to gene expression data; determine the functional composition of a proteome using GO slim.
Uses of GOA: find functional information on proteins
Uses of GOA: find functional information on interacting proteins (IntAct)
Uses of GOA: overview of a proteome with GO slim
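A proteome overview of this kind amounts to rolling each protein's GO terms up to a small set of high-level slim terms and counting proteins per slim term. The sketch below is a simplified illustration on a toy hierarchy: the term IDs, the `PARENTS` table and the function names are all made up, and it counts every slim ancestor rather than only the most specific one, which real slim-mapping tools may handle differently.

```python
from collections import Counter
from typing import Dict, List, Set

# Toy hierarchy (child -> parents) and slim set; IDs are placeholders.
PARENTS: Dict[str, List[str]] = {
    "GO:c": ["GO:b"], "GO:b": ["GO:a"], "GO:d": ["GO:a"],
    "GO:e": ["GO:f"], "GO:a": [], "GO:f": [],
}
SLIM: Set[str] = {"GO:a", "GO:f"}


def ancestors_and_self(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect the term plus everything reachable by following parent links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def slim_composition(
    protein_terms: Dict[str, List[str]],
    parents: Dict[str, List[str]],
    slim: Set[str],
) -> Dict[str, int]:
    """Number of proteins falling under each slim term."""
    counts: Counter = Counter()
    for terms in protein_terms.values():
        hit: Set[str] = set()
        for term in terms:
            hit |= ancestors_and_self(term, parents) & slim
        counts.update(hit)  # each protein counts once per slim term
    return dict(counts)


proteome = {"P1": ["GO:c", "GO:d"], "P2": ["GO:e"], "P3": ["GO:b", "GO:e"]}
print(slim_composition(proteome, PARENTS, SLIM))
```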
Uses of GOA: analysis of high-throughput data according to GO classification. Microarray data analysis; proteomics data analysis (Kislinger T et al., Mol Cell Proteomics, 2003; Larkin JE et al., Physiol Genomics, 2004; Cunliffe HE et al., Cancer Res, 2003).
Future plans: continue deep-level annotation of human, mouse and rat; manually annotate splice variants; outreach and inclusion of new datasets, e.g. grape; new electronic mappings, e.g. unipathway2go; ortholog prediction for electronic GO annotation; develop tools for annotation training.
Acknowledgements: Evelyn Camon (GOA Coordinator); Daniel Barrell (GOA Programmer); Emily Dimmer (GOA Curator); Rachael Huntley (GOA Curator); David Binns & John Maslen (QuickGO, GOA tools); Rolf Apweiler (Head of sequence database group); all EBI UniProtKB curators, HAMAP (SIB), IntAct, GO Editorial EBI; all GO Consortium & associate members.