European Bioinformatics Institute The Gene Ontology Annotation (GOA) Database and enhancement of GO annotations through InterPro2GO Nicky Mulder

Presentation transcript:

Contents
- Introduction to GOA
- Manual GOA annotation
- Electronic annotation:
  - InterPro2GO
- GOA data flow
- Uses of GOA
- Future plans

What is GO annotation?
An annotation is a statement that a gene product:
- has a particular molecular function
- is involved in a particular biological process
- is located within a certain cellular component
…as determined by a particular method
…as described in a particular reference.
Each annotation therefore records a GO term ID, an evidence code and a reference.
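
The definition above can be read as a small record type. The following is a minimal sketch, assuming a hypothetical `GOAnnotation` class and placeholder identifiers (`P00000`, `GO:0000000`, `PMID:0000000`); none of these names come from the slides.

```python
from dataclasses import dataclass

# The three ontologies named on the slide.
ASPECTS = {"molecular_function", "biological_process", "cellular_component"}


@dataclass(frozen=True)
class GOAnnotation:
    """Hypothetical record mirroring the slide's definition of an annotation."""
    gene_product: str   # e.g. a protein accession (placeholder below)
    go_id: str          # GO term ID
    aspect: str         # which ontology the term belongs to
    evidence_code: str  # how the association was determined
    reference: str      # where the association was described

    def __post_init__(self):
        # An annotation without evidence or a reference is rejected.
        if self.aspect not in ASPECTS:
            raise ValueError(f"unknown aspect: {self.aspect!r}")
        if not self.evidence_code or not self.reference:
            raise ValueError("an annotation needs an evidence code and a reference")


def describe(a: GOAnnotation) -> str:
    """Render an annotation as a one-line statement."""
    return f"{a.gene_product} -> {a.go_id} [{a.evidence_code}; {a.reference}]"


# Placeholder values for illustration only.
example = GOAnnotation("P00000", "GO:0000000", "molecular_function",
                       "IDA", "PMID:0000000")
print(describe(example))
```

Making the evidence code and reference mandatory reflects the point that every annotation is tied to a method and a source.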

Gene Ontology Annotation (GOA) Database
- GOA's priority is to annotate the human, mouse and rat proteomes
- Largest open-source contributor of annotations to GO
- Provides 10 million annotations for more than 111,000 species
- Share and integrate GO annotation

How do we annotate GO terms?
- Manual annotation
- Electronic annotation
All annotations must:
- be attributed to a source
- indicate what evidence was found to support the GO term-gene/protein association

Manual annotation
- High quality
- Specific gene or gene product associations made using:
  - Peer-reviewed papers
  - Evidence codes
BUT:
- Time-consuming
- Requires trained biologists

Manual GO annotation
Workflow: Read papers -> Find GO term -> Annotate to protein
Diagram labels: GOA-association file; Oracle RDBMS; PubMed ID, evidence code; GO and EBI ftp sites

Protein2GO tool
Online

Information captured by GOA
Columns: Source, GO ID, Term, Evid, Ref DB, Ref ID, With DB, With ID, Qualifier

How successful is manual GOA?
Columns: Source, No. of annotations, No. of distinct proteins
- Proteome Inc
- UniProt
- IntAct
- MGI
- SGD
- FlyBase
- RGD
- HGNC
- GeneDB
- TAIR/TIGR
- ZFIN
- Roslin Institute 146
- AgBase
- Reactome 1512
- WormBase
- TIGR 13979
- Gramene
- GDB
- TOTAL MANUAL
July taxa

Electronic annotation
- Large-scale assignment of GO terms to UniProtKB entries using existing information within database entries and manual mappings
- Annotations get the IEA evidence code
- Curated mapping, e.g. EC: > GO:alcohol dehydrogenase activity ; GO:
- Mapping sources: UniProt Keyword, HAMAP, InterPro, EC -> GO
- Curated or electronic rule-based mappings
- High-quality electronic protein-to-GO associations
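
The mapping step above can be sketched in a few lines. This is a minimal illustration, assuming mapping lines shaped like the example on the slide (`<external id> > GO:<term name> ; GO:<id>`) and that lines starting with `!` are comments. The `EXDB:` identifiers, the `GO:9000001`-style IDs, the protein names and both function names are hypothetical placeholders, not real database entries.

```python
def parse_mapping(lines):
    """Return {external_id: [(go_id, term_name), ...]} from mapping lines."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):  # skip blanks and comment lines
            continue
        left, right = line.split(" > ", 1)
        name_part, go_id = right.rsplit(" ; ", 1)
        ext_id = left.split()[0]
        term_name = name_part[3:] if name_part.startswith("GO:") else name_part
        mapping.setdefault(ext_id, []).append((go_id.strip(), term_name.strip()))
    return mapping


def electronic_annotations(protein_xrefs, mapping):
    """Yield (protein, go_id, 'IEA', source_xref) for every mapped cross-reference."""
    for protein, xrefs in protein_xrefs.items():
        seen = set()  # avoid emitting the same GO term twice for one protein
        for xref in xrefs:
            for go_id, _name in mapping.get(xref, []):
                if go_id not in seen:
                    seen.add(go_id)
                    yield (protein, go_id, "IEA", xref)


# Hypothetical mapping lines and cross-references, for illustration only.
SAMPLE = [
    "! illustrative lines only",
    "EXDB:0001 > GO:example activity ; GO:9000001",
    "EXDB:0002 > GO:example activity ; GO:9000001",
    "EXDB:0002 > GO:example process ; GO:9000002",
]
XREFS = {"PROT_A": ["EXDB:0001", "EXDB:0002"], "PROT_B": ["EXDB:0003"]}

for row in electronic_annotations(XREFS, parse_mapping(SAMPLE)):
    print(row)
```

Each emitted tuple keeps the cross-reference that triggered it, so the electronic annotation stays attributable to its source.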

Mappings of external concepts to GO

InterPro2GO mapping
- InterPro is a resource that integrates protein signature databases, e.g. Pfam, PRINTS, PROSITE, ProDom, SMART, TIGRFAMs etc.
- It provides a means of classifying proteins into families and identifying domains.
- Each InterPro entry groups proteins belonging to the same family and potentially having the same function.

InterPro2GO mapping
- Done manually, but using tools
- Look at InterPro and protein annotation
- For all Swiss-Prot proteins that truly match the entry:
  - Get stats on DE lines, keywords, comments
  - Check how conserved common annotation is
  - Find appropriate GO term at most specific level that applies to all proteins (not necessarily domains)
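
The last step, finding the most specific term that applies to all matching proteins, can be illustrated on a toy hierarchy. This is a sketch only: the slides describe a manual, tool-assisted curation process, not an algorithm, and the simplified `PARENTS` hierarchy, the protein names and the function names below are assumptions made for the example.

```python
# Simplified child -> parents hierarchy (an assumption for this example).
PARENTS = {
    "molecular_function": [],
    "binding": ["molecular_function"],
    "catalytic activity": ["molecular_function"],
    "kinase activity": ["catalytic activity"],
    "protein kinase activity": ["kinase activity"],
    "lipid kinase activity": ["kinase activity"],
}


def ancestors(term, parents):
    """Return the term itself plus every term reachable via parent links."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, []))
    return seen


def most_specific_common_terms(protein_terms, parents):
    """Terms shared by every protein, minus any ancestor of another shared term."""
    closures = []
    for terms in protein_terms.values():
        closure = set()
        for t in terms:
            closure |= ancestors(t, parents)
        closures.append(closure)
    if not closures:
        return set()
    shared = set.intersection(*closures)
    redundant = set()
    for t in shared:
        redundant |= ancestors(t, parents) - {t}
    return shared - redundant


# Hypothetical proteins matching one entry, with their existing terms.
MATCHES = {
    "PROT_1": ["protein kinase activity"],
    "PROT_2": ["lipid kinase activity"],
    "PROT_3": ["protein kinase activity", "binding"],
}
print(most_specific_common_terms(MATCHES, PARENTS))
```

In this toy case the proteins disagree at the most specific level, so the shared term is the more general "kinase activity". This is the same effect the deck notes later, that electronic annotations often use high-level terms.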

Tools used
- "SQUID"
Statistics options:
- Keyword
- Description
- Gene name
- Organism
- Comments, etc.

SQUID statistics output

SQUID statistics output

InterPro2GO mapping in entry

InterProScan output with GO terms

InterPro2GO sanity checks
- Run weekly
- Reports:
  - Obsolete GO terms
  - Obsolete (deleted) IPRs
  - Secondary IPRs
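
A check of this kind could look roughly like the following. It is a hedged sketch, not the GOA implementation: the function name `sanity_report`, the input shapes and all identifiers (`IPR_A`, `GO:1`, ...) are hypothetical placeholders.

```python
def sanity_report(mappings, obsolete_go, deleted_iprs, secondary_iprs):
    """Group problem mappings under the three report headings from the slide.

    mappings       -- iterable of (ipr_id, go_id) pairs
    obsolete_go    -- set of GO IDs that have been made obsolete
    deleted_iprs   -- set of InterPro IDs that no longer exist
    secondary_iprs -- dict of secondary InterPro ID -> primary ID
    """
    report = {"obsolete_go_terms": set(), "deleted_iprs": set(), "secondary_iprs": set()}
    for ipr, go in mappings:
        if go in obsolete_go:
            report["obsolete_go_terms"].add((ipr, go))
        if ipr in deleted_iprs:
            report["deleted_iprs"].add(ipr)
        if ipr in secondary_iprs:
            report["secondary_iprs"].add((ipr, secondary_iprs[ipr]))
    # Sorted lists make the report stable from one run to the next.
    return {heading: sorted(items) for heading, items in report.items()}


# Placeholder data for illustration only.
MAPPINGS = [("IPR_A", "GO:1"), ("IPR_B", "GO:2"), ("IPR_C", "GO:1"), ("IPR_D", "GO:3")]
REPORT = sanity_report(
    MAPPINGS,
    obsolete_go={"GO:2"},
    deleted_iprs={"IPR_C"},
    secondary_iprs={"IPR_D": "IPR_A"},
)
for heading, items in REPORT.items():
    print(heading, items)
```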

Quality of GO mapping
BioCreAtIvE test set: 635 GO annotations through InterPro2GO (Camon et al., 2005, BMC Bioinformatics)
- Exact term: 151 (24%)
- Same lineage < granularity: 273 (43%)
- Same lineage > granularity: 24 (4%)
- New lineage: 187 (29%)
- Minimal correct: 424 (67%)
- Potentially incorrect: 211 (33%)
- Precision %
Manually checked 44 proteins, 107 predictions:
- 97 correct (90%):
  - 40 exact
  - 57 same lineage
- 10 new lineage (unknown)
- 0 incorrect
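
The figures in the BioCreAtIvE table are internally consistent, which the short check below reproduces. The counts are copied from the slide. Reading "Minimal correct" as the sum of the "Exact term" and "Same lineage < granularity" rows is an inference from the arithmetic, not something the slide states.

```python
# Counts copied from the BioCreAtIvE table on the slide.
counts = {
    "exact term": 151,
    "same lineage < granularity": 273,
    "same lineage > granularity": 24,
    "new lineage": 187,
}
total = sum(counts.values())  # should equal the 635 annotations in the test set


def pct(n, whole):
    """Percentage of `whole`, rounded to the nearest whole number."""
    return round(100 * n / whole)


percentages = {label: pct(n, total) for label, n in counts.items()}

# Inferred grouping: the first two rows together give "Minimal correct".
minimal_correct = counts["exact term"] + counts["same lineage < granularity"]
potentially_incorrect = total - minimal_correct

# Manual check reported on the same slide: 97 of 107 predictions correct.
manual_rate = 100 * 97 / 107

print(total, percentages)
print(minimal_correct, pct(minimal_correct, total))
print(potentially_incorrect, pct(potentially_incorrect, total))
print(manual_rate)
```

The four categories sum to 635, and the two derived rows come out at 424 and 211, matching the slide. The manual check works out to just under 91%, which the slide reports as 90%.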

InterPro2GO mapping statistics
- Total no. IPRs mapped to GO: 7126
- % of IPRs mapped to at least 1 GO term: 54%
- No. IPRs mapped to molecular function: 5741
- No. IPRs mapped to biological process: 5543
- No. IPRs mapped to cellular component: 3426
- No. GO terms mapped: 2811
- No. UniProt proteins mapped through interpro2go: (61%)
- % UniProt covered by InterPro: 77.6%

How successful is IEA-GOA in general?
- Provides large coverage
- High quality
- However, these annotations often use high-level GO terms and provide little detail.
Columns: IEA Method, No. of annotations, No. of distinct proteins
- InterPro2GO
- HAMAP2GO
- SP Keyword2GO
- EC2GO
- TOTAL
Jun 2006
Manual ones:

Total GO statistics
- Total no. GO annotations:
- % GO associations manual: 3.16%
- % GO associations electronic: 96.84%
- % GO associations interpro2GO: 59%
- Total no. proteins annotated to GO:
- % UniProt GO annotated in total: 68.2%
- % UniProt GO annotated manually: 2.2%
- % UniProt GO annotated electronically: 66%
- % UniProt GO annotated through interpro2go: 61%

GOA data flow
Gene association files

Gene Association file format
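
The transcript does not preserve the column layout shown on this slide. As a rough sketch, the parser below assumes the 15 tab-separated columns of the GO Consortium's gene association format (GAF 1.0), with `!` marking comment lines. The sample line is invented: the accession, symbol, GO ID and PubMed ID are placeholders, and only `taxon:9913` (cow) is a real identifier.

```python
# Column names for the assumed 15-column gene association layout.
GAF_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence", "with_from", "aspect", "db_object_name",
    "synonym", "db_object_type", "taxon", "date", "assigned_by",
]


def parse_gaf_line(line):
    """Return a dict for one annotation line, or None for comments and blanks."""
    if not line.strip() or line.startswith("!"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(GAF_COLUMNS):
        raise ValueError(f"expected {len(GAF_COLUMNS)} columns, got {len(fields)}")
    # Any extra trailing columns are ignored by zip().
    return dict(zip(GAF_COLUMNS, fields))


# Invented sample line with placeholder identifiers.
SAMPLE_LINE = "\t".join([
    "UniProt", "P00000", "EXAMPLE", "", "GO:0000000", "PMID:0000000", "IDA",
    "", "F", "Example protein", "", "protein", "taxon:9913", "20060101", "UniProt",
])

record = parse_gaf_line(SAMPLE_LINE)
print(record["db_object_id"], record["go_id"], record["evidence"], record["taxon"])
```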

Example GOA cow file

Output from the GOA database
Diagram labels: GOA Cow; New; Redundant; Non-redundant: based on IPI
- Data also available in SRS, UniProt, QuickGO, MODs, Ensembl etc.
- GA slim for UniProt + GO slims

GA files for non-redundant species
- Non-redundant complete protein set for each proteome is identified (>25% GO coverage)
- Includes UniProt, IPI and MOD-specific IDs, e.g. mouse (MGI), rat (RGD), zebrafish (ZFIN) etc.
- Xref files available with identifiers from: UniProt, IPI, RefSeq, Ensembl, UniGene etc.
- ftp://ftp.ebi.ac.uk/pub/databases/GO/goa
- ftp://ftp.ebi.ac.uk/pub/databases/integr8

Uses of GOA data
- Access protein functional information
- Look at relationships between proteins, e.g. IntAct
- Connect biological information to gene expression data
- Determine functional composition of a proteome
  - using GO slim
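
The last use, summarising a proteome with a GO slim, amounts to rolling each protein's terms up to a small set of high-level categories and counting. The sketch below is a simplification under stated assumptions: the toy hierarchy, the two-term slim, the protein names and the function names are all made up for the example, and a protein is counted once under every slim term that is an ancestor of one of its annotations.

```python
from collections import Counter

# Toy child -> parents hierarchy and slim (assumptions for this example).
PARENTS = {
    "protein kinase activity": ["kinase activity"],
    "kinase activity": ["catalytic activity"],
    "DNA binding": ["binding"],
    "catalytic activity": [],
    "binding": [],
}
SLIM = {"catalytic activity", "binding"}


def ancestors(term, parents):
    """Return the term itself plus every term reachable via parent links."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(parents.get(t, []))
    return seen


def slim_composition(protein_terms, parents, slim_terms):
    """Count how many proteins fall under each slim term."""
    counts = Counter()
    for terms in protein_terms.values():
        hit = set()
        for t in terms:
            hit |= ancestors(t, parents) & slim_terms
        counts.update(hit)  # each protein counts once per slim term
    return counts


# Hypothetical proteome with existing annotations.
PROTEOME = {
    "PROT_1": ["protein kinase activity"],
    "PROT_2": ["DNA binding"],
    "PROT_3": ["kinase activity", "DNA binding"],
    "PROT_4": ["protein kinase activity"],
}
composition = slim_composition(PROTEOME, PARENTS, SLIM)
print(composition)
```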

Uses of GOA: Find functional information on proteins

Uses of GOA: Find functional information on interaction proteins (IntAct)

Uses of GOA: Overview proteome with GO Slim

Uses of GOA: Analysis of high-throughput data according to GO
- Microarray data analysis
- Proteomics data analysis
- GO classification
- Kislinger T et al, Mol Cell Proteomics, 2003
- Larkin JE et al, Physiol Genomics, 2004
- Cunliffe HE et al, Cancer Res, 2003

Future plans
- Continue deep-level annotation of human, mouse and rat
- Manually annotate splice variants
- Outreach and inclusion of new datasets, e.g. grape
- New electronic mappings, e.g. unipathway2go
- Ortholog prediction for electronic GO annotation
- Develop tools for annotation training

Acknowledgements
- Evelyn Camon, GOA Coordinator
- Daniel Barrell, GOA Programmer
- Emily Dimmer, GOA Curator
- Rachael Huntley, GOA Curator
- David Binns & John Maslen, QuickGO, GOA tools
- All EBI UniProtKB Curators, HAMAP (SIB), IntAct, GO Editorial EBI
- All GO Consortium & associate members
- Rolf Apweiler, Head of sequence database group