Presentation is loading. Please wait.

Presentation is loading. Please wait.

Introduction to the Gene Ontology and GO annotation resources

Similar presentations


Presentation on theme: "Introduction to the Gene Ontology and GO annotation resources"— Presentation transcript:

1 Introduction to the Gene Ontology and GO annotation resources
Rachael Huntley UniProt-GOA EBI Programmatic Access to Biological Databases 3rd October 2012

2 What is an Ontology? What is the difference between an ontology and a controlled vocabulary?

3 An ontology formally represents knowledge as a set of concepts within a domain, and the relationships among those concepts. It can be used to reason about the entities within that domain and may be used to describe the domain. A Controlled vocabularies provide a way to organize knowledge for subsequent retrieval but does not allow reasoning about the entities.

4 Ontologies and CVs in Biology
Ontologies/CVs are heavily used in biological databases Allow organisation of data within a database Enable linking between databases Enable searches across databases

5

6

7 What is GO?

8 The Gene Ontology A way to capture
Less specific concepts A way to capture biological knowledge for individual gene products in a written and computable form A set of concepts and their relationships to each other arranged as a hierarchy More specific concepts

9 The Concepts in GO 1. Molecular Function 2. Biological Process
protein kinase activity insulin receptor activity An elemental activity or task or job 2. Biological Process A commonly recognised series of events cell division mitochondrion mitochondrial matrix mitochondrial inner membrane 3. Cellular Component Where a gene product is located

10 Anatomy of a GO term Unique identifier Term name Synonyms Definition
Cross-references

11 Ontology structure Directed acyclic graph Terms are linked by
Terms can have more than one parent Terms are linked by relationships is_a part_of regulates (and +/- regulates) has_part occurs_in These relationships allow for complex analysis of large datasets

12 Relations Between GO Terms
is_a If A is a B, then A is a subtype of B part_of Wherever B exists, it is as part of A. But not all A is part of B. All replication forks are part of a chromosome Not all chromosomes have replication forks A B

13 Relations Between GO Terms Regulates
One process directly affects another process or quality Necessarily regulates: if both A and B are present, B always regulates A, but A may not always be regulated by B A B All cell cycle checkpoints regulate the cell cycle. The cell cycle is not solely regulated by cell cycle checkpoints

14 Process-Function Links in GO
GO was originally three completely independent hierarchies, with no relationships between them Biological processes are ordered assemblies of molecular functions As of 2009 we have started making relationships between biological process and molecular function in the live ontology functions that regulate processes e.g. transcription regulator regulates transcription process process function functions that are part_of processes e.g. transporter part_of transport function

15 Searching for GO terms Search GO terms or proteins

16 Exercise Search for a GO term Exercise 1 (pg.16)

17 Why do we need GO?

18 Reasons for the Gene Ontology
Inconsistency in English language

19 Inconsistency in English languauge
Same name for different concepts Cell or The English language is not precise ??

20 Eggplant Brinjal Aubergine Melongene Same for biological concepts
Different names for the same concept Eggplant Brinjal Aubergine Melongene Same for biological concepts Comparison is difficult – in particular across species or across databases Just one reason why the Gene Ontology (GO) is is needed…

21 Reasons for the Gene Ontology
Inconsistency in English language Increasing amounts of biological data available Increasing amounts of biological data to come

22 Increasing amounts of biological data available
Search on ‘DNA repair’... get over 68,000 results Expansion of sequence information

23 Reasons for the Gene Ontology
Inconsistency in English language Increasing amounts of biological data available Increasing amounts of biological data to come Large datasets need to be interpreted quickly

24 • Compile the ontologies
Aims of the GO project • Compile the ontologies - currently over 38,000 terms - constantly increasing and improving • Annotate gene products using ontology terms - around 30 groups provide annotations • Provide a public resource of data and tools - regular releases of annotations - tools for browsing/querying annotations and editing the ontology

25 Reactome

26 GO Annotation

27 UniProt-Gene Ontology Annotation (UniProt-GOA) project at the EBI
Largest open-source contributor of annotations to GO Provides annotation for more than 350,000 species Our priority is to annotate the human proteome

28 A GO annotation is … …a statement that a gene product;
has a particular molecular function or is involved in a particular biological process or is located within a certain cellular component as determined by a particular method as described in a particular reference P00505 Accession Name GO ID GO term name Reference Evidence code IDA PMID: aspartate transaminase activity GO: GOT2

29 UniProt-GOA incorporates annotations made using two methods
Electronic Annotation Quick way of producing large numbers of annotations Annotations use less-specific GO terms Only source of annotation for many non-model organism species Manual Annotation Time-consuming process producing lower numbers of annotations Annotations tend to use very specific GO terms

30 Electronic annotation methods
Mapping of external concepts to GO terms GO: : Nucleus GO: : MAP kinase activity GO: : Auxin mediated signaling pathway

31 Electronic annotation methods
2. Automatic transfer of manual annotations to orthologs Ensembl compara Macaque Chimpanzee Guinea Pig Rat ...and more e.g. Human Mouse Cow Dog Chicken Arabidopsis Ensembl compara …and more Electronic annotations are important for non-model species as these do not have a dedicated manual annotation effort Poplar Brachypodium Maize Grape Rice Annotations are high-quality and have an explanation of the method (GO_REF)

32 Manual annotation by GOA
High–quality, specific annotations made using: Full text peer-reviewed papers A range of evidence codes to categorise the types of evidence found in a paper e.g. IDA, IMP, IPI

33 Number of annotations in UniProt-GOA database
Electronic annotations 102,205,043 1,149,802 Manual annotations* Sep 2012 Statistics * Includes manual annotations integrated from external model organism and specialist groups

34 How to access and use GO annotation data

35 Where can you find annotations?
UniProtKB Ensembl Entrez gene

36 UniProt vs. QuickGO annotation display

37 UniProt vs. QuickGO annotation display
Filtering mechanism Exclude root terms (e.g. Molecular Function) Exclude annotations with qualifier (e.g. NOT, contributes_to) Exclude annotations to less granular terms Exclude annotations to GO: protein binding Exclude lower quality assignments for same data, e.g. UniProt taken in preference to MGI Add electronic annotations that cover ground not covered by manual annotation

38 Gene Association Files
17 column files containing all information for each annotation GO Consortium website UniProt-GOA website Numerous species-specific files

39 GO browsers

40 The EBI's QuickGO browser
Search GO terms or proteins Find sets of GO annotations

41 Exercise Find annotations to a protein Exercise 2 (pg.16)
Find annotations to a list of proteins Exercise 1 and 2 (pg.22)

42 How scientists use the GO
Access gene product functional information Analyse high-throughput genomic or proteomic datasets Validation of experimental techniques Get a broad overview of a proteome Obtain functional information for novel gene products  Some examples…

43 Term enrichment Most popular type of GO analysis
Determines which GO terms are more often associated with a specified list of genes/proteins compared with a control list or rest of genome Many tools available to do this analysis User must decide which is best for their analysis

44 Analysis of high-throughput genomic datasets
attacked time control Puparial adhesion Molting cycle Hemocyanin Defense response Immune response Response to stimulus Toll regulated genes JAK-STAT regulated genes Amino acid catabolism Lipid metobolism Peptidase activity Protein catabolism MicroArray data analysis Genes differentially expressed during parasitoid attack versus control Bregje Wertheim at the Centre for Evolutionary Genomics, Department of Biology, UCL and Eugene Schuster Group, EBI.

45 Annotating novel sequences
Can use BLAST queries to find similar sequences with GO annotation which can be transferred to the new sequence Two tools currently available; AmiGO BLAST – searches the GO Consortium database BLAST2GO – searches the NCBI database

46 Using the GO to provide a functional overview for a large dataset
Many GO analysis tools use GO slims to give a broad overview of the dataset GO slims are cut-down versions of the GO and contain a subset of the terms in the whole GO GO slims usually contain less-specialised GO terms

47 Slimming the GO using the ‘true path rule’
Many gene products are associated with a large number of descriptive, leaf GO nodes:

48 Slimming the GO using the ‘true path rule’
…however annotations can be mapped up to a smaller set of parent GO terms:

49 GO slims Custom slims are available for download;
or you can make your own using; QuickGO AmiGO's GO slimmer

50 The EBI's QuickGO browser
Search GO terms or proteins Find sets of GO annotations Map-up annotations with GO slims

51 Precautions when using GO annotations for analysis
The Gene Ontology is always changing and GO annotations are continually being created - always use a current version of both - if publishing your analyses please report the versions/dates you used Recommended that ‘NOT’ annotations are removed before analysis - only ~3000 out of 57 million annotations are ‘NOT’ - can confuse the analysis

52 Precautions when using GO annotations for analysis
Unannotated is not unknown - where there is no evidence in the literature for a process, function or location the gene product is annotated to the appropriate ontology’s root node with an ‘ND’ evidence code (no biological data), thereby distinguishing between unannotated and unknown Pay attention to under-represented GO terms - a strong under-representation of a pathway may mean that normal functioning of that pathway is necessary for the given condition

53 Exercise Use the QuickGO web services to retrieve annotations
Exercise 1 Pg.46

54 The UniProt-GOA group Project leader: Rachael Huntley Curators:
Yasmin Alam-Faruque Prudence Mutowo Software developer: Tony Sawford Team leaders: Rolf Apweiler Claire O’Donovan

55 Acknowledgements Members of; UniProtKB InterPro Ensembl IntAct
HAMAP Ensembl Ensembl Genomes GO Consortium Funding National Human Genome Research Institute (NHGRI) British Heart Foundation EMBL


Download ppt "Introduction to the Gene Ontology and GO annotation resources"

Similar presentations


Ads by Google