Introduction to the Gene Ontology


Introduction to the Gene Ontology Genomic Annotation and Functional Modeling Workshop Maxwell H. Gluck Equine Research Center 15-16 November, 2011

Introduction to GO: The Gene Ontology Consortium; The Gene Ontology; A GO annotation example; GO evidence codes; No GO vs ND; Making annotations; Multiple annotations: the gene association (ga) file; Sources of GO.

The Gene Ontology Consortium

http://www.geneontology.org/

The GO Consortium provides: a central repository for ontology updates and annotations; a central mechanism for changing GO terms (adding, editing, deleting); quality checking for annotations; consistency checks for how annotations are made by different groups; a central source of information for users; co-ordination of annotation effort.

GO Consortium and GO Groups: groups decide which gene product set to annotate; biocurator training; tool development, mostly by groups; many non-consortium groups; education and training by groups; outreach to biocurators/databases by GOC.

Annotation Strategy: experimental data and computational analysis.
Experimental data: many species have a body of published, experimental data; gives detailed, species-specific annotation (‘depth’); requires manual annotation of literature, so it is slow.
Computational analysis: can be automated, so it is faster; gives ‘breadth’ of coverage across the genome; annotations are general; relatively few annotation pipelines.

Releasing GO Annotations: GO annotations are stored at individual databases. Sanity checks are run as data are entered (are all the required data filled in?). Databases do quality control (QC) checks and submit to GO. The GO Consortium runs additional QC and collates annotations. Checked annotations are picked up by GO users, e.g. public databases, genome browsers, array vendors and GO expression analysis tools.

AgBase Quality Checks & Releases
[Figure: flow diagram. Nodes: AgBase biocurators; AgBase biocuration interface; AgBase database; EBI GOA Project; UniProt db; QuickGO browser; GO Consortium database; AmiGO browser; public databases; GO analysis tools; microarray developers. Steps between nodes are labelled ‘sanity’ check or ‘sanity’ check & GOC QC.]
‘Sanity’ check: checks to ensure all appropriate information is captured, no obsolete GO:IDs are used, etc.
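A ‘sanity’ check of the kind described above might look like the following minimal sketch. It only covers the two checks the slide names (all required information captured, no obsolete GO:IDs). The field names, the `sanity_check` helper, the placeholder reference and the obsolete ID are all hypothetical and are not taken from any AgBase or GOC tool.

```python
# Hypothetical field names for one annotation record; not an official schema.
REQUIRED_FIELDS = ("db", "object_id", "go_id", "reference",
                   "evidence", "aspect", "date", "assigned_by")


def sanity_check(annotation, obsolete_ids):
    """Return a list of problems found in one annotation record (a dict)."""
    problems = []
    # Check 1: is all the required data filled in?
    for field in REQUIRED_FIELDS:
        if not annotation.get(field):
            problems.append(f"missing {field}")
    # Check 2: is the GO ID still current?
    go_id = annotation.get("go_id", "")
    if go_id in obsolete_ids:
        problems.append(f"obsolete GO ID {go_id}")
    return problems


# Illustrative record; the PMID and date are placeholders, not real data.
good = {
    "db": "UniProtKB", "object_id": "P52505", "go_id": "GO:0005759",
    "reference": "PMID:0000000", "evidence": "IDA", "aspect": "C",
    "date": "20111115", "assigned_by": "AgBase",
}
bad = dict(good, evidence="", go_id="GO:9999999")  # made-up obsolete ID

print(sanity_check(good, obsolete_ids={"GO:9999999"}))
print(sanity_check(bad, obsolete_ids={"GO:9999999"}))
```

Returning a list of problems rather than a single pass/fail flag lets a curation interface show every issue with a record at once.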

The Gene Ontology

Gene Ontology (GO): Not about genes! The GO describes gene product function. Gene products: genes, transcripts, ncRNA, proteins. Not a single ontology, but three: Biological Process (BP or P); Molecular Function (MF or F); Cellular Component (CC or C). It is the de facto method for functional annotation and is widely used for functional genomics (high throughput).

What the GO doesn’t do: Does not describe individual gene products, e.g. cytochrome c is not in the GO but oxidoreductase activity is. Does not describe mutants or diseases, e.g. oncogenesis. Does not include sequence attributes, e.g. exons, introns, protein domains. Is not a database of sequences.

What is the Gene Ontology? “a controlled vocabulary that can be applied to all organisms even as knowledge of gene and protein roles in cells is accumulating and changing”. The GO: assigns functions to gene products at different levels, depending on how much is known about a gene product; is used for a diverse range of species; is structured to be queried at different levels, e.g. find all the chicken gene products in the genome that are involved in signal transduction, or zoom in on all the receptor tyrosine kinases. Each human-readable GO function has a digital tag to allow computational analysis of large datasets.

Ontologies: relationships between terms; digital identifier (for computers); description (for humans). As of ontology version 1.2291 (30/09/2011): 35,029 terms, 100.0% defined, of which 21,439 biological process, 2,898 cellular component and 9,107 molecular function, plus 1,585 obsolete terms (not included in the three per-ontology figures above).
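Because terms are linked by relationships, a query for a broad term can also pick up gene products annotated to its more specific descendants. The sketch below illustrates this with a toy ontology. The `TOY:` identifiers, term names, gene names and both helper functions are invented for illustration and are not real GO data.

```python
# Toy ontology: child term -> list of parent terms. All IDs are made up.
PARENTS = {
    "TOY:0003": ["TOY:0002"],  # receptor tyrosine kinase signalling -> signal transduction
    "TOY:0002": ["TOY:0001"],  # signal transduction -> root term
    "TOY:0004": ["TOY:0001"],  # metabolism -> root term
}

# Toy annotations: gene product -> set of directly annotated terms.
ANNOTATIONS = {
    "geneA": {"TOY:0003"},
    "geneB": {"TOY:0002"},
    "geneC": {"TOY:0004"},
}


def ancestors(term):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = [term]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def products_for(term):
    """Gene products annotated to `term` itself or to any of its descendants."""
    return sorted(
        gene for gene, terms in ANNOTATIONS.items()
        if any(t == term or term in ancestors(t) for t in terms)
    )


print(products_for("TOY:0002"))  # broad query: signal transduction
print(products_for("TOY:0003"))  # zoomed in: receptor tyrosine kinases
```

The broad query returns both geneA and geneB, while the specific query returns only geneA, which mirrors querying the ontology at different levels.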

A GO Annotation example

A GO Annotation Example: NDUFAB1 (UniProt P52505), bovine NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1, 8kDa
Biological Process (BP or P)
  GO:0006633  fatty acid biosynthetic process  TAS
  GO:0006120  mitochondrial electron transport, NADH to ubiquinone  TAS
  GO:0008610  lipid biosynthetic process  IEA
Molecular Function (MF or F)
  GO:0005504  fatty acid binding  IDA
  GO:0008137  NADH dehydrogenase (ubiquinone) activity  TAS
  GO:0016491  oxidoreductase activity  TAS
  GO:0000036  acyl carrier activity  IEA
Cellular Component (CC or C)
  GO:0005759  mitochondrial matrix  IDA
  GO:0005747  mitochondrial respiratory chain complex I  IDA
  GO:0005739  mitochondrion  IEA

A GO Annotation Example: NDUFAB1 (UniProt P52505), bovine NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1, 8kDa. Each annotation consists of: the aspect or ontology; the GO:ID (unique); the GO term name; the GO evidence code.
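The four parts of an annotation (aspect, GO:ID, term name, evidence code) map naturally onto a small record type. The following sketch stores the NDUFAB1 annotations from the example slide and groups them by aspect. The `Annotation` type and the `by_aspect` helper are illustrative names, not part of any GO tool.

```python
from collections import defaultdict
from typing import NamedTuple


class Annotation(NamedTuple):
    aspect: str    # P, F or C
    go_id: str     # unique GO:ID
    term: str      # GO term name
    evidence: str  # GO evidence code


# Annotations for NDUFAB1 (UniProt P52505), as listed on the example slide.
NDUFAB1 = [
    Annotation("P", "GO:0006633", "fatty acid biosynthetic process", "TAS"),
    Annotation("P", "GO:0006120", "mitochondrial electron transport, NADH to ubiquinone", "TAS"),
    Annotation("P", "GO:0008610", "lipid biosynthetic process", "IEA"),
    Annotation("F", "GO:0005504", "fatty acid binding", "IDA"),
    Annotation("F", "GO:0008137", "NADH dehydrogenase (ubiquinone) activity", "TAS"),
    Annotation("F", "GO:0016491", "oxidoreductase activity", "TAS"),
    Annotation("F", "GO:0000036", "acyl carrier activity", "IEA"),
    Annotation("C", "GO:0005759", "mitochondrial matrix", "IDA"),
    Annotation("C", "GO:0005747", "mitochondrial respiratory chain complex I", "IDA"),
    Annotation("C", "GO:0005739", "mitochondrion", "IEA"),
]


def by_aspect(annotations):
    """Group annotations by their aspect letter (P, F or C)."""
    grouped = defaultdict(list)
    for annotation in annotations:
        grouped[annotation.aspect].append(annotation)
    return dict(grouped)


for aspect, items in by_aspect(NDUFAB1).items():
    print(aspect, [a.go_id for a in items])
```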

GO Evidence codes & Making annotations

Why record GO evidence code? GO did not initially record evidence for functional assertion (NR: Not Recorded). “Inferred from…”: to infer is to deduce or conclude (information) from evidence and reasoning. The evidence code provides information about the support for associating a gene product with a function. Different experiments allow us to draw different conclusions, so the code also indicates reliability.

Types of GO Evidence Codes: Experimental Evidence Codes; Computational Analysis Evidence Codes; Author Statement Evidence Codes; Curator Statement Evidence Codes; Automatically-assigned Evidence Codes; Obsolete Evidence Codes.

Guide to GO Evidence Codes: http://www.geneontology.org/GO.evidence.shtml
GO EVIDENCE CODES
Direct Evidence Codes
  IDA - inferred from direct assay
  IEP - inferred from expression pattern
  IGI - inferred from genetic interaction
  IMP - inferred from mutant phenotype
  IPI - inferred from physical interaction
Indirect Evidence Codes
  inferred from literature
    IGC - inferred from genomic context
    TAS - traceable author statement
    NAS - non-traceable author statement
    IC - inferred by curator
  inferred by sequence analysis
    RCA - inferred from reviewed computational analysis
    IS* - inferred from sequence*
      ISS - inferred from sequence or structural similarity
      ISA - inferred from sequence alignment
      ISO - inferred from sequence orthology
      ISM - inferred from sequence model
    IEA - inferred from electronic annotation
Other
  NR - not recorded (historical)
  ND - no biological data available
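For working with annotation files it can help to keep the evidence codes listed above in a lookup table. This is a minimal sketch that copies the codes and descriptions from the slide and flags the five direct codes. The `describe` helper and the "indirect or other" label are assumptions made for illustration, not GO terminology.

```python
# Codes and descriptions as listed on the slide.
EVIDENCE_CODES = {
    "IDA": "inferred from direct assay",
    "IEP": "inferred from expression pattern",
    "IGI": "inferred from genetic interaction",
    "IMP": "inferred from mutant phenotype",
    "IPI": "inferred from physical interaction",
    "IGC": "inferred from genomic context",
    "TAS": "traceable author statement",
    "NAS": "non-traceable author statement",
    "IC": "inferred by curator",
    "RCA": "inferred from reviewed computational analysis",
    "ISS": "inferred from sequence or structural similarity",
    "ISA": "inferred from sequence alignment",
    "ISO": "inferred from sequence orthology",
    "ISM": "inferred from sequence model",
    "IEA": "inferred from electronic annotation",
    "NR": "not recorded (historical)",
    "ND": "no biological data available",
}

# The slide's "Direct Evidence Codes" group.
DIRECT_CODES = frozenset({"IDA", "IEP", "IGI", "IMP", "IPI"})


def describe(code):
    """Return a one-line description of an evidence code."""
    code = code.strip().upper()
    if code not in EVIDENCE_CODES:
        raise KeyError(f"unknown evidence code: {code}")
    kind = "direct" if code in DIRECT_CODES else "indirect or other"
    return f"{code}: {EVIDENCE_CODES[code]} ({kind})"


print(describe("IDA"))
print(describe("iea"))
```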

GO Mapping Example (NDUFAB1). Biocuration of literature: detailed function; “depth”; slower (manual).

Biocuration of Literature (P05147): detailed gene function. Find a paper about the protein. PMID: 2976880

Read the paper to get experimental evidence of function, and use the most specific term possible. The experiment assayed kinase activity: use the IDA evidence code. The same piece of data (IDA) demonstrates that this gene product inhibits protein kinase activity and thus is involved in the negative regulation of protein amino acid phosphorylation.

GO Mapping Example (NDUFAB1). Biocuration of literature: detailed function; “depth”; slower (manual). Sequence analysis: rapid (computational); “breadth” of coverage; less detailed. Sequence analysis uses the IS* codes: ISS - inferred from sequence or structural similarity; ISA - inferred from sequence alignment; ISO - inferred from sequence orthology; ISM - inferred from sequence model.

Computational Analysis Evidence. In the beginning: IGC: Inferred from Genomic Context, e.g. operons; RCA: inferred from Reviewed Computational Analysis, i.e. computational analyses that integrate datasets of several types; ISS: Inferred from Sequence or Structural Similarity.

Computational Analysis Evidence. Then different types of sequence analysis were added: ISS: Inferred from Sequence or Structural Similarity; ISO: Inferred from Sequence Orthology; ISA: Inferred from Sequence Alignment; ISM: Inferred from Sequence Model.

Computational Analysis Evidence. Phylogenetic analysis codes were added: IBA: Inferred from Biological aspect of Ancestor; IBD: Inferred from Biological aspect of Descendant; IKR: Inferred from Key Residues, characterized by the loss of key sequence residues (implies a NOT annotation); IRD: Inferred from Rapid Divergence, characterized by rapid divergence from ancestral sequence (implies a NOT annotation). Exact details/use under discussion.

Unknown Function vs No GO. ND (no data): biocurators have tried to add GO but there is no functional data available. Previously these were annotated to “process_unknown”, “function_unknown”, “component_unknown”; now they are annotated to “biological process”, “molecular function”, “cellular component”. No annotations (including no “ND”): biocurators have not annotated the gene product. This is important for your dataset: what % has GO?
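The "what % has GO?" question can be answered by separating three cases: gene products with at least one informative annotation, gene products with only ND annotations, and gene products with no annotation at all. The sketch below is one way to compute this. The `go_coverage` function, the gene identifiers and the input layout (pairs of gene product ID and evidence code) are all hypothetical.

```python
from collections import defaultdict


def go_coverage(dataset_ids, annotations):
    """Percentage of a dataset that is annotated, ND-only, or has no GO.

    dataset_ids: list of gene product IDs in your dataset.
    annotations: iterable of (gene_product_id, evidence_code) pairs.
    """
    codes_by_product = defaultdict(set)
    for product_id, code in annotations:
        codes_by_product[product_id].add(code)

    counts = {"annotated": 0, "nd_only": 0, "no_go": 0}
    for product_id in dataset_ids:
        codes = codes_by_product.get(product_id)
        if not codes:
            counts["no_go"] += 1        # biocurators have not annotated
        elif codes == {"ND"}:
            counts["nd_only"] += 1      # looked for data, none available
        else:
            counts["annotated"] += 1    # at least one non-ND annotation

    total = len(dataset_ids)
    if total == 0:
        return {key: 0.0 for key in counts}
    return {key: 100.0 * value / total for key, value in counts.items()}


# Made-up example dataset.
dataset = ["g1", "g2", "g3", "g4"]
annots = [("g1", "IDA"), ("g2", "ND"), ("g4", "IEA"), ("g4", "ND")]
print(go_coverage(dataset, annots))
```

Keeping ND-only separate from unannotated products matters because the first means curators looked and found nothing, while the second means nobody has looked yet.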

Multiple Annotations: gene association files

The gene association (ga) file: the standard file format used to capture GO annotation data. It is a tab-delimited file containing 15* fields of information: information about the gene product (database, accession, name, symbol, synonyms, species); information about the function (GO ID, ontology, reference, evidence, qualifiers, context (with/from)); data about the functional annotation (date, annotator).
* GO Annotation File Format 2.0 has two additional columns compared to GAF 1.0: annotation extension (column 16) and gene product form ID (column 17).

http://www.geneontology.org/GO.format.gaf-2_0.shtml

[Figure: example gene association file (additional column added to this example). Labelled regions: gene product information; function information; metadata (when & who). Two columns are highlighted: one used to give more specific information about the evidence code (not always displayed), and one used to qualify the annotation (not always displayed).]
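The tab-delimited layout of a gene association file can be read with a few lines of code. The sketch below assumes the 17-column GAF 2.0 layout and treats lines starting with `!` as comments. The dictionary key names are informal labels chosen here, and the sample line is constructed for illustration with a placeholder PMID and date, so it is not a real annotation record.

```python
# Informal labels for the 17 GAF 2.0 columns, in file order.
GAF2_COLUMNS = [
    "db", "db_object_id", "db_object_symbol", "qualifier", "go_id",
    "db_reference", "evidence_code", "with_from", "aspect",
    "db_object_name", "db_object_synonym", "db_object_type", "taxon",
    "date", "assigned_by", "annotation_extension", "gene_product_form_id",
]


def parse_gaf(lines):
    """Yield one dict per annotation line, skipping comments and blanks."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("!"):
            continue  # header/comment lines begin with '!'
        fields = line.split("\t")
        # Pad short lines (e.g. 15-column GAF 1.0) with empty strings.
        fields += [""] * (len(GAF2_COLUMNS) - len(fields))
        yield dict(zip(GAF2_COLUMNS, fields))


# Illustrative line only: reference and date are placeholders.
sample = "\t".join([
    "UniProtKB", "P52505", "NDUFAB1", "", "GO:0005759", "PMID:0000000",
    "IDA", "", "C",
    "NADH dehydrogenase (ubiquinone) 1, alpha/beta subcomplex, 1, 8kDa",
    "", "protein", "taxon:9913", "20111115", "AgBase", "", "",
])

for record in parse_gaf(["!gaf-version: 2.0", sample]):
    print(record["db_object_symbol"], record["go_id"], record["evidence_code"])
```

Using a generator keeps memory use low for large files, since a file object can be passed directly as `lines`.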

Gene association files.
GO Consortium ga files: many organism-specific files; also includes EBI GOA files.
EBI GOA ga files: the UniProt file contains GO annotation for all species represented in UniProtKB.
AgBase ga files: organism-specific files; the AgBase GOC file is submitted to the GO Consortium & EBI GOA; the AgBase Community file holds GO annotations not yet submitted or not supported, and annotations provided by researchers; all files are quality checked.

http://www.geneontology.org

http://www.ebi.ac.uk/GOA/

http://www.agbase.msstate.edu/

Sources of GO.
Primary sources of GO: from the GO Consortium (GOC) & GOC members; most up to date; most comprehensive.
Secondary sources: other resources that use GO provided by GOC members, e.g. public databases (NCBI, UniProtKB), genome browsers (Ensembl), array vendors (Affymetrix) and GO expression analysis tools.

Sources of GO annotation: Different tools and databases display the GO annotations differently. Since GO terms are continually changing and GO annotations are continually added, you need to know when the GO annotations were last updated.

Secondary Sources of GO annotation.
EXAMPLES: public databases (e.g. NCBI, UniProtKB); genome browsers (e.g. Ensembl); array vendors (e.g. Affymetrix).
CONSIDERATIONS: What is the original source? When was it last updated? Are evidence codes displayed?

Differences in displaying GO annotations: secondary/tertiary sources.

For more information about GO:
GO Evidence Codes: http://www.geneontology.org/GO.evidence.shtml
gene association file information: http://www.geneontology.org/GO.format.annotation.shtml
tools that use the GO: http://www.geneontology.org/GO.tools.shtml
GO Consortium wiki: http://wiki.geneontology.org/index.php/Main_Page
All websites are listed on the AgBase workshop website.