Gene Product Annotation using the GO ml Harold J Drabkin Senior Scientific Curator The Jackson Laboratory.

Slides:



Advertisements
Similar presentations
Annotation of Gene Function …and how thats useful to you.
Advertisements

Applications of GO. Goals of Gene Ontology Project.
GO : the Gene Ontology “because you know sometimes words have two meanings” Amelia Ireland GO Curator EBI, Cambridge, UK.
Annotating Gene Products to the GO Harold J Drabkin Senior Scientific Curator The Jackson Laboratory Mouse.
Gene Ontology John Pinney
Gene function analysis Stem Cell Network Microarray Course, Unit 5 May 2007.
Today’s menu: -UniProt - SwissProt/TrEMBL -PROSITE -Pfam -Gene Onltology Protein and Function Databases Tutorial 7.
Today’s menu: -UniProt - SwissProt/TrEMBL -PROSITE -Pfam -Gene Onltology Protein and Function Databases Tutorial 7.
Today’s menu: -SwissProt/TrEMBL -PROSITE -Pfam -Gene Onltology Protein and Function Databases Tutorial 7.
Comprehensive Annotation System for Infectious Disease Data Alexander Diehl University at Buffalo/The Jackson Laboratory IDO Workshop /9/2010.
Protein and Function Databases
BICH CACAO Biocurator Training Session #3.
Today’s menu: -UniProt - SwissProt/TrEMBL -PROSITE -Pfam -Gene Onltology Protein and Function Databases Tutorial 7.
Predicting Function (& location & post-tln modifications) from Protein Sequences June 15, 2015.
BTN323: INTRODUCTION TO BIOLOGICAL DATABASES Day2: Specialized Databases Lecturer: Junaid Gamieldien, PhD
PAT project Advanced bioinformatics tools for analyzing the Arabidopsis genome Proteins of Arabidopsis thaliana (PAT) & Gene Ontology (GO) Hongyu Zhang,
A Common Language for Annotation of Genes from Yeast, Flies and Mice The Gene Ontologies …and Plants and Worms …and Humans …and anything else!
Wellcome Trust Workshop Working with Pathogen Genomes Module 3 Sequence and Protein Analysis (Using web-based tools)
Viewing & Getting GO COST Functional Modeling Workshop April, Helsinki.
SPH 247 Statistical Analysis of Laboratory Data 1 May 12, 2015 SPH 247 Statistical Analysis of Laboratory Data.
Using The Gene Ontology: Gene Product Annotation.
GO : the Gene Ontology “because you know sometimes words have two meanings” Amelia Ireland GO Curator EBI, Cambridge, UK.
CACAO training part 1 Jim Hu and Suzi Aleksander For UW Parkside Fall 2014.
CACAO Training Fall Community Assessment of Community Annotation with Ontologies (CACAO)
Annotating Gene Products to the GO Harold J Drabkin Senior Scientific Curator The Jackson Laboratory Mouse.
SPH 247 Statistical Analysis of Laboratory Data 1May 14, 2013SPH 247 Statistical Analysis of Laboratory Data.
Fission Yeast Computing Workshop -1- Searching, querying, browsing downloading and analysing data using PomBase Basic PomBase Features Gene Page Overview.
Monday, November 8, 2:30:07 PM  Ontology is the philosophical study of the nature of being, existence or reality as such, as well as the basic categories.
From Functional Genomics to Physiological Model: Using the Gene Ontology Fiona McCarthy, Shane Burgess, Susan Bridges The AgBase Databases, Institute of.
Manual GO annotation Evidence: Source AnnotationsProteins IEA:Total Manual: Total
Introduction to the GO: a user’s guide Iowa State Workshop 11 June 2009.
SRI International Bioinformatics 1 Submitting pathway to MetaCyc Ron Caspi.
24th Feb 2006 Jane Lomax GO Further. 24th Feb 2006 Jane Lomax GO annotations Where do the links between genes and GO terms come from?
Alastair Kerr, Ph.D. WTCCB Bioinformatics Core An introduction to DNA and Protein Sequence Databases.
Protein and RNA Families
Getting Started: a user’s guide to the GO GO Workshop 3-6 August 2010.
Functional Annotation and Functional Enrichment. Annotation Structural Annotation – defining the boundaries of features of interest (coding regions, regulatory.
1 Gene function annotation. 2 Outline  Functional annotation  Controlled vocabularies  Functional annotation at TAIR  Resources and tools at TAIR.
Other biological databases and ontologies. Biological systems Taxonomic data Literature Protein folding and 3D structure Small molecules Pathways and.
Operated by Los Alamos National Security, LLC for NNSA Bioscience Discovering virulence genes present in novel strains and metagenomes Chris Stubben IC.
Getting Started: a user’s guide to the GO TAMU GO Workshop 17 May 2010.
A Common Language for Annotation of Genes from Yeast, Flies and Mice The Gene Ontologies …and Plants and Worms …and Humans …and anything else!
Rice Proteins Data acquisition Curation Resources Development and integration of controlled vocabulary Gene Ontology Trait Ontology Plant Ontology
CACAO Training Fall Community Assessment of Community Annotation with Ontologies (CACAO)
Introduction to the GO: a user’s guide NCSU GO Workshop 29 October 2009.
Update Susan Bridges, Fiona McCarthy, Shane Burgess NRI
CACAO Training Jim Hu and Suzi Aleksander Fall 2015.
EBI is an Outstation of the European Molecular Biology Laboratory. UniProtKB Sandra Orchard.
1 Annotation EPP 245/298 Statistical Analysis of Laboratory Data.
Getting GO: how to get GO for functional modeling Iowa State Workshop 11 June 2009.
An example of GO annotation from a primary paper Rebecca E. Foulger (UniProt Curator) GO Annotation Camp, June 2005 PMID:
InterPro Sandra Orchard.
An example of GO annotation from a primary paper GO Annotation Camp, July 2006 PMID:
Welcome to the Protein Database Tutorial. This tutorial will describe how to navigate the section of Gramene that provides collective information on proteins.
Nitrogen Fixing GO Annotations UW Fall 2013 Example.
Protein families, domains and motifs in functional prediction May 31, 2016.
CACAO Training Jim Hu and Suzi Aleksander Fall 2015.
Gene Annotation & Gene Ontology
Protein families, domains and motifs in functional prediction
CACAO Training ASM-JGI 2012.
Annotating with GO: an overview
Introduction to the Gene Ontology
Sequence based searches:
Department of Genetics • Stanford University School of Medicine
Modified from slides from Jim Hu and Suzi Aleksander Spring 2016
Genome Annotation Continued
Predicting Active Site Residue Annotations in the Pfam Database
Gene expression analysis
Annotating Gene Products to the GO
Insight into GO and GOA Angelica Tulipano , INFN Bari CNR
Presentation transcript:

Gene Product Annotation using the GO ml Harold J Drabkin Senior Scientific Curator The Jackson Laboratory Mouse Genome Informatics Bar Harbor, ME

What is an annotation? An annotation is a statement that a gene product … …has a particular molecular function …is involved in a particular biological process …is located within a certain cellular component …as determined by a particular method …as described in a particular reference. Evidence Code Evidence Code GO Term GO Term Reference Smith et al. determined by a direct assay that Abc2 has protein kinase activity, is involved in the process of protein phosphorylation, and is located in the cytoplasm.

Anatomy of an annotation Object (gene product (or transcript coding for it, or gene that codes for it )‏ GO Term from most recent GO GO Term Qualifier (optional)‏ NOT, Co_localizes with, or Contributes_to Evidence Code : IDA, IPI, IMP, IEP, IGI, ISS, IEA, TAS, NAS, or IC Evidence Code Qualifier (required for some codes )‏ Used in combination with IPI, IMP, IGI, ISS, and IEA Seq_ID or DB_ID required. Reference: literature or database specific reference DB_ID or PMID

How does a gene product get GO annotation

Annotation Strategies Electronic (IEA)‏ Good for first pass Usually based on some sort of sequence comparisons (but use ISS if paper based)‏ IP2GO (InterPro Domains to GO SPTR2GO (UniProt to GO)‏ EC2GO (Enzyme commission)‏

Manual Annotation

P05147 PMID: Find a paper about the protein.

Read the entire paper

P05147 PMID: GO: Find the GO term describing its function, process or location of action.

Literature selection A paper is selected for GO curation of a mouse* gene product if: A paper gives direct experimental evidence for the normal function, process, or cellular location of a mouse* gene product (IDA, IMP, IGG, IPI). A paper gives direct experimental evidence for the normal function, process, or cellular location of a non- mouse gene product AND the paper presents homology data to a mouse gene product (ISS)‏ *** = substitute YOUR organism

Getting the GO

Annotation process READ the full papers! Abstracts alone can be very misleading Quite often, the species are not specified. Sometimes a paper uses human, mouse and rat interchangeably, or uses human for one gene and mouse for a different one.

P05147 PMID: IDA GO: What evidence do they show?

GO Evidence Codes Use with annotation to unknown Inferred from Sequence Similarity*ISS Inferred from Curator*IC Inferred from Expression PatternIEP Inferred from Mutant PhenotypeIMP Inferred from Genetic Interaction*IGI Inferred from Physical Interaction*IPI Inferred from Direct AssayIDA No DataND Traceable Author StatementTAS Non-traceable Author StatementNAS Inferred from Electronic AnnotationIEA DefinitionCode Manually annotated

Evidence Codes for GO Annotations Indicate the type of evidence in the cited source* that supports the association between the gene product and the GO term

Experimental codes - IDA, IMP, IGI, IPI, IEP Computational codes - ISS***, IEA, RCA, IGC Author statement - TAS, NAS Other codes - IC, ND Types of evidence codes

IDA Inferred from Direct Assay direct assay for the function, process, or component indicated by the GO term Enzyme assays In vitro reconstitution (e.g. transcription)‏ Immunofluorescence (for cellular component)‏ Cell fractionation (for cellular component)‏

IMP Inferred from Mutant Phenotype variations or changes such as mutations or abnormal levels of a single gene product Gene/protein mutation Deletion mutant RNAi experiments Specific protein inhibitors Allelic variation

IGI Inferred from Genetic Interaction Any combination of alterations in the sequence or expression of more than one gene or gene product Traditional genetic screens - Suppressors, synthetic lethals Functional complementation Rescue experiments An entry in the ‘with’ column is recommended

IPI Inferred from Physical Interaction Any physical interaction between a gene product and another molecule, ion, or complex 2-hybrid interactions Co-purification Co-immunoprecipitation Protein binding experiments An entry in the ‘with’ column is recommended

IEP Inferred from Expression Pattern Timing or location of expression of a gene Transcript levels Northerns, microarray Caution when interpreting expression results

ISS Inferred from Sequence or structural Similarity Sequence alignment, structure comparison, or evaluation of sequence features such as composition Sequence similarity Recognized domains/overall architecture of protein An entry in the ‘with’ column is recommended BUT, we have now expanded this

Three flavors of ISS ISA: Inferred from Sequence Alignment Sequence similarity with experimentally characterized gene products, as determined by alignments, either pairwise or multiple (tools such as BLAST, ClustalW, etc). An entry in the with field is mandatory. ISO: Inferred from Sequence Orthology Pairwise or multiple alignments between a query protein and experimentally characterized match proteins when the proteins are established to be orthologs of each other. An entry in the with field is mandatory.

ISS flavors con't ISM: Inferred from Sequence Model Prediction methods for non-coding RNA genes ( tRNASCAN-SE, Snoscan, and Rfam)‏ Predicted presence of recognized functional domains or membership in protein families ( profile Hidden Markov Models (HMMs), including Pfam and TIGRFAM)‏ Predicted protein features using tools such as TMHMM (transmembrane regions), SignalP (signal peptides on secreted proteins), and TargetP (subcellular localization)‏ any other kind of domain modeling tool or collections of them such as SMART, PROSITE, PANTHER, InterPro, etc. An entry in the with field is required when the model used is an object with an accession number (as found with Pfam, TIGRFAM, InterPro, PROSITE, Rfam, etc.) Left blank for tools such as tRNAscan and Snoscan where there is not an object with an accession to point to.

Used where an annotation is not supported by any evidence, but can be reasonably inferred by a curator from other GO annotations, for which evidence is available. The ‘with’ field is required, and is populated by a GO id using the same reference Example: Ref. 1 shows that a gene product has chloride channel activity (GO: :) by direct assay (IDA). A curator can then add the component annotation ‘integral to membrane’ (GO: ) using the IC evidence code and put GO: in the “with” field. Caution: The IC evidence code should not be used for something obvious. For example, if a gene product is being annotated to the function “protein kinase activity” (GO: ) by IDA, then it is also involved in the process “protein amino acid phosphorylation” (GO: ) by the same experiment (IDA). Inferred from Curator (IC)

IEA Inferred from Electronic Annotation depend directly on computation or automated transfer of annotations from a database Hits from BLAST searches InterPro2GO mappings No manual checking Entry in ‘with’ column is allowed (ex. sequence ID)‏

RCA Inferred from Reviewed Computational Analysis non-sequence-based computational method large-scale experiments genome-wide two-hybrid genome-wide synthetic interactions integration of large-scale datasets of several types text-based computation (text mining)‏

TAS Traceable Author Statement Publication used to support an annotation doesn't show the evidence Review article Not encouraged for RefGenomes Would be better to track down cited reference and use an experimental code Statements in a paper that cannot be traced to another publication Not recommended for use by RefGenomes NAS Non-traceable Author Statement

IGC Inferred from Genomic Context Chromosomal position Most often used for Bacteria - operons Direct evidence for a gene being involved in a process is minimal, but for surrounding genes in the operon, the evidence is well-established

ND: No Data v.s. Unannotated GO annotates to the three root terms when the curator has determined that there is no existing literature to support an annotation. It is presumed that if it exists, it must have a Biological_process GO: , and a Molecular_function GO: , and be located in a Cellular_component GO: The evidence code ND is used Date of annotations tracted for QC These are NOT the same as having no annotation at all. No annotation means that no one has looked yet.

Example Annotations

Abstract suggests that this paper demonstrates that Ibtk –Binds to a protein kinase –Inhibits kinase activity –Inhibits calcium mobolization –Inhibits transcription

Evidence used for process and function Use most specific term possible Both IDA

Both Btk and iBtk have protein binding activity to each other, IPI evidence code

IDA evidence code

Abstract totally misses the sub-cellular localization!!!

Example: IMP

Some Special Cases

Qualifiers GO Term Qualifiers “NOT” Can be used with any term “contributes_to” Used for molecular function “co_localizes with” Used with cellular component Evidence Code Qualifiers Sequence ID (for ISS)‏ Protein ID (for IPI and protein binding)‏ Mutant ID (for IMP)‏ Gene (for IGI)‏ GO ID (for IC)‏

'NOT' is used to make an explicit note that the gene product is not associated with the GO term. This is particularly important in cases where associating a GO term with a gene product should be avoided (but might otherwise be made, especially by an automated method). e.g. This protein does not have ‘kinase activity’ because the Author states that this protein has a disrupted/missing an ‘ATP binding’ domain. Also used to document conflicting claims in the literature. NOT can be used with ALL three GO Ontologies. The “not” GO Term Qualifier

The ‘contributes_to’ qualifier The Qualifier documentation: Contributes_to: An individual gene product that is part of a complex can be annotated to terms that describe the action (function or process) of the complex. This practice is colloquially known as annotating 'to the potential of the complex‘. This qualifer allows us to distinguish the individual subunit from complex functions e.g. contributes_to ribosome binding when part of a complex but does not perform this function on its own. All gene products annotated using 'contributes_to' must also be annotated to a cellular component term representing the complex that possesses the activity. Only used with GO Function Ontology

Annotate to finest granularity Annotating to GO: automatically annotates to all of its parents; thus a product is annotated to both protein modification AND cytoskeleton organization

GO Does not annotate substrates A gene product that has protein kinase activity is also involved in the process of protein phosphorylation The protein that gets phosphorylated is NOT involved in the process of protein phosphorylation.

Sharing Annotations The Gene Association File

Annotation appears in AmiGO. Amigo Browser: –A GO browser that tracks contributed GO annotations across species. –Uses annotation sets supplied in a specific format.

Clark et al., 2005 Many species groups annotate. We see the research of one function across all species.

The Gene Association File The Gene Assocation File (GAF) is the means by which ~20 MODS using the GO for annotation to annotate ~ 50 organisms can share the annotations 15 column tab delimited text file

Anatomy of a gene association file

Changes in October In recognition that a gene can code for more than one protein isoform (due to both post transcriptional (splicing, editing) and post-translational (cleavage, etc.) processes AND in recognition that many MODs primary annotation object is “Gene” The format of the Gene Association file is scheduled to change

Two New Columns!

Column 2 Column 12 Column 17 Column 2 No change: MGI_id for the gene object Column 12 must always refer to column 17. if isoform, then “protein” if gene, then “gene” Column 17 Always filled If no isoform,the default Dbx_ref will be the MGI_id for the gene (identical entry to column 2 Dbx_ref for protein isoform when known (requires new field in EI; mine from private notes “gene_product” field until then

What about column 16? This column is “reserved” at the moment but May be used to specify such things as target of GO activity (eg, If a gene_product is a protein kinase, what are it's target protein)‏ other? modification? For the time being, the column will be blank