Gene Annotation & Gene Ontology


Gene Annotation & Gene Ontology June 1, 2017

Gene lists from RNA-seq analysis
What do you do with a list of hundreds of genes that contains only the following information?
- Gene name or symbol
- Ratio between groups (UP or DOWN)
- One or more database IDs (accession numbers)
How do you figure out the role of the genes in the model you are studying?

Gene annotation
The process of assigning descriptions to a transcript or gene product. Includes:
- Official gene symbol and name
- Protein features: domains and functional elements such as nuclear localization signals
- Predicted molecular function, biological process and cellular location
- Experimentally derived information on function, process and cellular location
- References

Who does the gene annotation?
- RefSeq and Gene databases: NCBI staff
- Ensembl databases (http://useast.ensembl.org): EMBL and the Wellcome Trust Sanger Institute
- UniProt: staff at the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR)
- Yeast DB, FlyBase, Mouse Genome Informatics (MGI) and other organism-specific databases

Gene record for BEST1

Ensembl Gene record for BEST1

UniProt record for BEST1

Gene, Ensembl or UniProt?
- What information are you looking for?
- What is your comfort level with the interface?
- All have anywhere from a little to LOTS of information
- Use any of them as a starting point

Dealing with gene lists
How can you efficiently categorize the genes in some biologically meaningful way?
- Batch download data from Gene or UniProt and do a lot of reading?
- PubMed?
- One approach is to use metadata in the form of terms assigned to each gene that describe its molecular function, its participation in a biological process and its location in a cellular component

Gene Ontology
- Set of standard biological phrases (terms) that are applied to genes/proteins, e.g. protein kinase, apoptosis, membrane
- Standardizes the representation of gene product attributes across species and databases
- Maintained by the Gene Ontology Consortium (http://geneontology.org/)
- Individual groups contribute taxon-specific terms

Cellular Component
Where a gene product acts. Example: mitochondria

Cellular components are not all the same between organisms

Cellular Component
Example: ribosome. Enzyme complexes in the component ontology refer to places, not activities.

Molecular Function
Activities or “jobs” of a gene product. Example: glucose-6-phosphate isomerase activity

Molecular Function
Examples: insulin binding, insulin receptor activity

Molecular Function
- A gene product may have several functions
- Sets of functions make up a biological process

Biological Process
A commonly recognized series of events. Example: cell division

Biological Process
Example: transcription

Biological Process
Example: regulation of gluconeogenesis

Biological Process
Example: limb development

Why use gene ontology?
- Allows biologists to make queries across large numbers of genes without researching each one individually
- For example, you can find all the PI3 kinases in a given genome, or all proteins involved in the oxidative stress response, without prior knowledge of every gene

TARDBP (TDP-43)
GO biological process: 3'-UTR-mediated mRNA stabilization, RNA splicing, mRNA processing
GO molecular function: RNA binding, double-stranded DNA binding, mRNA 3'-UTR binding
GO cellular component: cytoplasm, interchromatin granule, nuclear speck

Gene Ontology for analysis
- Ontology annotation is NOT complete
- Biological process terms are more useful for putting gene lists into context
- More GO terms are assigned to process than to function or component; the fewest are assigned to component
- Function in the absence of any process information can still imply a biological role, e.g. when you are looking for transcription factors responsible for some response

Ontology Structure
Terms are linked by two relationships: is-a and part-of

Ontology Structure
[Figure: example graph with the terms cell, membrane, chloroplast, mitochondrial membrane and chloroplast membrane, linked by is-a and part-of relationships]

GO structure
- GO isn’t just a flat list of biological terms; terms are related within a hierarchy
- Nucleic acid binding is a type of binding (is_a)
- DNA binding is a type of nucleic acid binding (is_a)

GO structure
A gene (A) that is associated with a particular term, e.g. ‘DNA replication’, is automatically annotated to all of that term’s parent terms.
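The propagation rule above can be sketched in a few lines of Python. This is only an illustration: the three-term hierarchy is the is_a example from the earlier GO structure slide, the gene name "geneA" and the function names are made up for the sketch, and the real GO graph is far larger, with terms that can have several parents.

```python
# Toy is_a hierarchy taken from the GO structure slide (child -> parents).
# Assumption: real GO terms can have multiple parents and other relationship
# types such as part_of; this sketch only follows the links it is given.
PARENTS = {
    "DNA binding": ["nucleic acid binding"],
    "nucleic acid binding": ["binding"],
    "binding": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    found = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(parents.get(current, []))
    return found


def propagate(direct_annotations, parents=PARENTS):
    """Expand each gene's directly assigned terms with all ancestor terms."""
    expanded = {}
    for gene, terms in direct_annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, parents)
        expanded[gene] = full
    return expanded


# Hypothetical gene annotated directly to the most specific term only.
annotations = propagate({"geneA": {"DNA binding"}})
print(sorted(annotations["geneA"]))
```

After propagation, the hypothetical geneA is also counted under "nucleic acid binding" and "binding", which is what lets a gene list be summarized at a coarser level.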

GO structure
- This means genes can be grouped according to user-defined levels
- Allows a broad overview of a gene set or genome
- You can use the level of granularity that makes the most sense

GO terms
Each concept has a name, an ID number and a definition:
term: transcription initiation
id: GO:0006352
definition: Processes involved in the assembly of the RNA polymerase complex at the promoter region of a DNA template, resulting in the subsequent synthesis of RNA from that promoter.
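As a rough sketch, the three parts of a GO term map naturally onto a small record type. The class name GOTerm and its field names are invented for this illustration and are not part of any GO software; the only rule checked is the identifier shape seen on the slide ("GO:" followed by seven digits).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GOTerm:
    """Hypothetical container for the name, ID and definition of one GO term."""

    go_id: str
    name: str
    definition: str

    def __post_init__(self):
        # Check the identifier shape shown on the slide: "GO:" plus seven digits.
        prefix, _, digits = self.go_id.partition(":")
        valid = prefix == "GO" and len(digits) == 7
        valid = valid and digits.isascii() and digits.isdigit()
        if not valid:
            raise ValueError(f"not a GO identifier: {self.go_id!r}")


term = GOTerm(
    go_id="GO:0006352",
    name="transcription initiation",
    definition=(
        "Processes involved in the assembly of the RNA polymerase complex "
        "at the promoter region of a DNA template, resulting in the "
        "subsequent synthesis of RNA from that promoter."
    ),
)
print(term.go_id, term.name)
```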

GO terms assigned to TARDBP

Types of evidence codes
Experimental:
- Inferred from Experiment (EXP)
- Inferred from Direct Assay (IDA)
- Inferred from Physical Interaction (IPI)
- Inferred from Mutant Phenotype (IMP)
- Inferred from Genetic Interaction (IGI)
- Inferred from Expression Pattern (IEP)

Types of evidence codes
Computational:
- Inferred from Sequence or structural Similarity (ISS)
- Inferred from Sequence Orthology (ISO)
- Inferred from Sequence Alignment (ISA)
- Inferred from Sequence Model (ISM)
- Inferred from Genomic Context (IGC)
- Inferred from Biological aspect of Ancestor (IBA)
- Inferred from Biological aspect of Descendant (IBD)
- Inferred from Key Residues (IKR)
- Inferred from Rapid Divergence (IRD)
- Inferred from Reviewed Computational Analysis (RCA)

Types of evidence codes
Other:
- Author Statement evidence codes: Traceable Author Statement (TAS), Non-traceable Author Statement (NAS)
- Curator Statement evidence codes: Inferred by Curator (IC), No biological Data available (ND)
- Automatically assigned: Inferred from Electronic Annotation (IEA)
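One practical use of evidence codes is filtering annotations before an analysis. The sketch below groups the codes as they are grouped on these slides and keeps only annotations from chosen groups. The group labels, function names and example annotations are all made up for illustration; the evidence codes attached to the example genes are not real annotations.

```python
# Evidence codes grouped as on the slides; the group labels are this
# sketch's own naming, not official GO identifiers.
EVIDENCE_GROUPS = {
    "experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "computational": {
        "ISS", "ISO", "ISA", "ISM", "IGC", "IBA", "IBD", "IKR", "IRD", "RCA",
    },
    "author statement": {"TAS", "NAS"},
    "curator statement": {"IC", "ND"},
    "electronic": {"IEA"},
}


def evidence_group(code):
    """Return the group label for an evidence code, or None if unknown."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code in codes:
            return group
    return None


def keep_groups(annotations, groups):
    """Keep (gene, term, code) tuples whose code belongs to one of `groups`."""
    return [a for a in annotations if evidence_group(a[2]) in groups]


# Hypothetical annotations, for illustration only.
annotations = [
    ("geneA", "RNA binding", "IDA"),
    ("geneA", "mRNA processing", "IEA"),
    ("geneB", "apoptosis", "ISS"),
    ("geneB", "membrane", "TAS"),
]
curated = keep_groups(annotations, {"experimental", "author statement"})
print(curated)
```

Whether to drop IEA annotations is a judgment call: for non-model organisms they may be most of what is available.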

Manual annotation
This is an example of how a curator might approach a paper to find GO terms:
“In this study, we report the isolation and molecular characterization of the B. napus PERK1 cDNA, that is predicted to encode a novel receptor-like kinase. We have shown that like other plant RLKs, the kinase domain of PERK1 has serine/threonine kinase activity. In addition, the location of a PERK1-GFP fusion protein to the plasma membrane supports the prediction that PERK1 is an integral membrane protein…these kinases have been implicated in early stages of wound response…”
- Molecular function: serine/threonine kinase activity
- Cellular component: plasma membrane, integral membrane protein
- Biological process: wound response

Electronic Annotation
- Annotation derived without human validation
- Sources: mappings files (e.g. interpro2go, ec2go) and BLAST search ‘hits’
- Lower ‘quality’ than manual codes
- Used in non-model organisms
- Define a similarity cut-off (e.g. an E-value of 10^-25)
Notes: Electronic annotation is where a human hasn’t looked at an annotation; it has been done entirely automatically. This can be from a mappings file (e.g. InterPro2go, spkw2go), from non-validated sequence similarity, or from a combination of different methods. These electronic methods produce very large numbers of annotations but, because they are not individually validated by a curator, they can be thought of as having a lower quality than curator-approved annotations.
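The similarity cut-off idea can be sketched as a simple filter over search hits. Everything here is hypothetical: the hit tuples, the sequence identifiers and the transfer_terms function are invented for illustration, and real pipelines typically apply further criteria (alignment coverage, identity) that this sketch ignores. Only the 10^-25 cut-off comes from the slide.

```python
E_VALUE_CUTOFF = 1e-25  # similarity cut-off quoted on the slide


def transfer_terms(hits, subject_terms, cutoff=E_VALUE_CUTOFF):
    """Copy GO terms from annotated subjects to queries for strong hits.

    hits: iterable of (query_id, subject_id, e_value) tuples
    subject_terms: dict mapping subject_id -> set of GO term names
    Returns a dict mapping query_id -> set of (term, "IEA") pairs, because
    annotations transferred this way have not been checked by a curator.
    """
    transferred = {}
    for query, subject, e_value in hits:
        if e_value <= cutoff:
            for term in subject_terms.get(subject, ()):
                transferred.setdefault(query, set()).add((term, "IEA"))
    return transferred


# Hypothetical search results and subject annotations.
hits = [
    ("query1", "subjectX", 3e-40),  # passes the cut-off
    ("query1", "subjectY", 2e-10),  # too weak, ignored
    ("query2", "subjectY", 1e-5),   # too weak, ignored
]
subject_terms = {"subjectX": {"apoptosis"}, "subjectY": {"membrane"}}

electronic = transfer_terms(hits, subject_terms)
print(electronic)
```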

Quality of annotation varies by organism

GO & analysis of gene lists
- www.geneontology.org maintains the databases of GO terms and serves as a clearing house for terms as they are assigned in new organisms
- Tools for exploring gene lists using GO: WebGestalt, gProfiler, Onto-Express and GSEA, to name a few
- DAVID is a suite of tools for gene enrichment analysis that also includes GO
- We’ll use both DAVID and WebGestalt to explore our gene list

Gene Ontology tools
- Take a gene list as input
- Show which GO categories have the most genes associated with them or are “enriched”
- Provide a statistical measure to determine whether the enrichment is significant
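The counting step these tools perform can be sketched as follows. The gene names, the gene-to-term mapping and the count_categories function are hypothetical and stand in for a real annotation file; the statistical test is a separate step.

```python
from collections import Counter


def count_categories(gene_list, gene_to_terms):
    """Count how many genes in `gene_list` are annotated to each GO term."""
    counts = Counter()
    for gene in gene_list:
        # Each gene's terms are a set, so a gene adds at most 1 per term.
        counts.update(gene_to_terms.get(gene, set()))
    return counts


# Hypothetical annotations and gene list, for illustration only.
gene_to_terms = {
    "gene1": {"mitosis", "cell proliferation"},
    "gene2": {"mitosis"},
    "gene3": {"apoptosis"},
    "gene4": {"glucose transport", "mitosis"},
}
gene_list = ["gene1", "gene2", "gene4", "gene5"]  # gene5 has no annotation

counts = count_categories(gene_list, gene_to_terms)
for term, n in counts.most_common():
    print(f"{term}: {n}/{len(gene_list)}")
```

Raw counts like these only show which categories are most common in the list; they do not say whether a category is enriched relative to the background.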

Using GO in practice
Experiment: a microarray with 1000 genes, of which 100 are differentially regulated. GO processes of the differentially regulated genes:
- mitosis: 80/100
- apoptosis: 40/100
- cell proliferation: 30/100
- glucose transport: 20/100
Notes: The better tools include a statistical measure of how likely it is that your differentially regulated genes fall into a category by chance. Why is that necessary? Imagine you do a microarray with 1000 genes and find that 100 are differentially regulated, falling into the GO processes above. It looks like mitosis is overrepresented…

Using GO in practice
However, when you look at the distribution of all genes on the microarray:
Process             Genes on array   # genes expected (out of 100)   # genes observed
Mitosis             800/1000         80                              80
Apoptosis           400/1000         40                              40
Cell proliferation  100/1000         10                              30
Glucose transport   50/1000          5                               20
Proportions analysis: Chi-squared or Fisher’s exact test
Notes: 80% of the genes on the array are involved in mitosis, so the number differentially regulated is what you’d expect by chance. The cell proliferation category actually contains more differentially regulated genes than you would expect by chance. You need a statistical test, e.g. Chi-squared, to see whether this overrepresentation, or enrichment, of a certain class is statistically significant.
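A minimal sketch of the test for the microarray example, using only the Python standard library. It computes the one-sided Fisher's exact test p-value as the upper tail of the hypergeometric distribution: the probability of seeing at least the observed number of category genes among the 100 differentially regulated genes if they were drawn at random from the 1000 on the array. The function name enrichment_p is made up here, and the sketch assumes every gene on the array is equally likely to be selected; it also applies no multiple-testing correction, which real tools usually do.

```python
from math import comb


def enrichment_p(total, in_category, selected, observed):
    """One-sided Fisher's exact test p-value (hypergeometric upper tail).

    total:       genes on the array
    in_category: genes on the array annotated to the GO category
    selected:    differentially regulated genes
    observed:    differentially regulated genes annotated to the category
    """
    upper = min(in_category, selected)
    denominator = comb(total, selected)
    tail = sum(
        comb(in_category, k) * comb(total - in_category, selected - k)
        for k in range(observed, upper + 1)
    )
    return tail / denominator


# (genes on array in category, genes observed among the 100 selected)
table = {
    "mitosis": (800, 80),
    "apoptosis": (400, 40),
    "cell proliferation": (100, 30),
    "glucose transport": (50, 20),
}

p_values = {
    process: enrichment_p(1000, on_array, 100, observed)
    for process, (on_array, observed) in table.items()
}
for process, p in p_values.items():
    expected = table[process][0] * 100 / 1000
    print(f"{process}: expected {expected:.0f}, p = {p:.3g}")
```

Mitosis and apoptosis match their expected counts, so their p-values are large, while cell proliferation and glucose transport give very small p-values, in line with the slide's point that the most frequent category is not necessarily the enriched one.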

Other sources of annotation
- UniProt (Swiss-Prot) keywords
- Protein domain databases: Pfam, PANTHER, PDB, PROSITE, etc.
- GeneDB summaries from NCBI
- Protein-protein interaction databases
- Pathway databases: KEGG, BioCarta, BBID, Reactome
DAVID incorporates annotation from all of these and clusters the redundant terms

Today in computer lab
- Tutorial on using DAVID
- Tutorial on using WebGestalt
- Analysis of gene lists using DAVID and WebGestalt