Lecture 4: Gene Annotation & Gene Ontology June 11, 2015.

Gene lists in a manuscript
Official gene ID, symbol and name
Fold-change
Additional annotation
  –Role in cell, protein domain, predicted function, etc.

p38MAPK-dependent factors are expressed in the TME of breast cancer (BC) lesions. Elise Alspach et al. Cancer Discovery 2014;4:

Gene lists from RNAseq analysis
What do you do with a list of hundreds of genes that contains only the following information?
Gene name or symbol
Ratio between groups (UP or DOWN)
One or more database IDs (accession numbers)
How do you figure out the role of the genes in the model you are studying?

Sequence databases
[Diagram relating lab researchers ("Joe/Jill lab geek", "Hans/Heidi lab geek") to the DNA sequence and protein databases]
NCBI side: GenBank, GenPept, RefSeq, Gene DB, NCBI NR
EMBL-EBI side: EMBL-EBI, TrEMBL, UniProtKB/TrEMBL, UniProtKB/Swiss-Prot
Processing steps shown: automatic translation; error correction & limited annotation; expert annotation from literature; removal of seqs >90% identical
Also shown: ENSEMBL

Most common database IDs
RefSeq records (NCBI)
  –NM_ (mRNA) & NP_ (proteins), ~61 million records
Ensembl records (EBI)
  –ENSG (gene), ENST (transcript), ENSP (protein)
Gene database (NCBI Gene, also called Entrez Gene)
  –Gene DB IDs are all digits, ranging in length from 2 to 10 digits
  –Started with human genes, ~11 million records
UniProt database (EMBL-EBI, SIB and PIR)
  –Q####, P####, etc.
  –Focused on high-quality annotations of proteins: ~550,000 expert-annotated records out of ~50 million proteins in TrEMBL
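A gene list often mixes several of these ID types, so a first step is to guess which database each ID comes from. The sketch below does this from the ID's shape alone. The function name, the labels, and the example IDs are illustrative assumptions (the IDs are format examples, not looked-up records). The Ensembl patterns cover only the human-style prefixes named on the slide, and the UniProt pattern follows UniProt's published accession format.

```python
import re

# (label, pattern) pairs, checked in order; the first full match wins.
ID_PATTERNS = [
    ("RefSeq mRNA", re.compile(r"NM_\d+(\.\d+)?")),
    ("RefSeq protein", re.compile(r"NP_\d+(\.\d+)?")),
    ("Ensembl gene", re.compile(r"ENSG\d+(\.\d+)?")),
    ("Ensembl transcript", re.compile(r"ENST\d+(\.\d+)?")),
    ("Ensembl protein", re.compile(r"ENSP\d+(\.\d+)?")),
    # NCBI Gene (Entrez Gene) IDs consist only of digits.
    ("NCBI Gene", re.compile(r"\d+")),
    # UniProt accession format (6 or 10 characters).
    ("UniProt accession", re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
    )),
]


def classify_id(identifier: str) -> str:
    """Return a best-guess database label for one ID, or 'unknown'."""
    text = identifier.strip()
    for label, pattern in ID_PATTERNS:
        if pattern.fullmatch(text):
            return label
    return "unknown"


if __name__ == "__main__":
    # Format examples only; a gene symbol such as BEST1 matches no pattern.
    for example in ["NM_000001.1", "NP_000001", "ENSG00000000001",
                    "12345", "P12345", "BEST1"]:
        print(example, "->", classify_id(example))
```

A shape check like this only sorts IDs into types; converting between types still needs one of the conversion tools.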

Identifying the genes
For most downstream analyses, you will want to use a database ID rather than a gene symbol
Why?
  –Database IDs (Gene, RefSeq, UniProt) are more stable
  –Less likely to be misinterpreted
  –The same gene symbol can be used in more than one organism
Which database ID?
  –NCBI Gene DB (or Entrez Gene ID)
  –UniProt (only proteins)
  –RefSeq (redundancy from transcript isoforms)
  –Ensembl, less commonly used

Converting database IDs
bioDBnet (db2db): all types of gene products
EBI BioMart (Tools -> id conversion): biased towards protein-coding
DAVID bioinformatics (Gene ID conversion): biased towards protein-coding

Types of genes
RNAseq is not biased towards protein-coding genes, so you will get data from non-coding RNAs, pseudogenes and others.
You can obtain gene-type data from bioDBnet and use Excel to categorize your list by type of gene product
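The same categorization can be scripted instead of done in Excel. This is a minimal sketch, assuming a tab-separated export with one row per gene. The column names (`gene_id`, `symbol`, `gene_type`), the gene-type labels, and the rows themselves are hypothetical placeholders, not the actual bioDBnet output format.

```python
import csv
import io
from collections import Counter

# Hypothetical tab-separated export; names and labels are assumptions.
EXAMPLE_TABLE = (
    "gene_id\tsymbol\tgene_type\n"
    "1001\tGENE_A\tprotein-coding\n"
    "1002\tGENE_B\tncRNA\n"
    "1003\tGENE_C\tpseudo\n"
    "1004\tGENE_D\tprotein-coding\n"
)


def read_rows(handle):
    """Read a tab-separated table into a list of dicts keyed by header."""
    return list(csv.DictReader(handle, delimiter="\t"))


def count_gene_types(rows, type_column="gene_type"):
    """Count how many genes fall into each gene-product type."""
    return Counter(row[type_column] for row in rows)


def keep_type(rows, wanted="protein-coding", type_column="gene_type"):
    """Keep only the rows whose gene type equals `wanted`."""
    return [row for row in rows if row[type_column] == wanted]


if __name__ == "__main__":
    rows = read_rows(io.StringIO(EXAMPLE_TABLE))
    print(count_gene_types(rows))
    print([row["symbol"] for row in keep_type(rows)])
```

With a real file, `io.StringIO(EXAMPLE_TABLE)` would be replaced by an opened file handle and the column names adjusted to match the export.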

Gene annotation
Process of assigning descriptions to a transcript or gene product. Includes:
  –Official gene symbol & name
  –Protein features: domains, functional elements such as nuclear localization signals
  –Predicted molecular function, biological process and cellular location
  –Experimentally derived information on function, process and cellular location
  –References
  –....

Who does the gene annotation?
RefSeq & Gene databases
  –NCBI staff
Ensembl databases
  –EMBL & Wellcome Trust at Sanger Institute
UniProt
  –Staff at European Bioinformatics Institute (EBI), Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR)
Yeast DB, FlyBase, Mouse Genome Informatics (MGI) & other organism-specific databases

Gene record for BEST1

Ensembl Gene record for BEST1

Uniprot record for BEST1

Gene, Ensembl or UniProt?
What information are you looking for?
Comfort level with the interface
All have a little to LOTS of information
Use as a starting point

Dealing with gene lists
How can you efficiently categorize the genes in some biologically meaningful way?
Batch download data from Gene or UniProt and do a lot of reading? PubMed?
One approach is to use metadata in the form of terms assigned to each gene that describe its molecular function, its participation in a biological process and its location in a cellular component

Gene Ontology
Set of standard biological phrases (terms) which are applied to genes/proteins:
  –protein kinase
  –apoptosis
  –membrane
Attempt to standardize the representation of gene and gene product attributes across species and databases
Maintained by the Gene Ontology Consortium
  –Individual groups contribute taxon-specific terms

Cellular Component Where a gene product acts Mitochondria

Cellular Component Cellular components of a virus differ from those of a cell

Cellular Component Enzyme complexes in the component ontology refer to places, not activities.

Molecular Function Activities or “jobs” of a gene product glucose-6-phosphate isomerase activity

Molecular Function insulin binding insulin receptor activity

Molecular Function A gene product may have several functions Sets of functions make up a biological process.

Biological Process a commonly recognized series of events cell division

Biological Process transcription

Biological Process regulation of gluconeogenesis

Biological Process limb development

Biological Process courtship behavior

Why use gene ontology?
Allows biologists to make queries across large numbers of genes without researching each one individually
Can find all the PI3 kinases in a given genome or find all proteins involved in oxidative stress response without prior knowledge of every gene

From the Ex 2 gene list
BEST1
  –Bestrophin 1
  –What is its role in the cell?
Gene ontology biological process:
  –Chloride transmembrane transport
  –Regulation of calcium ion transport
  –Visual perception
GO molecular function:
  –Chloride channel activity
GO cellular component:
  –Basolateral plasma membrane
  –Chloride channel complex

CCL23 from Ex 2 list
Chemokine (C-C motif) ligand 23
Function:
  –chemokine receptor binding
Processes include:
  –G-protein coupled receptor signaling
  –Cellular calcium homeostasis
  –Monocyte chemotaxis
Component:
  –Extracellular space

Generally, biological process terms are more useful for putting gene lists into a context
There are more GO terms assigned to process than to function or component
Fewest terms are assigned to component
Function in the absence of any process information can still imply a biological role, e.g. when you are looking for transcription factors responsible for some response

Ontology Structure
Terms are linked by two relationships
  –is-a
  –part-of

Ontology Structure
[Diagram: example terms (cell, membrane, chloroplast, mitochondrial membrane, chloroplast membrane) linked by is-a and part-of relationships]

GO structure
GO isn’t just a flat list of biological terms
Terms are related within a hierarchy
is_a: DNA binding is a type of nucleic acid binding. Nucleic acid binding is a type of binding.

GO structure
A single gene (gene A in the figure) associated with a particular term is automatically annotated to all of the parent terms
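The automatic annotation to parent terms can be illustrated with a tiny hand-built hierarchy. This is only a sketch: the `PARENTS` table holds just the three binding terms named in these slides, uses term names rather than GO IDs, and the function names are illustrative rather than part of any GO tool.

```python
# Child term -> list of (relationship, parent term).
# Only the three terms from the is_a example are included.
PARENTS = {
    "DNA binding": [("is_a", "nucleic acid binding")],
    "nucleic acid binding": [("is_a", "binding")],
    "binding": [],
}


def ancestors(term, parents=PARENTS):
    """Collect every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(direct_annotations, parents=PARENTS):
    """Extend each gene's direct terms with all of their parent terms."""
    full = {}
    for gene, terms in direct_annotations.items():
        expanded = set(terms)
        for term in terms:
            expanded |= ancestors(term, parents)
        full[gene] = expanded
    return full


if __name__ == "__main__":
    # "gene A" is annotated directly to the most specific term only.
    print(propagate({"gene A": {"DNA binding"}}))
```

Because the parent terms are filled in this way, a query for a broad term such as binding also returns genes annotated only to a more specific child term.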

GO structure
This means genes can be grouped according to user-defined levels
Allows a broad overview of a gene set or genome
You can use the level of granularity that makes the most sense

GO terms
Each concept has:
  –a name (term: transcription initiation)
  –a definition (definition: Processes involved in the assembly of the RNA polymerase complex at the promoter region of a DNA template resulting in the subsequent synthesis of RNA from that promoter.)
  –an ID number (id: GO:)

GO terms assigned to BEST1

Types of evidence codes Experimental:

Types of evidence codes Computational:

Types of evidence codes Other evidence codes

Manual annotation
In this study, we report the isolation and molecular characterization of the B. napus PERK1 cDNA, that is predicted to encode a novel receptor-like kinase. We have shown that like other plant RLKs, the kinase domain of PERK1 has serine/threonine kinase activity. In addition, the location of a PERK1-GFP fusion protein to the plasma membrane supports the prediction that PERK1 is an integral membrane protein…these kinases have been implicated in early stages of wound response…
Molecular function, Cellular component, Biological process

GO terms assigned to BEST1

Gene record for BEST1

Electronic Annotation
Annotation derived without human validation
  –mapping files, e.g. interpro2go, ec2go
  –BLAST search ‘hits’
Lower ‘quality’ than manual codes
Used in non-model organisms
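Because electronic annotations are lower quality than manual ones, it can help to separate a gene's annotations by evidence code before interpreting them. The sketch below assumes the standard GO evidence codes: IEA for electronic annotation, and EXP, IDA, IPI, IMP, IGI and IEP for experimental evidence. The codes on the evidence-code slides themselves are not preserved in this transcript. The gene names and annotation tuples are made-up placeholders.

```python
# Standard GO evidence codes (an assumption relative to this transcript).
EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"}
ELECTRONIC_CODES = {"IEA"}

# Hypothetical annotations: (gene, GO term name, evidence code).
EXAMPLE_ANNOTATIONS = [
    ("GENE_A", "protein kinase", "IDA"),
    ("GENE_A", "membrane", "IEA"),
    ("GENE_B", "apoptosis", "IMP"),
    ("GENE_B", "protein kinase", "ISS"),
]


def split_by_evidence(annotations):
    """Group annotations into experimental, electronic and other."""
    groups = {"experimental": [], "electronic": [], "other": []}
    for gene, term, code in annotations:
        if code in EXPERIMENTAL_CODES:
            groups["experimental"].append((gene, term, code))
        elif code in ELECTRONIC_CODES:
            groups["electronic"].append((gene, term, code))
        else:
            groups["other"].append((gene, term, code))
    return groups


if __name__ == "__main__":
    for group, items in split_by_evidence(EXAMPLE_ANNOTATIONS).items():
        print(group, items)
```

For non-model organisms, dropping the electronic group may leave very few annotations, so the split is best used to weigh evidence rather than to discard it outright.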

GO & analysis of gene lists
Many tools exist that use GO to find common biological functions from a list of genes
WebGestalt, gProfiler, Onto-Express, and GSEA, to name a few
Partek Genomics Suite (PGS) has built-in GO enrichment
We’ll use PGS and either the web-based WebGestalt or gProfiler as a comparison

GO tools
Input a gene list
Shows which GO categories have the most genes associated with them or are “enriched”
Provides a statistical measure to determine whether enrichment is significant

Using GO in practice
statistical measure
  –how likely it is that your differentially regulated genes fall into that category by chance
microarray: 1000 genes -> experiment -> 100 genes differentially regulated
mitosis – 80/100
apoptosis – 40/100
cell proliferation – 30/100
glucose transport – 20/100

Using GO in practice
However, when you look at the distribution of all genes on the microarray:
Process             Genes on array   # genes expected (out of 100)   # genes observed
Mitosis             800/1000         80                              80
Apoptosis           400/1000         40                              40
Cell proliferation  100/1000         10                              30
Glucose transport   50/1000          5                               20
Proportions analysis
  –Chi-squared or Fisher’s exact test
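The proportions analysis can be worked through for the example numbers on these two slides. The sketch below computes the expected count for each process and a one-sided enrichment p-value from the hypergeometric distribution, which is the calculation behind a one-sided Fisher's exact test. It assumes the array totals are out of the 1000 genes on the microarray. The function names are illustrative, and real tools typically also correct for testing many GO terms at once.

```python
from math import comb

ARRAY_GENES = 1000      # genes on the microarray
REGULATED_GENES = 100   # differentially regulated genes

# process -> (genes on array in that process, genes observed in the list)
EXAMPLE_COUNTS = {
    "mitosis": (800, 80),
    "apoptosis": (400, 40),
    "cell proliferation": (100, 30),
    "glucose transport": (50, 20),
}


def expected_count(list_size, category_size, array_size):
    """Genes expected in the list if membership were purely by chance."""
    return list_size * category_size / array_size


def enrichment_p_value(observed, list_size, category_size, array_size):
    """P(X >= observed) when list_size genes are drawn without replacement
    from array_size genes, category_size of which are in the category."""
    upper = min(list_size, category_size)
    total = comb(array_size, list_size)
    tail = sum(
        comb(category_size, i) * comb(array_size - category_size, list_size - i)
        for i in range(observed, upper + 1)
    )
    return tail / total


if __name__ == "__main__":
    for process, (on_array, observed) in EXAMPLE_COUNTS.items():
        expected = expected_count(REGULATED_GENES, on_array, ARRAY_GENES)
        p = enrichment_p_value(observed, REGULATED_GENES, on_array, ARRAY_GENES)
        print(f"{process}: expected {expected:g}, observed {observed}, p = {p:.3g}")
```

Mitosis and apoptosis have the most genes in the list, but their observed counts equal what chance alone predicts. Only cell proliferation and glucose transport exceed their expected counts, which is why the raw counts alone are misleading.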

Other sources of annotation
UniProt (Swiss-Prot) keywords
Protein domain databases
  –PFAM, Panther, PDB, PROSITE, etc.
Gene DB summaries from NCBI
Protein-protein interaction databases
Pathway databases
  –KEGG, BioCarta, BBID, Reactome
DAVID incorporates annotation from all of these and clusters the redundant terms

Today & next Tuesday in computer lab
Managing gene lists with various online database tools
Filtering your gene list from Ex. 2 so that you have only protein-coding genes and the database IDs or accession numbers you need for later analyses
Tutorial on different tools for GO enrichment analysis
Conduct GO enrichment on your list of genes using PGS, DAVID and one other GO tool (web-based)