Download presentation
Presentation is loading. Please wait.
Published byMyrtle Kelly Modified over 9 years ago
1
Lecture 4: Gene Annotation & Gene Ontology June 11, 2015
2
Gene lists in a manuscript Official gene ID, symbol and name Fold-change Additional annotation –Role in cell, protein domain, predicted function, ect
3
p38MAPK-dependent factors are expressed in the TME of breast cancer (BC) lesions. Elise Alspach et al. Cancer Discovery 2014;4:716-729
4
Gene lists from RNAseq analysis What do you do with a list of 100s of genes that contain only the following information? Gene name or symbol Ratio between groups (UP or DOWN) One or more database IDs (accession numbers) How do you figure out the role of the genes in the model you are studying?
5
Sequence databases GenbankEMBL-EBI GenPeptrEMBL Joe/Jill lab geek Hans/Heidi lab geek automatic translation error correction & limited annotation RefSeqUniprotKB/ trEMBL Expert annotation from literature Gene DBUniprotKB/ SwissProt Removal of seqs >90% identical NCBI NR DNA sequence DNA sequence Proteins ENSEMBL
6
Most common database IDs Refseq records (NCBI) –NM_ (mRNA) & NP_ (proteins), ~61 million records ENSEMBL records (EBI) –ENSG (gene), ENST (transcript), ENSP (proteins) Gene database (NCBI or Entrez) –Gene DB IDs are all digits, ranging in length from 2->10 –Started with human genes, ~11 million records Uniprot database (EMBL-EBI, SIB and PIR) –Q####, P####, ect; –Focused on high-quality annotations of proteins ~550,000 out of ~50 million proteins in TrEMBL
7
Identifying the genes For most downstream analyses, you will want to use a database ID rather than gene symbols Why? –Database IDs (Gene, Refseq, Uniprot) more stable –Less likely to be misinterpreted –Same gene symbol used in more than one organism Which database ID? –NCBI Gene DB (or Entrez Gene ID) –Uniprot (only proteins) –Refseq (redundancy from transcript isoforms) –ENSEMBL, less commonly used
8
Converting database IDs bioDBnet (db2db) http://biodbnet.abcc.ncifcrf.gov/ All types of gene products EBI BioMart (Tools -> id conversion) http://www.biomart.org/ Biased towards protein-coding DAVID bioinformatics (Gene ID conversion) http://david.abcc.ncifcrf.gov/ Biased towards protein-coding
9
Types of genes RNAseq not biased towards protein-coding genes, so you will get data from non-coding RNA, pseudo- genes and others. Can obtain data from bioDBnet and use Excel to categorize your list by type of gene product
10
Gene annotation Process of assigning descriptions to a transcript or gene product. Includes: –Official gene symbol & name –Protein features: domains, functional elements such as nuclear localization signals –Predicted molecular function, biological process and cellular location –Experimentally derived information function, process and cellular location –References –....
11
Who does the gene annotation? Refseq & Gene databases –NCBI staff Ensemble databases –http://useast.ensembl.orghttp://useast.ensembl.org –EMBL & Welcome Trust at Sanger Institute Uniprot –Staff at European Bioinformatics Institute (EBI), Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR) Yeast DB, FlyBase, Mouse Genome Informatics (MGI) & other organism specific databases
12
Gene record for BEST1
13
Ensembl Gene record for BEST1
14
Uniprot record for BEST1
15
Gene, Ensembl or Uniprot? What information are you looking for? Comfort level with the interface All have a little to LOTS of information Use as a starting point
16
Dealing with gene lists How can you efficiently categorize the genes in in some biologically meaningful way? Batch download data from Gene or Uniprot and do a lot of reading? PubMed? One approach is to use meta-data in the form of terms assigned to each gene that describe its molecular function, participation in a biological process and its location in a cellular component
17
Gene Ontology Set of standard biological phrases (terms) which are applied to genes/proteins: –protein kinase –apoptosis –Membrane Attempt to standardize the representation of genes and gene product attributes across species and databases Maintained by Gene Ontology consortium –http://geneontology.org/http://geneontology.org/ –Individual groups contribute taxonomic specific terms
18
Cellular Component Where a gene product acts Mitochondria
19
Cellular Component Cellular components of a virus different than a cell
20
Cellular Component Enzyme complexes in the component ontology refer to places, not activities.
21
Molecular Function Activities or “ jobs ” of a gene product glucose-6-phosphate isomerase activity
22
Molecular Function insulin binding insulin receptor activity
23
Molecular Function A gene product may have several functions Sets of functions make up a biological process.
24
Biological Process a commonly recognized series of events cell division
25
Biological Process transcription
26
Biological Process regulation of gluconeogenesis
27
Biological Process limb development
28
Biological Process courtship behavior
29
Why use gene ontology? Allows biologists to make queries across large numbers of genes without researching each one individually Can find all the PI3 kinases in a given genome or find all proteins involved in oxidative stress response without prior knowledge of every gene
30
From the Ex 2 gene list BEST1 –Bestrophin 1 –What is its role in the cell? Gene ontology biological process: –Chloride transmembrane transport –Regulation calcium ion transport –Visual perception GO molecular function: –Chloride channel activity GO cellular component –Basolateral plasma membrane –Chloride channel complex
31
CCL23 from Ex2 list Chemokine (C-C motif) ligand 23 Function: –chemokine receptor binding Processes include: –G-protein coupled receptor signaling –Cellular calcium homeostasis –Monocyte chemotaxis Component: –Extracellular space
33
Generally biological process terms are more useful for putting gene lists into a context There are more GO terms assigned to process than to function or component Fewest terms assigned to component Function in the absence of any process information can imply a biological role – i.e. you are looking for transcription factors responsible for some response
34
Ontology Structure Terms are linked by two relationships –is-a –part-of
35
Ontology Structure cell membrane chloroplast mitochondrial chloroplast membrane is-a part-of
36
is_a DNA binding is a type of nucleic acid binding. GO structure GO isn’t just a flat list of biological terms terms are related within a hierarchy Nucleic acid binding is a type of binding.
37
GO structure gene A A single gene associated with with a particular term is automatically annotated to all of the parent terms
38
GO structure This means genes can be grouped according to user-defined levels Allows broad overview of gene set or genome You can use the level of granularity that makes most sense
39
GO terms a name term: transcription initiation definition: Processes involved in the assembly of the RNA polymerase complex at the promoter region of a DNA template resulting in the subsequent synthesis of RNA from that promoter. a definition id: GO:0006352 an ID number Each concept has:
40
GO terms assigned to BEST1
41
Types of evidence codes Experimental:
42
Computational: Types of evidence codes
43
Other evidence codes Types of evidence codes
44
Manual annotation In this study, we report the isolation and molecular characterization of the B. napus PERK1 cDNA, that is predicted to encode a novel receptor-like kinase. We have shown that like other plant RLKs, the kinase domain of PERK1 has serine/threonine kinase activity, In addition, the location of a PERK1-GTP fusion protein to the plasma membrane supports the prediction that PERK1 is an integral membrane protein…these kinases have been implicated in early stages of wound response… Molecular function Cellular component Biological process
45
GO terms assigned to BEST1
46
Gene record for BEST1
47
Electronic Annotation Annotation derived without human validation –mappings file e.g. interpro2go, ec2go. –Blast search ‘ hits ’ Lower ‘ quality ’ than manual codes Used in non-model organisms
48
GO & analysis of gene lists Many tools exist that use GO to find common biological functions from a list of genes WebGestalt, gProfiler, Onto-Express, and GSEA to name a few Partek Genomics Suite has built-in GO enrichment We’ll use PGS and either the web-based WebGestalt or gProfiler as a comparison
49
GO tools input a gene list shows which GO categories have most genes associated with them or are “enriched” provides a statistical measure to determine whether enrichment is significant
50
Using GO in practice statistical measure –how likely your differentially regulated genes fall into that category by chance microarray 1000 genes experiment100 genes differentially regulated mitosis – 80/100 apoptosis – 40/100 Cell proliferation – 30/100 glucose transport – 20/100
51
Using GO in practice However, when you look at the distribution of all genes on the microarray: Proportions analysis –Chi-squared or Fisher’s exact test ProcessGenes on array # genes expected (out of 100) # genes observed Mitosis800/100080 Apoptosis400/100040 Cell proliferation100/10001030 Glucose transport50/1000520
52
Other sources of annotation Uniprot (Swiss-Prot) keywords Protein domain databases –PFAM, Panther, PDB, PROSITE, ect GeneDB summaries from NCBI Protein-protein interactions databases Pathway databases –KEGG, BioCarta, BBID, Reactome DAVID incorporates annotation from all of these and clusters the redundant terms
53
Today & next Tuesday in computer lab Managing gene lists with various online database tools Filtering your gene list from Ex. 2 so that you have only protein-coding genes and the database IDs or accession numbers you need for later analyses Tutorial on different tools for GO enrichment analysis Conduct GO enrichment on your list of genes using PGS, DAVID and one other GO tool (web based)
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.