Lecture Outline
Introduction
Data mining sources:
– GO, InterPro, KEGG, UniProt
Tools to do the data mining:
– FatiGO
– FatiWISE


Data mining Microarray results
Microarray experiments are done to answer a biological question
Results are sets of numbers (intensities), which are then clustered to find data points of interest
These do not necessarily answer the research question by themselves; they first need to be converted into biological information

Purpose of data mining
Validation of results
– understanding why these genes are grouped together
Using biological information to find significant associations of biological terms with sets of genes
Understanding the roles of the genes at the molecular level

Data mining (1): Add gene identifiers
-AB SB AA AC AB

Data mining (2): Add gene descriptions
-AB SB AA AC AB
-RNA polymerase -Glycosyl hydrolase -Phosphofructokinase -Transcription factor -Glucose transporter

Data mining (3): Add GO terms
-AB SB AA AC AB
-RNA polymerase -Glycosyl hydrolase -Phosphofructokinase -Transcription factor -Glucose transporter
-GO GO GO GO GO

Data mining (4): Add functional annotation
-AB SB AA AC AB
-RNA polymerase -Glycosyl hydrolase -Phosphofructokinase -Transcription factor -Glucose transporter
-GO GO GO GO GO

Data mining (5): Store results in database, map onto pathways
-AB SB AA AC AB
-RNA polymerase -Glycosyl hydrolase -Phosphofructokinase -Transcription factor -Glucose transporter
-GO GO GO GO GO

Sources of biological information
Free text: e.g. Medline
– Using text processing tools
Curated repositories: e.g. GO, KEGG, UniProt, InterPro etc.
– Using data mining
– Using tools e.g. FatiGO and FatiWISE

Free text mining
Advantages:
– Vast amounts of data
– Many associated terms for each gene
Disadvantages:
– Synonyms and acronyms
– Context information
– Irrelevant terms
– Need to divide into entities and relationships to structure text

Example of problems
The Sch9 protein kinase regulates Hsp90-dependent signal transduction activity in the budding yeast Saccharomyces cerevisiae. This interaction was suppressed by decreased signaling through the protein kinase A (PKA) signal transduction pathway.
Text is unstructured
– needs to be divided into entities and relationships

Example of problems
The Sch9 protein kinase regulates Hsp90-dependent signal transduction activity in the budding yeast Saccharomyces cerevisiae. This interaction was suppressed by decreased signaling through the protein kinase A (PKA) signal transduction pathway.
Labels on the text: Protein; Verb; Pathway; Acronym (could be used elsewhere for a different gene); Organism; Negative term used
Some problems are overcome using statistics and better detection of entities and relationships

Curated repositories
These have reliable annotation
Annotation is standardised
They are usually well structured
However, they usually have less annotation
Examples: GenBank, GO (FatiGO), UniProt, InterPro, KEGG (FatiWISE)

Gene Ontology (GO)
Many annotation systems are organism-specific or use different levels of granularity
GO introduced a standard vocabulary, first used for mouse, fly and yeast, but now generic
An ontology is a formal specification of terms and the relationships between them

GO Ontologies
Molecular function: tasks performed by a gene product
– e.g. G-protein coupled receptor
Biological process: broad biological goals accomplished by one or more gene products
– e.g. G-protein signaling pathway
Cellular component: part(s) of a cell of which a gene product is a component; includes the extracellular environment of cells
– e.g. nucleus, membrane etc.

GO relationships
“is-a” e.g. mitochondrial membrane is a membrane
“part of” e.g. nuclear membrane is part of nucleus
DAG (directed acyclic graph) structure
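The two relationship types and the DAG structure can be illustrated with a minimal sketch. The term names and most edges below form a made-up toy fragment (only "mitochondrial membrane is a membrane" and "nuclear membrane is part of nucleus" come from the slide), and `PARENTS` and `ancestors` are hypothetical names, not part of any GO tool.

```python
# Toy GO-like fragment: child term -> list of (relationship, parent term).
# Only two edges come from the slide; the rest are illustrative assumptions.
PARENTS = {
    "mitochondrial membrane": [("is_a", "membrane")],
    "nuclear membrane": [("is_a", "membrane"), ("part_of", "nucleus")],
    "membrane": [("is_a", "cellular component")],
    "nucleus": [("is_a", "cellular component")],
    "cellular component": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` via is_a / part_of edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in parents.get(current, []):
            if parent not in seen:  # a DAG node can be reached by several paths
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("nuclear membrane")))
```

Because a term may have more than one parent, the structure is a graph rather than a tree, which is why the traversal tracks the terms it has already visited.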

Current Mappings to GO
Consortium mappings
– MGD, SGD, RGD, FlyBase, TAIR
GOA (Gene Ontology Annotation):
– Swiss-Prot keywords
– EC numbers
– InterPro entries
– Manual mappings
Unigene
Medline ID mappings, etc.
FatiGO
Evidence codes NB

GO Slim
“Slimmed down” version of the GO ontologies
Selection of high-level terms covering all or most biological functions, processes and cell locations
Many different GO Slims are available, with different depths and detail
Used to make comparisons between annotated gene/protein sets easier (each gene may be mapped to a different granularity)

Applications of GO slim

GO consortium page

UniProt annotation
Protein sequence database from EMBL translations and direct sequencing
Structured into specific fields e.g. description, comments, feature table, keywords
Each field may have controlled vocabulary or specific syntax
Swiss-Prot is well annotated; TrEMBL is not, and may have less structured text

Example Swiss-Prot entry Annotation

KEGG
Kyoto Encyclopedia of Genes and Genomes
– Molecular interaction networks in biological processes: PATHWAY database
– Genes and proteins: GENES/SSDB/KO databases
– Chemical compounds and reactions: COMPOUND/GLYCAN/REACTION databases
Includes most organisms and info on orthologues

Example KEGG entry

InterPro
Integrates protein signature databases e.g. Pfam, PROSITE, PRINTS etc.
Classifies proteins into families and domains and lists all UniProt proteins belonging to each
Provides annotation on the family/domain and links to 3D structure, GO, Enzyme Classification
Used to functionally characterise a protein

Example InterPro entry

FatiGO
Connecting microarray results with these biological data sources
– answers questions, e.g. do my differentially expressed genes have different functions?
FatiGO is used to extract relevant GO terms for a group of genes with respect to a set of reference genes (the rest)
Can be used to list proportions of GO terms in a set of genes

FatiGO data sources
Uses tables of correspondences between genes and their GO terms (human, mouse, Drosophila, yeast, worm and UniProt proteins)
– curated if possible
Uses genes from GenBank, UniProt (Swiss-Prot/TrEMBL), Ensembl etc.
Problem: lack of standardisation of names
– uses EBI xrefs to link them, and for other databases they use their own gene IDs
For GO associations they include GO evidence codes, e.g. IEA

Using the GO hierarchy
Different levels in the GO hierarchy can be chosen, depending on the specificity required
FatiGO suggests using level 3
– questionable?
The deeper you go (more specific)
– the fewer genes are annotated to the terms
Once the level is set, for each gene FatiGO moves up the hierarchy until the set level is reached
– increases the number of terms mapped to this level
– easier to find relevance in different distributions of GO terms
Repeated genes are counted once
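The idea of moving each annotation up to a chosen level can be sketched as follows. This is not FatiGO's code: the hierarchy is a made-up, tree-shaped fragment (real GO terms can have several parents), the root is counted as level 1, and dropping terms that sit above the chosen level is an assumption made only for this illustration. All names (`PARENT`, `level`, `lift_to_level`, `terms_at_level`) are hypothetical.

```python
# Hypothetical single-parent hierarchy: term -> parent (None marks the root).
PARENT = {
    "biological process": None,
    "cellular process": "biological process",
    "transport": "cellular process",
    "ion transport": "transport",
    "calcium ion transport": "ion transport",
}


def level(term):
    """Depth of a term, counting the root as level 1 (an assumption)."""
    depth = 1
    while PARENT[term] is not None:
        term = PARENT[term]
        depth += 1
    return depth


def lift_to_level(term, target):
    """Walk up from `term` to its ancestor at `target`; None if it is shallower."""
    depth = level(term)
    if depth < target:
        return None  # assumption: terms above the chosen level are dropped
    while depth > target:
        term = PARENT[term]
        depth -= 1
    return term


def terms_at_level(gene_terms, target):
    """Map each gene to the set of its terms lifted to the target level."""
    result = {}
    for gene, terms in gene_terms.items():  # dict keys: each gene counted once
        lifted = {lift_to_level(t, target) for t in terms}
        lifted.discard(None)
        result[gene] = lifted
    return result


annotations = {"geneA": ["calcium ion transport"], "geneB": ["ion transport"]}
print(terms_at_level(annotations, 3))
```

Both toy genes end up annotated to the same level-3 term, which shows how lifting increases the number of genes per term and makes the distributions easier to compare.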

How FatiGO works
Given two sets of genes and a selected GO level:
Retrieves GO terms for each gene at the selected level
Applies Fisher’s exact test for 2x2 contingency tables to compare the 2 sets of genes (to get p-values)
Extracts GO terms with significantly different distributions
After correcting for multiple testing, provides adjusted p-values for 3 tests:
– Step-down minP method (Westfall and Young)
– FDR independent (Benjamini & Hochberg)
– FDR arbitrary dependent (Benjamini & Yekutieli)
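For a single GO term, the test described above compares a 2x2 table of counts: genes with and without the term in each of the two sets. The sketch below is a plain-Python two-sided Fisher's exact test using only the standard library. It is an illustration of the test itself, not FatiGO's implementation, and the function names and example counts are assumptions.

```python
from math import comb


def hypergeom_pmf(x, row1, col1, total):
    """P(top-left cell == x) for a 2x2 table with fixed margins."""
    return comb(col1, x) * comb(total - col1, row1 - x) / comb(total, row1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    a: set-1 genes with the term      b: set-1 genes without it
    c: set-2 genes with the term      d: set-2 genes without it
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    lowest = max(0, row1 + col1 - total)
    highest = min(row1, col1)
    p_observed = hypergeom_pmf(a, row1, col1, total)
    p_value = 0.0
    for x in range(lowest, highest + 1):
        p = hypergeom_pmf(x, row1, col1, total)
        # Sum every table at least as unlikely as the observed one;
        # the small tolerance guards against floating-point ties.
        if p <= p_observed * (1 + 1e-7):
            p_value += p
    return min(1.0, p_value)


# Hypothetical counts: 30 of 50 genes carry the term in set 1, 10 of 50 in set 2.
print(fisher_exact_two_sided(30, 20, 10, 40))
```

One such p-value is computed per GO term, which is why the multiple-testing corrections listed above are needed afterwards.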

Testing sets of GO terms
[Figure: GO term percentages in Gene set 1 and Gene set 2. Transport: 60% vs 20% (significantly higher distribution in 1 than 2). Regulation: 20% (same distribution). Set 1 vs Set 2: observed difference and possible stronger differences.]

Multiple testing
P-value: the probability, under the null hypothesis, of obtaining the observed result or a more extreme one
Testing multiple null hypotheses (one per GO term) that there is no difference in the frequency of terms in each set
For 1 test, the type I error rate (probability of rejecting a true null hypothesis) is 0.05, but for multiple tests the chance of at least one false rejection increases
– Family-wise error rate: the probability that one or more of the rejected nulls are true
Multiple testing correction allows control of the family-wise error rate (FWER) and the false discovery rate (FDR)
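How quickly the family-wise error rate grows can be shown with a short calculation. The sketch assumes the tests are independent, which is a simplification (GO terms are related through the hierarchy), so the numbers are illustrative only; `family_wise_error_rate` is a hypothetical name.

```python
def family_wise_error_rate(alpha, n_tests):
    """Chance of at least one false rejection among n independent true nulls."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n_tests in (1, 10, 100):
    print(n_tests, round(family_wise_error_rate(0.05, n_tests), 3))
```

With one test the rate equals the per-test level, and it approaches 1 as the number of tested GO terms grows.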

Step-down minP method
Controls FWER
Procedure with a test statistic equivalent to Fisher's exact test for 2x2 contingency tables
No. of random permutations set at
Examines how many of the permuted p-values are smaller than the one under consideration
Adjusted p-value for hypothesis H is the level of the entire testing procedure at which H would be rejected, given the values of all test statistics involved

Controlling False Discovery Rate
Tends to be more liberal than controlling FWER
Controls the expected proportion of false rejections (type I errors) among the rejected hypotheses
Consider the proportion of erroneous rejections out of the total number of rejections; the average value of this proportion is the FDR
FDR control can assume independent test statistics or allow dependence between them; FatiGO gives:
– adjusted p-value using the FDR method of Benjamini & Hochberg: control of FDR under independence
– adjusted p-value using the FDR method of Benjamini & Yekutieli: control of FDR under arbitrary dependence structures
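The two FDR adjustments named above can be sketched in a few lines. This is the standard step-up form of the Benjamini & Hochberg adjustment, with the Benjamini & Yekutieli variant obtained by multiplying by the harmonic sum 1 + 1/2 + ... + 1/m. It is an illustration, not FatiGO's code; `fdr_adjust` and its `arbitrary_dependence` flag are hypothetical names.

```python
def fdr_adjust(pvalues, arbitrary_dependence=False):
    """Return FDR-adjusted p-values in the same order as the input.

    arbitrary_dependence=False -> Benjamini & Hochberg
    arbitrary_dependence=True  -> Benjamini & Yekutieli
    """
    m = len(pvalues)
    factor = 1.0
    if arbitrary_dependence:
        factor = sum(1.0 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0  # also caps adjusted values at 1
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        index = order[rank - 1]
        candidate = pvalues[index] * m * factor / rank
        running_min = min(running_min, candidate)  # enforce monotonicity
        adjusted[index] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
print(fdr_adjust(raw))
print(fdr_adjust(raw, arbitrary_dependence=True))
```

The Benjamini & Yekutieli values are never smaller than the Benjamini & Hochberg ones, reflecting the price paid for not assuming independence.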

Using FatiGO - Input
Search for Unigene cluster ID, or specific gene IDs
Input results from SotaTree or Pomelo
Or input an Excel or text file with a list of gene or protein IDs, each on a new line
Input reference set of genes
Select GO ontology and level (inclusive)
Select whether the multiple test should include adjusted p-values for the minP test

FatiGO interface (1)

FatiGO interface (2)

FatiGO output
FatiGO returns four columns: the unadjusted p-value (p-value from Fisher’s exact test without adjusting for multiple comparisons) and adjusted p-values based on the three methods
Results are ordered by increasing value of the adjusted p-value, facilitating the selection of GO terms with the most significant differences.
P-value of –some evidence, –strong evidence and < –very strong evidence against null

FatiGO example output
[Figure labels: Query set; Reference set; Unadjusted p-value; FDR (indep) adjusted; FDR (depend) adjusted]

Link to AmiGO

Other features of FatiGO
You can input a list of genes and extract the GO terms sorted by percentages
You can use GO results as a way to find differentially expressed genes
– see if, after correcting for multiple testing, some GO terms are overrepresented (provides more resolution where p-value has no meaning)
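Listing GO terms by percentage for a single gene set amounts to counting, for each term, the fraction of genes annotated to it. A minimal sketch with made-up gene names and terms follows; `go_term_percentages` is a hypothetical helper, not a FatiGO function.

```python
from collections import Counter


def go_term_percentages(gene_terms):
    """Return (term, percentage of genes annotated) sorted by descending percentage."""
    n_genes = len(gene_terms)
    # set(terms) ensures a gene contributes at most once per term.
    counts = Counter(term for terms in gene_terms.values() for term in set(terms))
    rows = [(term, 100.0 * count / n_genes) for term, count in counts.items()]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Hypothetical annotations for five genes.
example = {
    "gene1": ["transport"],
    "gene2": ["transport", "regulation"],
    "gene3": ["transport"],
    "gene4": ["metabolism"],
    "gene5": ["metabolism"],
}
for term, percentage in go_term_percentages(example):
    print(f"{term}: {percentage:.0f}%")
```

Percentages can sum to more than 100 because one gene may carry several terms.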

Percentages of GO terms within a set of genes

FatiWISE
Data mining to retrieve additional biological info on InterPro motifs, KEGG pathways and Swiss-Prot keywords
Uses Fisher's exact test for 2x2 contingency tables to compare two sets of genes and find significantly different distributions
Corrects for multiple testing to get adjusted p-values
Can get stats for one set of genes or compare 2 sets

FatiWISE input and output
Data sources: KEGG, InterPro, UniProt
Input:
– one or two sets of genes
– Selection of organism (for pathway)
Output:
– Unadjusted p-value
– Step-down minP adjusted p-value
– FDR (arbitrary dependent) adjusted p-value

FatiWISE interface

FatiWISE InterPro output

FatiWISE KEGG output

FatiWISE keyword output

Summary
Data mining is used to bring the biology into results
Curated data sources are the best for this, due to structure and controlled vocabulary
FatiGO and FatiWISE are simple web tools enabling data mining on 1 or 2 sets of genes
Exercises:

Websites for Annotation
WebGestalt:
FatiGO:

Websites for Sequence Analysis and Motif Finding
Martview:
TOUCAN: _Tutorial_Overview.html
SeqVista:
Mitra:
Spex:
Gene Expression Analysis: