Extracting Biological Information from Gene Lists

Extracting Biological Information from Gene Lists Simon Andrews, Laura Biggins, Boo Virk simon.andrews@babraham.ac.uk laura.biggins@babraham.ac.uk boo.virk@babraham.ac.uk v1.0

Analysis doesn’t end here! Biological material → Sample processing (isolation of DNA, RNA or proteins) → Sample for analysis → Analysis of processed sample (data acquisition: sequencing, microarray analysis, mass spectrometry) → Raw data file(s) → Data analysis (identification of genes, transcripts or proteins, using public databases) → Results table containing hits (genes, transcripts or proteins)

Why functional analysis? Advantages: Biological insight; Validation of experiment; Generation of new hypotheses. Limitations: Amount of information depends on the species; Will only find known/published links between genes; If working on something novel, the information available may be limited.

What this course covers Morning Introduction to Gene Lists Gene List Practical Coffee Presenting results Presenting Results Practical Afternoon Motif Searching Motif Searching Practical Coffee Networks and Interactions Network Practical Commercial tools

Gene Lists Types of gene list: Names of genes; Names of genes ordered by qualitative value; Names of genes ordered by quantitative value. Gene lists can be ranked: P-value; Other statistic; Ordered. Gene lists can be filtered: Cut-off point; Subset of genes.

Transforming Gene IDs Need to use the relevant ID to extract information from databases. The BioMart ID conversion tool (http://www.biomart.org/) allows us to do this easily and quickly online.

Download this data and import the transformed IDs into your table
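
The import step above can be sketched in a few lines of Python. This is only an illustration: the tab-delimited layout of the mapping export, the column names and the symbol XYZ1 are assumptions made for the example, not the actual BioMart output format.

```python
import csv
import io

# Hypothetical mapping export: one gene symbol and one UniProt accession per row.
mapping_tsv = "Gene\tUniProt\nTP53\tP04637\nCDK1\tP06493\nPOLE\tQ07864\n"

# Hypothetical results table keyed by gene symbol.
results = [
    {"Gene": "TP53", "Score": 125527},
    {"Gene": "CDK1", "Score": 113740},
    {"Gene": "XYZ1", "Score": 100},  # made-up symbol with no mapping
]


def load_mapping(text):
    """Read a tab-delimited mapping into a dict: gene symbol -> accession."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return {row["Gene"]: row["UniProt"] for row in reader}


def add_ids(rows, mapping):
    """Return copies of the rows with a UniProt column (empty if unmapped)."""
    return [{**row, "UniProt": mapping.get(row["Gene"], "")} for row in rows]


annotated = add_ids(results, load_mapping(mapping_tsv))
for row in annotated:
    print(row["Gene"], row["UniProt"] or "(no mapping)", row["Score"])
```

Genes that fail to map are kept with an empty ID rather than dropped, so they can be checked by hand.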

I have my gene list, what next? Hyperlinked table:
Gene    UniProt Name                                   Score   Reactome
TP53    P04637  Tumor Suppressor p53                   125527
CDK1    P06493  Cyclin-dependent kinase 1              113740
POLE    Q07864  DNA Polymerase Epsilon                 107190
KPNB1   Q14974  Importin subunit beta-1                35542
CHEK1   O14757  Serine/threonine-protein kinase Chk1   35271
AURKB   Q96GD4  Aurora kinase B                        30803
RPA2    P15927  Replication protein A 32 kDa subunit   22207
CDT1    Q9H211  DNA replication factor Cdt1            21735
MCMBP   Q9BTE3  MCM complex-binding protein            17811
TUBG1   P23258  Tubulin gamma-1 chain                  16895
RAN     P62826  GTP-binding nuclear protein Ran        16384
RANGRF  Q9HD47  Ran guanine nucleotide factor Mog1     15527
BLM     P54132  Bloom syndrome protein                 14883
PCNA    P12004  Proliferating Cell Nuclear Antigen     13982
SETD8   Q9NQR1  Pr-Set7                                13711
RCC1    P18754  Regulator of chromosome condensation   13302
MCM5    P33992  DNA replication licensing factor MCM5  12806
CDC25C  P30307  M-phase inducer phosphatase 3          12510
PLK1    P53350  Serine/threonine-protein kinase PLK1   10930
MZT1    Q08AG7  Mitotic-spindle organizing protein 1   9210

Hyperlinked tables Advantages: Easy to create, no special data-mining software needed; One-click direct access to relevant pages; Reference resource. However… Need to become familiar with the resources available and tailor hyperlinks to be specific for your organism and the questions being asked; Information on only one gene at a time in your gene set; Need to get the relevant resource IDs.
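
A hyperlinked table of this kind can be generated from the ID column with a small script. The sketch below is a hedged example: the UniProt URL template is an assumption and should be checked against the resource itself, and other resources would need their own templates and IDs.

```python
# Assumed URL template; confirm against the resource before relying on it.
URL_TEMPLATES = {"UniProt": "https://www.uniprot.org/uniprotkb/{id}"}

# Gene symbols and UniProt accessions taken from the example table.
genes = [("TP53", "P04637"), ("CDK1", "P06493"), ("POLE", "Q07864")]


def make_link(resource, identifier):
    """Fill the resource's URL template with one identifier."""
    return URL_TEMPLATES[resource].format(id=identifier)


def hyperlinked_rows(rows):
    """Return (gene, accession, url) tuples, one per input row."""
    return [(gene, acc, make_link("UniProt", acc)) for gene, acc in rows]


# Tab-delimited output can be pasted into a spreadsheet, which will
# usually turn the URL column into clickable links.
for gene, acc, url in hyperlinked_rows(genes):
    print(f"{gene}\t{acc}\t{url}")
```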

I have my gene list, what next? Annotated table:
Gene    UniProt Name                                   Score   PANTHER GO-Slim BP
CHEK1   O14757  Serine/threonine-protein kinase Chk1   35271   apoptotic process;nitrogen compound metabolic process;biosynthetic process;transcription from RNA polymerase II promoter;cellular protein modification process;cell cycle;cell communication;apoptotic process;response to stress;response to abiotic stimulus;regulation of transcription from RNA polymerase II promoter;regulation of cell cycle;chromatin organization
MCM5    P33992  DNA replication licensing factor MCM5  12806   cell cycle;cell communication
RCC1    P18754  Regulator of chromosome condensation   13302   cellular component movement;mitosis;chromosome segregation;cellular component morphogenesis;intracellular protein transport;cellular component organization
CDC25C  P30307  M-phase inducer phosphatase 3          12510   DNA replication;cell cycle
PLK1    P53350  Serine/threonine-protein kinase PLK1   10930   DNA replication;DNA repair;DNA recombination;cell cycle
TP53    P04637  Tumor Suppressor p53                   125527  glycogen metabolic process;protein phosphorylation;mitosis;cell communication
CDK1    P06493  Cyclin-dependent kinase 1              113740  nitrogen compound metabolic process;biosynthetic process;DNA replication;RNA metabolic process;cellular process;regulation of biological process;regulation of catalytic activity
BLM     P54132  Bloom syndrome protein                 14883   nucleobase-containing compound metabolic process;cell cycle;cell communication;RNA localization;intracellular protein transport;nuclear transport
RPA2    P15927  Replication protein A 32 kDa subunit   22207   nucleobase-containing compound metabolic process;mitosis;nucleobase-containing compound transport;regulation of catalytic activity
TUBG1   P23258  Tubulin gamma-1 chain                  16895   phosphate-containing compound metabolic process;cellular protein modification process;cell cycle
KPNB1   Q14974  Importin subunit beta-1                35542   phosphate-containing compound metabolic process;protein phosphorylation;cytokinesis;cell cycle;regulation of cell cycle;chromatin organization;cytoskeleton organization
MZT1    Q08AG7  Mitotic-spindle organizing protein 1   9210    protein targeting;nuclear transport
PCNA    P12004  Proliferating Cell Nuclear Antigen     13982
RAN     P62826  GTP-binding nuclear protein Ran        16384
POLE    Q07864  DNA Polymerase Epsilon                 107190
AURKB   Q96GD4  Aurora kinase B                        30803
MCMBP   Q9BTE3  MCM complex-binding protein            17811
CDT1    Q9H211  DNA replication factor Cdt1            21735
RANGRF  Q9HD47  Ran guanine nucleotide factor Mog1     15527
SETD8   Q9NQR1  Pr-Set7                                13711

Annotation Sources There are many databases for annotation sources, including (but not limited to): Gene Ontology (GO) (most popular); Pathways and Interactions; Protein domains; Co-expression; Transcription factor binding sites

What is Gene Ontology (GO)? A collaborative effort addressing the need for consistent descriptions of gene products across different databases. The GO project has three structured ontologies describing gene products independent of species: Biological Process (BP), Cellular Component (CC) and Molecular Function (MF).

GO Structure 3 GO domains (1, 2, 3), each with its own root ontology term. Terms run from general (parent, nearest the root) to specific (child).

Subsets of GO terms GO slim terms: Cut-down versions of the GO ontologies that contain a subset of terms from the GO resource. They give a broad overview of the ontology content without the detail of the specific, fine-grained terms. GO fat terms: Subset comprising more specific terms.

Pathway and Interactions Are specific pathways enriched in my list? What other genes are in this pathway? Which genes/gene products interact with my genes of interest? Databases include:

Protein Domain Can I find shared protein domains? What is the function of shared domain? Which other proteins share this domain? Databases include:

Co-expression Which genes are co-expressed? Automatic grouping of genes (rather than human curation (GO)) Databases include:

Annotated tables Advantages: Information on function from larger gene sets; Sort groups of genes (GO term, pathway, protein domain); Relatively easy to create; Reference resource. However… Need to become familiar with the resources specific to your research; Lots of information can be difficult to sort efficiently.

What does functional information tell me? Once you have functional information about a gene set, you can: Verify that genes implicated in the experiment are functionally relevant; Discover unexpected shared functions; Determine which functions are enriched in the gene set. How? Compare to a background list of genes.

What is a background list? In theory, any gene that could have been differentially expressed in your experiment. RNA-seq: all genes apart from those with fewer than 10-20 reads. Arrays: all genes on the array. ChIP-seq: any gene on the chip.
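
As a rough illustration of the RNA-seq case, the background can be built by filtering on read count. The counts, the gene names GENE_A and GENE_B, and the exact cut-off of 10 are assumptions chosen for the example; the slide only gives a range of 10-20 reads.

```python
MIN_READS = 10  # assumed cut-off, at the low end of the 10-20 read range

# Hypothetical read counts per gene.
read_counts = {"TP53": 850, "CDK1": 420, "PLK1": 12, "GENE_A": 3, "GENE_B": 0}

# Hypothetical list of hits from the experiment.
hits = ["TP53", "PLK1", "GENE_A"]


def build_background(counts, min_reads=MIN_READS):
    """Keep only genes that had enough reads to be detectable."""
    return {gene for gene, n in counts.items() if n >= min_reads}


background = build_background(read_counts)

# Every hit tested for enrichment should also be in the background.
usable_hits = [g for g in hits if g in background]
dropped = [g for g in hits if g not in background]

print("background:", sorted(background))
print("usable hits:", usable_hits)
print("dropped hits:", dropped)
```

Hits that fall outside the background are reported separately so the mismatch is visible rather than silently ignored.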

Choosing a Background List Which background list to use? Whole set of genes; Tissue/cell-specific genes; Manually made list, derived from your experiment and analysis (e.g. based on expression level). The choice of background list is where things are most likely to go wrong. Think about the experimental design and make the most appropriate comparison.

Statistics to test for enrichment Background: 13,101 genes on chip, of which 3005 are related to disease (3005/13,101 = 22.9%). Gene list: 747 genes, of which 260 are related to disease (260/747 = 34.8%) and 487 are not related to disease (487/747 = 65.2%). Are these proportions the same?

Lots of Statistical tests to choose from • Hypergeometric test • Fisher’s exact/Chi-squared • Binomial • Kolmogorov Smirnov • Permutation

Hypergeometric test Uses hypergeometric distribution to measure the probability of having drawn a specific number of successes (out of a total number of draws) from a population Example: Imagine that there are 4 green and 16 red marbles in a box. You close your eyes and draw 5 marbles without replacement What is the probability that exactly 2 of the 5 are green?
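
The marble example can be worked through directly from the definition of the hypergeometric distribution. The sketch below uses only the Python standard library; the function and parameter names are chosen for this illustration.

```python
from math import comb


def hypergeom_pmf(k, population, successes, draws):
    """P(exactly k successes) when drawing without replacement."""
    ways_to_pick_successes = comb(successes, k)
    ways_to_pick_failures = comb(population - successes, draws - k)
    return ways_to_pick_successes * ways_to_pick_failures / comb(population, draws)


# 4 green + 16 red marbles = 20 in the box; draw 5; ask for exactly 2 green.
p_two_green = hypergeom_pmf(2, population=20, successes=4, draws=5)
print(f"P(exactly 2 green) = {p_two_green:.4f}")  # 0.2167
```

The count behind this is C(4,2) * C(16,3) / C(20,5) = 6 * 560 / 15504, so the chance of drawing exactly 2 green marbles is a little under 22%.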

Are these proportions the same? Background: 13,101 genes on chip, of which 3005 map to disease (3005/13,101 = 22.9%). Gene list: 260/747 = 34.8% map to disease; 487/747 = 65.2% do not map to disease. What is the probability (p-value) that at least 260 genes (out of 747) map to disease, given that there are 3005 of those genes in the background (13,101 genes)?
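
For the numbers on this slide, an enrichment p-value is usually taken as the upper tail of the hypergeometric distribution, that is, the probability of seeing 260 or more disease genes by chance. The following is a minimal standard-library sketch under that assumption; it is not the implementation used by any particular tool.

```python
from fractions import Fraction
from math import comb


def hypergeom_upper_tail(k_min, population, successes, draws):
    """P(X >= k_min) for draws made without replacement."""
    k_max = min(draws, successes)
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(k_min, k_max + 1)
    )
    # Exact integer arithmetic first, converted to float only at the end.
    return float(Fraction(favourable, comb(population, draws)))


population = 13101   # genes on the chip (background)
successes = 3005     # background genes that map to disease
draws = 747          # genes in the gene list
observed = 260       # gene-list genes that map to disease

expected = draws * successes / population
p_value = hypergeom_upper_tail(observed, population, successes, draws)

print(f"expected by chance: {expected:.1f} genes")
print(f"observed: {observed} genes")
print(f"one-sided p-value: {p_value:.3g}")
```

About 171 disease genes would be expected by chance, so observing 260 gives a very small p-value.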

Hypergeometric test
Test            Input                 Sample size               Output   Specifics
Hypergeometric  Unranked/Ranked List  Large (5% of background)  P-value  Finite population – probability of success changes
Limitations: Assumes independence of categories. Result terms often include directly related terms: is there really evidence for both terms? Works better with larger samples (5% of background).

Based on Hypergeometric test:
Test            Input                 Sample size               Output   Specifics
Hypergeometric  Unranked/Ranked List  Large (5% of background)  P-value  Finite population – probability of success changes
Fisher’s Exact  Unranked/Ranked List  Small                     P-value  Can be used to compare 2 conditions as well as gene list to background; one-tailed or two-tailed
Binomial                              Large                              Does not assume finite population – probability of success remains the same
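
The difference between the finite-population (hypergeometric) and fixed-probability (binomial) models can be seen by running both on the same counts. This sketch reuses the disease example (260 of 747 list genes, 3005 of 13,101 background genes) and assumes a one-sided upper-tail test in both cases; function names are illustrative.

```python
from fractions import Fraction
from math import comb


def hypergeom_upper_tail(k_min, population, successes, draws):
    """P(X >= k_min), sampling without replacement (finite population)."""
    favourable = sum(
        comb(successes, k) * comb(population - successes, draws - k)
        for k in range(k_min, min(draws, successes) + 1)
    )
    return float(Fraction(favourable, comb(population, draws)))


def binomial_upper_tail(k_min, population, successes, draws):
    """P(X >= k_min) with a fixed success probability successes/population."""
    failures = population - successes
    numerator = sum(
        comb(draws, k) * successes**k * failures ** (draws - k)
        for k in range(k_min, draws + 1)
    )
    return float(Fraction(numerator, population**draws))


args = dict(population=13101, successes=3005, draws=747)
p_hyper = hypergeom_upper_tail(260, **args)
p_binom = binomial_upper_tail(260, **args)

print(f"hypergeometric p-value: {p_hyper:.3g}")
print(f"binomial p-value:       {p_binom:.3g}")
```

In the binomial model every draw has the same success probability, as if genes were drawn with replacement; in the hypergeometric model the probability shifts as genes are removed from the finite background.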

Limitations of Fisher’s Exact and Binomial test Neither accounts for variation in the number of genes annotated to individual terms/functions being tested, or in the number of terms/functions associated with individual genes. They therefore tend to over-estimate significance if the gene set has an unusually high number of annotations. Both assume independence of categories.

Lots of Statistical tests to choose from • Hypergeometric • Fisher’s exact/Chi-squared • Binomial • Kolmogorov-Smirnov • Permutation. Kolmogorov-Smirnov and permutation tests are used for ranked gene lists only. Output: enrichment scores (ES) for functions, which can then be translated into a p-value.
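
To give a feel for what an enrichment score on a ranked list measures, here is a simplified, unweighted Kolmogorov-Smirnov-style running sum. It is a sketch under stated assumptions: the step sizes, the gene set and the ranking are chosen for illustration, and real tools typically weight the steps and derive the p-value by permutation.

```python
def enrichment_score(ranked_genes, gene_set):
    """Largest deviation from zero of a running sum over the ranked list.

    The sum steps up by 1/hits at each gene in the set and down by
    1/misses at each gene outside it, so it always ends at zero.
    """
    hits = sum(1 for gene in ranked_genes if gene in gene_set)
    misses = len(ranked_genes) - hits
    if hits == 0 or misses == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for gene in ranked_genes:
        running += 1 / hits if gene in gene_set else -1 / misses
        if abs(running) > abs(best):
            best = running
    return best


# Genes ranked by score, highest first (order taken from the example table).
ranked = ["TP53", "CDK1", "POLE", "KPNB1", "CHEK1", "AURKB"]

# Hypothetical gene sets for illustration only.
top_heavy = {"TP53", "CDK1"}
bottom_heavy = {"CHEK1", "AURKB"}

print(enrichment_score(ranked, top_heavy))     # 1.0
print(enrichment_score(ranked, bottom_heavy))  # -1.0
```

A score near +1 means the set is concentrated at the top of the ranking, and a score near -1 means it is concentrated at the bottom.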

Multiple testing correction Error types in statistics:
Statistical decision  True state in gene list
                      Not overrepresented            Overrepresented
Significant           Type I error (False Positive)  Correct
Not significant       Correct                        Type II error (False Negative)
Traditionally, a test or a difference is said to be “significant” if the probability of type I error is: α ≤ 0.05

Probability of error increases from 5% to 14.3% Example: You want to compare 3 groups and you carry out 3 hypergeometric tests, each with a 5% level of significance (P<0.05). Probability of not making a type I error in one test = 95% = (1 – 0.05). Overall probability of no type I errors is: 0.95 * 0.95 * 0.95 = 0.857. Therefore the probability of at least one type I error is: 1 - 0.857 = 0.143 or 14.3%. If comparing 5 groups instead of 3, there are 10 pairwise tests and the multiple testing error rate is 40% (= 1 - (0.95)^n, where n is the number of tests). Solution for multiple comparisons: multiple testing correction.
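
The arithmetic in this example can be checked with a few lines of code. The sketch assumes independent tests and one test per pair of groups, which is what makes 3 groups give 3 tests and 5 groups give 10 tests.

```python
from math import comb


def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one type I error across independent tests."""
    return 1 - (1 - alpha) ** n_tests


def pairwise_comparisons(n_groups):
    """Number of distinct pairs of groups."""
    return comb(n_groups, 2)


for groups in (3, 5):
    n_tests = pairwise_comparisons(groups)
    rate = familywise_error_rate(n_tests)
    print(f"{groups} groups -> {n_tests} tests -> error rate {rate:.3f}")
# 3 groups -> 3 tests -> error rate 0.143
# 5 groups -> 10 tests -> error rate 0.401
```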

Multiple test corrections Bonferroni: Significance level (e.g. 0.05) / number of tests = new threshold. This is an over-correction if tests are correlated. Benjamini-Hochberg: Rank the p-values. Apply the most stringent correction to the most significant p-values, and the least stringent to the least significant.
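
Both corrections are short enough to sketch by hand. The p-values below are made up for illustration, and the Benjamini-Hochberg function follows the standard step-up procedure (p-value times number of tests divided by rank, then made monotone); statistical packages offer tested implementations that should be preferred in practice.

```python
def bonferroni_threshold(alpha, n_tests):
    """New per-test significance threshold."""
    return alpha / n_tests


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # most significant first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the least significant p-value (rank m) down to rank 1.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


pvalues = [0.01, 0.04, 0.03, 0.005]  # hypothetical p-values for 4 GO terms
alpha = 0.05

threshold = bonferroni_threshold(alpha, len(pvalues))
bonferroni_hits = [p for p in pvalues if p <= threshold]
bh_adjusted = benjamini_hochberg(pvalues)
bh_hits = [p for p, adj in zip(pvalues, bh_adjusted) if adj <= alpha]

print("Bonferroni threshold:", threshold)
print("significant after Bonferroni:", bonferroni_hits)
print("BH-adjusted p-values:", [round(a, 4) for a in bh_adjusted])
print("significant after BH:", bh_hits)
```

With these made-up values, Bonferroni keeps two of the four terms while Benjamini-Hochberg keeps all four, which shows how much more conservative Bonferroni can be.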

Statistical issues • We want to identify functions of maximal biological significance, BUT this is not perfectly correlated with statistical significance • Use p‐values as a tool to rank functions but don’t take them too literally • Need to correct for multiple testing

Tools for functional gene list analysis There are many different tools available, both free and commercial Popular web-based tools include:

PANTHER (Protein ANalysis THrough Evolutionary Relationships) http://www.pantherdb.org/ One of the most widely used online resources for gene function classification and genome-wide data analysis. PANTHER users have successfully analysed data from: Gene expression; Proteomics; Genome-wide association study (GWAS) experiments. PANTHER is part of the GO consortium, thus PANTHER annotation = up-to-date GO curation.

PANTHER for functional classification

Send list to > File Saves table in a tab delimited .txt file

PANTHER for statistics

Annotations from PANTHER include: GO-slim terms PANTHER “protein class” PANTHER “Pathway” terms Doesn’t cluster together genes with similar GO terms in table Statistics: Binomial test with Bonferroni multiple testing correction

DAVID https://david.ncifcrf.gov/ Gathers data from many different databases (this is customisable). Functional Clustering. Uses many annotations, including GO-Fat terms, a more specific set of GO terms. Statistics: Fisher’s Exact Test and multiple testing correction.

DAVID for functional classification

Functional Clustering The enrichment score is given for the whole cluster rather than for individual functions. In DAVID, anything above 2 or 3 is considered enriched.

Which DAVID tool should I use?

GOrilla http://cbl-gorilla.cs.technion.ac.il/

Which tool to use? Choose a tool that: – Includes your gene / probe identifiers – Includes your species – Has up‐to‐date annotation – Lets you define your background (if possible). Also: – Try a few different tools – Try gene lists of varying length