Canadian Bioinformatics Workshops

Slides:



Advertisements
Similar presentations
Gene Set Enrichment Analysis Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Advertisements

1 Statistics Achim Tresch UoC / MPIPZ Cologne treschgroup.de/OmicsModule1415.html
From the homework: Distribution of DNA fragments generated by Micrococcal nuclease digestion mean(nucs) = bp median(nucs) = 110 bp sd(nucs+ = 17.3.
Gene Set Enrichment Analysis Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Data mining with the Gene Ontology Josep Lluís Mosquera April 2005 Grup de Recerca en Estadística i Bioinformàtica GOing into Biological Meaning.
1 Statistical Inference H Plan: –Discuss statistical methods in simulations –Define concepts and terminology –Traditional approaches: u Hypothesis testing.
Clustering short time series gene expression data Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph BIOINFORMATICS, vol
Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005.
Using Gene Ontology Models and Tests Mark Reimers, NCI.
Differentially expressed genes
Gene Set Analysis 09/24/07. From individual gene to gene sets Finding a list of differentially expressed genes is only the starting point. Suppose we.
Biological Interpretation of Microarray Data Helen Lockstone DTC Bioinformatics Course 9 th February 2010.
Significance Tests P-values and Q-values. Outline Statistical significance in multiple testing Statistical significance in multiple testing Empirical.
On Comparing Classifiers: Pitfalls to Avoid and Recommended Approach Published by Steven L. Salzberg Presented by Prakash Tilwani MACS 598 April 25 th.
Gene Set Enrichment Analysis Petri Törönen petri(DOT)toronen(AT)helsinki.fi.
Different Expression Multiple Hypothesis Testing STAT115 Spring 2012.
Inferential Statistics
False Discovery Rate (FDR) = proportion of false positive results out of all positive results (positive result = statistically significant result) Ladislav.
Hypothesis Testing:.
1Module 2: Analyzing Gene Lists Canadian Bioinformatics Workshops
Multiple testing correction
Multiple testing in high- throughput biology Petter Mostad.
Hypothesis Testing.
Jeopardy Hypothesis Testing T-test Basics T for Indep. Samples Z-scores Probability $100 $200$200 $300 $500 $400 $300 $400 $300 $400 $500 $400.
Gene Set Enrichment Analysis (GSEA)
+ Chapter 9 Summary. + Section 9.1 Significance Tests: The Basics After this section, you should be able to… STATE correct hypotheses for a significance.
Essential Statistics in Biology: Getting the Numbers Right
EGAN: Exploratory Gene Association Networks by Jesse Paquette Biostatistics and Computational Biology Core Helen Diller Family Comprehensive Cancer Center.
Jesse Gillis 1 and Paul Pavlidis 2 1. Department of Psychiatry and Centre for High-Throughput Biology University of British Columbia, Vancouver, BC Canada.
Suppose we have analyzed total of N genes, n of which turned out to be differentially expressed/co-expressed (experimentally identified - call them significant)
Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.
A A R H U S U N I V E R S I T E T Faculty of Agricultural Sciences Introduction to analysis of microarray data David Edwards.
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Back to basics – Probability, Conditional Probability and Independence Probability of an outcome in an experiment is the proportion of times that.
BIOS6660 shRNAseq Gene Set Enrichment Analysis Tzu L Phang PhD Robert Stearman PhD April 16, 2014.
Multiple Testing Matthew Kowgier. Multiple Testing In statistics, the multiple comparisons/testing problem occurs when one considers a set of statistical.
Statistical Testing with Genes Saurabh Sinha CS 466.
Suppose we have T genes which we measured under two experimental conditions (Ctl and Nic) in n replicated experiments t i * and p i are the t-statistic.
Comp. Genomics Recitation 10 4/7/09 Differential expression detection.
Flat clustering approaches
The Broad Institute of MIT and Harvard Differential Analysis.
CuffDiff ran successfully. Output files include gene_exp.diff What are the next steps? Use Navigation bar to find files; they may be under DNA Subway if.
1 Paper Outline Specific Aim Background & Significance Research Description Potential Pitfalls and Alternate Approaches Class Paper: 5-7 pages (with figures)
Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 6 –Multiple hypothesis testing Marshall University Genomics.
Gene Set Analysis using R and Bioconductor Daniel Gusenleitner
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Micro array Data Analysis. Differential Gene Expression Analysis The Experiment Micro-array experiment measures gene expression in Rats (>5000 genes).
Canadian Bioinformatics Workshops
Estimating the False Discovery Rate in Genome-wide Studies BMI/CS 576 Colin Dewey Fall 2008.
1 A Discussion of False Discovery Rate and the Identification of Differentially Expressed Gene Categories in Microarray Studies Ames, Iowa August 8, 2007.
Gene Set Enrichment Analysis. GSEA: Key Features Ranks all genes on array based on their differential expression Identifies gene sets whose member genes.
Module 2: Analyzing gene lists: over-representation analysis
a Cytoscape plugin to assess enrichment of
Canadian Bioinformatics Workshops
Clustering Manpreet S. Katari.
Canadian Bioinformatics Workshops
Differential Gene Expression
Statistical Testing with Genes
Canadian Bioinformatics Workshops
Significance analysis of microarrays (SAM)
The Omics Dashboard Suzanne Paley Pathway Tools Workshop 2018
Genesets and Enrichment
False discovery rate estimation
Functional enrichment of differentially expressed genes.
The Omics Dashboard.
Varying Intolerance of Gene Pathways to Mutational Classes Explain Genetic Convergence across Neuropsychiatric Disorders  Shahar Shohat, Eyal Ben-David,
Chapter 18 The Binomial Test
Statistical Testing with Genes
Presentation transcript:

Canadian Bioinformatics Workshops

2Module #: Title of Module

Module 2 Finding over-represented pathways in gene lists

Module 2 bioinformatics.ca Enrichment Test Spindle Apoptosis Microarray Experiment (gene expression table) Gene-set Databases ENRICHMENT TEST ENRICHMENT TEST Enrichment Table

Module 2 bioinformatics.ca 5 Enrichment Analysis Given: 1.Gene list: e.g. RRP6, MRD1, RRP7, RRP43, RRP42 (yeast) 2.Gene sets or annotations: e.g. Gene ontology, transcription factor binding sites in promoter Question: Are any of the gene annotations surprisingly enriched in the gene list? Details: – Where do the gene lists come from? – How to assess “surprisingly” (statistics) – How to correct for repeating the tests

Module 2 bioinformatics.ca Two-class Design Expression Matrix Class-1Class-2 Genes Ranked by Differential Statistic E.g.: - Fold change - Log (ratio) - t-test -Significance analysis of microarrays UP DOWN UP DOWN Selection by Threshold

Module 2 bioinformatics.ca Time-course Design Expression Matrix t1t1 t2t2 t3t3 …tntn Gene Clusters E.g.: - K-means - K-medoids - SOM

Module 2 bioinformatics.ca Enrichment Test Gene-set Databases Microarray Experiment (gene expression table) Gene list (e.g UP) Background genes (array genes not significant)

Module 2 bioinformatics.ca Enrichment Test Gene-set Databases Microarray Experiment (gene expression table) Gene list (e.g UP) Background genes (array genes not significant) Gene-set

Module 2 bioinformatics.ca Enrichment Test Gene-set Databases Microarray Experiment (gene expression table) Gene-set Gene list (e.g UP) Overlap between gene list and gene-set Background genes (array genes not significant)

Module 2 bioinformatics.ca Enrichment Test Significant genes (e.g UP) Overlap between gene list and gene-set Background genes (array genes not significant) Is this overlap larger than expected by random sampling the array genes? Random samples of array genes

Module 2 bioinformatics.ca 12 Fisher’s exact test a.k.a., the hypergeometric test Background population: 500 black genes, 4500 red genes Gene list RRP6 MRD1 RRP7 RRP43 RRP42 Null hypothesis: List is a random sample from population Alternative hypothesis: More black genes than expected

Module 2 bioinformatics.ca 13 Background population: 500 black genes, 4500 red genes Gene list RRP6 MRD1 RRP7 RRP43 RRP42 P-value Null distribution Answer = 4.6 x Fisher’s exact test a.k.a., the hypergeometric test

Module 2 bioinformatics.ca 14 Important details To test for under-enrichment of “black”, test for over- enrichment of “red”. Need to choose “background population” appropriately, e.g., if only portion of the total gene complement is queried (or available for annotation), only use that population as background. To test for enrichment of more than one independent types of annotation (red vs black and circle vs square), apply Fisher’s exact test separately for each type. ***More on this later***

Module 2 bioinformatics.ca Beyond Fisher’s Exact Test UP DOWN ENRICHMENT TEST ENRICHMENT TEST Gene list Fisher’s Test (Binomial and Gene list Fisher’s Test (Binomial and Ranked list (semi- quantitative) e.g. GSEA, WMW test KS test Ranked list (semi- quantitative) e.g. GSEA, WMW test KS test UP DOWN

Module 2 bioinformatics.ca Beyond Fisher’s Exact Test Possible problems with Fisher’s Exact Test – No “natural” value for the threshold – Different results at different threshold settings – Possible loss of statistical power due to thresholding No resolution between significant signals with different strengths Weak signals neglected Solution: enrichment tests based on ranked lists

Module 2 bioinformatics.ca GSEA Enrichment Test Gene-setp-valueFDR Spindle Apoptosis Gene-set Databases GSEA Enrichment Table Ranked Gene List

Module 2 bioinformatics.ca GSEA Enrichment Test Gene-setp-valueFDR Spindle Apoptosis Gene-set Databases GSEA Enrichment Table Ranked Gene List The p-value depends only on a single gene-set The FDR corrects for multiple testing

Module 2 bioinformatics.ca GSEA: Method Steps 1.Calculate the ES score 2.Generate the ES distribution for the null hypothesis using permutations see permutation settings 3.Calculate the empirical p-value 4.Calculate the FDR Subramanian A, Tamayo P, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A Oct 25;102(43)

Module 2 bioinformatics.ca GSEA: Method Steps 1.Calculate the ES score 2.Generate the ES distribution for the null hypothesis using permutations see permutation settings 3.Calculate the empirical p-value 4.Calculate the FDR Subramanian A, Tamayo P, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A Oct 25;102(43)

Module 2 bioinformatics.ca GSEA: Method Where are the gene-set genes located in the ranked list? Is there distribution random, or is there an enrichment in either end? ES score calculation

Module 2 bioinformatics.ca GSEA: Method Every present gene (black vertical bar) gives a positive contribution, every absent gene (no vertical bar) gives a negative contribution ES score Gene rank ES score calculation

Module 2 bioinformatics.ca GSEA: Method MAX (absolute value) running ES score --> Final ES Score ES score calculation

Module 2 bioinformatics.ca GSEA: Method High ES score High local enrichment ES score calculation

Module 2 bioinformatics.ca GSEA: Method Empirical p-value estimation (for every gene-set) 1.Generate null-hypothesis distribution from randomized data (see permutation settings) Distribution of ES from N permutations (e.g. 2000) Counts ES Score

Module 2 bioinformatics.ca GSEA: Method Estimate empirical p-value by comparing observed ES score to null-hypothesis distribution from randomized data (for every gene-set) Distribution of ES from N permutations (e.g. 2000) Counts ES Score Real ES score value

Module 2 bioinformatics.ca GSEA: Method Estimate empirical p-value by comparing observed ES score to null-hypothesis distribution from randomized data (for every gene-set) Distribution of ES from N permutations (e.g. 2000) Counts ES Score Real ES score value Randomized with ES ≥ real: 4 / > Empirical p-value = 0.002

Module 2 bioinformatics.ca Using GSEA Installation Launch Desktop Application from: Notes: – if you have sufficient RAM (*), go for the 1Gb option – running GSEA will take some time – (2-5 hrs depending on the system and the memory setting) – you need an internet connection to run GSEA (*)WIN: check using ALT+CTRL+CANC/Task Manager MAC: check using Applications/Utilities/Activity Monitor

Module 2 bioinformatics.ca KS-test vs GSEA KS-test p-values can be directly computed without permutations (so faster) KS-test is older, used outside genomics GSEA has a visual appeal KS-test is not valid with small gene sets GSEA is a modified KS-test

Module 2 bioinformatics.ca Multiple test corrections

Module 2 bioinformatics.ca How to win the P-value lottery, part 1 Background population: 500 black genes, 4500 red genes Random draws … 7,834 draws later … Expect a random draw with observed enrichment once every 1 / P-value draws

Module 2 bioinformatics.ca How to win the P-value lottery, part 2 Keep the gene list the same, evaluate different annotations Observed draw RRP6 MRD1 RRP7 RRP43 RRP42 Different annotations RRP6 MRD1 RRP7 RRP43 RRP42

Module 2 bioinformatics.ca Simple P-value correction: Bonferroni If M = # of annotations tested: Corrected P-value = M x original P-value Corrected P-value is greater than or equal to the probability that any single one of the observed enrichments could be due to random draws. The jargon for this correction is “controlling for the Family-Wise Error Rate (FWER)”

Module 2 bioinformatics.ca Bonferroni correction caveats Bonferroni correction is very stringent and can “wash away” real enrichments. Often users are willing to accept a less stringent condition, the “false discovery rate” (FDR), which leads to a gentler correction when there are real enrichments.

Module 2 bioinformatics.ca False discovery rate (FDR) FDR is the expected proportion of the observed enrichments due to random chance. Compare to Bonferroni correction which is a bound on the probability that any one of the observed enrichments could be due to random chance. Typically FDR corrections are calculated using the Benjamini-Hochberg procedure. FDR threshold is often called the “q-value”

Module 2 bioinformatics.ca Benjamini-Hochberg example P-valueCategoryAdjusted P-valueRankFDR / Q-value … … Transcriptional regulation Transcription factor Initiation of transcription … Nuclear localization RNAi activity Cytoplasmic localization Translation Sort P-values of all tests in decreasing order

Module 2 bioinformatics.ca Benjamini-Hochberg example P-valueCategoryAdjusted P-valueRankFDR / Q-value … Transcriptional regulation Transcription factor Initiation of transcription … Nuclear localization RNAi activity Cytoplasmic localization Translation x 53/1 = x 53/2 = x 53/3 = 0.35 … 0.04 x 53/50 = x 53/51 = x 53/52 = x 53/53 = 0.07 Adjusted P-value = P-value X [# of tests] / Rank …

Module 2 bioinformatics.ca Benjamini-Hochberg example P-valueCategoryAdjusted P-valueRankFDR / Q-value … Transcriptional regulation Transcription factor Initiation of transcription … Nuclear localization RNAi activity Cytoplasmic localization Translation x 53/1 = x 53/2 = x 53/3 = 0.35 … 0.04 x 53/50 = x 53/51 = x 53/52 = x 53/53 = 0.07 Q-value = minimum adjusted P-value at given rank or below … …

Module 2 bioinformatics.ca Benjamini-Hochberg example P-valueCategoryAdjusted P-valueRank FDR / Q-value … Transcriptional regulation Transcription factor Initiation of transcription … Nuclear localization RNAi activity Cytoplasmic localization Translation x 53/1 = x 53/2 = x 53/3 = 0.35 … 0.04 x 53/50 = x 53/51 = x 53/52 = x 53/53 = 0.07 P-value threshold is highest ranking P-value for which corresponding Q-value is below desired significance threshold … P-value threshold for FDR < 0.05 FDR < 0.05? YYY…YNNNYYY…YNNN …

Module 2 bioinformatics.ca Reducing multiple test correction stringency The correction to the P-value threshold  depends on the # of tests that you do, so, no matter what, the more tests you do, the more sensitive the test needs to be Can control the stringency by reducing the number of tests: e.g. use GO slim; restrict testing to the appropriate GO annotations; or select only larger GO categories.

Module 2 bioinformatics.ca Summary Enrichment analysis: – Statistical tests Gene list: Fisher’s Exact Test Gene rankings: GSEA, KS-test – Multiple test correction Bonferroni: stringent, controls probability of at least one false positive* FDR: more forgiving, controls expected proportion of false positives* -- typically uses Benjamini-Hochberg * Type 1 error, aka probability that observed enrichment if no association

Module 2 bioinformatics.ca Lab Time Part 1 – Try out DAVID – use the yeast gene list and Module 2 gene list – Follow Lab 2 protocol – Part 2 – Continue with the Cytoscape protocol – Cytoscape Part 3 – Try out enrichment map – load the plugin from the plugin manager – Load DAVID results – or - load the GSEA enrichment analysis file - EM_EstrogenMCF7_TestData.zip (unzip) available at

Module 2 bioinformatics.ca If DAVID doesn’t recognize your genes, it can try to detect the correct identifiers to use Upload your gene list

Module 2 bioinformatics.ca Step 1 Gene ID mapping results

Module 2 bioinformatics.ca Run the enrichment analysis

Module 2 bioinformatics.ca Run the enrichment analysis

Module bioinformatics.ca We are on a Coffee Break & Networking Session