Canadian Bioinformatics Workshops

Slides:



Advertisements
Similar presentations
Gene Set Enrichment Analysis Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Advertisements

1 Chapter 20: Statistical Tests for Ordinal Data.
1 Statistics Achim Tresch UoC / MPIPZ Cologne treschgroup.de/OmicsModule1415.html
From the homework: Distribution of DNA fragments generated by Micrococcal nuclease digestion mean(nucs) = bp median(nucs) = 110 bp sd(nucs+ = 17.3.
Gene Set Enrichment Analysis Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein.
Introduction to Microarry Data Analysis - II BMI 730
Data mining with the Gene Ontology Josep Lluís Mosquera April 2005 Grup de Recerca en Estadística i Bioinformàtica GOing into Biological Meaning.
Clustering short time series gene expression data Jason Ernst, Gerard J. Nau and Ziv Bar-Joseph BIOINFORMATICS, vol
Detecting Differentially Expressed Genes Pengyu Hong 09/13/2005.
Using Gene Ontology Models and Tests Mark Reimers, NCI.
Differentially expressed genes
Statistical Analysis of Microarray Data
Gene Set Analysis 09/24/07. From individual gene to gene sets Finding a list of differentially expressed genes is only the starting point. Suppose we.
Analysis of Differential Expression T-test ANOVA Non-parametric methods Correlation Regression.
Biological Interpretation of Microarray Data Helen Lockstone DTC Bioinformatics Course 9 th February 2010.
Significance Tests P-values and Q-values. Outline Statistical significance in multiple testing Statistical significance in multiple testing Empirical.
On Comparing Classifiers: Pitfalls to Avoid and Recommended Approach Published by Steven L. Salzberg Presented by Prakash Tilwani MACS 598 April 25 th.
Gene Set Enrichment Analysis Petri Törönen petri(DOT)toronen(AT)helsinki.fi.
Chapter 14 Inferential Data Analysis
Different Expression Multiple Hypothesis Testing STAT115 Spring 2012.
Inferential Statistics
False Discovery Rate (FDR) = proportion of false positive results out of all positive results (positive result = statistically significant result) Ladislav.
Hypothesis Testing:.
1Module 2: Analyzing Gene Lists Canadian Bioinformatics Workshops
Multiple testing correction
Multiple testing in high- throughput biology Petter Mostad.
Hypothesis Testing.
Jeopardy Hypothesis Testing T-test Basics T for Indep. Samples Z-scores Probability $100 $200$200 $300 $500 $400 $300 $400 $300 $400 $500 $400.
Inferential Stats, Discussions and Abstracts!! BATs Identify which inferential test to use for your experiment Use the inferential test to decide if your.
Gene Set Enrichment Analysis (GSEA)
Essential Statistics in Biology: Getting the Numbers Right
Introduction To Biological Research. Step-by-step analysis of biological data The statistical analysis of a biological experiment may be broken down into.
Jesse Gillis 1 and Paul Pavlidis 2 1. Department of Psychiatry and Centre for High-Throughput Biology University of British Columbia, Vancouver, BC Canada.
Significance analysis of microarrays (SAM) SAM can be used to pick out significant genes based on differential expression between sets of samples. Currently.
Suppose we have analyzed total of N genes, n of which turned out to be differentially expressed/co-expressed (experimentally identified - call them significant)
Differential Gene Expression Dennis Kostka, Christine Steinhoff Slides adapted from Rainer Spang.
Multiple Testing in Microarray Data Analysis Mi-Ok Kim.
Bioinformatics Expression profiling and functional genomics Part II: Differential expression Ad 27/11/2006.
A A R H U S U N I V E R S I T E T Faculty of Agricultural Sciences Introduction to analysis of microarray data David Edwards.
CS5263 Bioinformatics Lecture 20 Practical issues in motif finding Final project.
Back to basics – Probability, Conditional Probability and Independence Probability of an outcome in an experiment is the proportion of times that.
Multiple Testing Matthew Kowgier. Multiple Testing In statistics, the multiple comparisons/testing problem occurs when one considers a set of statistical.
Statistical Testing with Genes Saurabh Sinha CS 466.
Gene set analyses of genomic datasets Andreas Schlicker Jelle ten Hoeve Lodewyk Wessels.
Suppose we have T genes which we measured under two experimental conditions (Ctl and Nic) in n replicated experiments t i * and p i are the t-statistic.
Chapter 6: Analyzing and Interpreting Quantitative Data
Comp. Genomics Recitation 10 4/7/09 Differential expression detection.
DTC Quantitative Methods Bivariate Analysis: t-tests and Analysis of Variance (ANOVA) Thursday 14 th February 2013.
Getting the story – biological model based on microarray data Once the differentially expressed genes are identified (sometimes hundreds of them), we need.
The Broad Institute of MIT and Harvard Differential Analysis.
Marshall University School of Medicine Department of Biochemistry and Microbiology BMS 617 Lecture 6 –Multiple hypothesis testing Marshall University Genomics.
Distinguishing active from non active genes: Main principle: DNA hybridization -DNA hybridizes due to base pairing using H-bonds -A/T and C/G and A/U possible.
Chapter 13 Understanding research results: statistical inference.
A Quantitative Overview to Gene Expression Profiling in Animal Genetics Armidale Animal Breeding Summer Course, UNE, Feb Analysis of (cDNA) Microarray.
Gene Set Analysis using R and Bioconductor Daniel Gusenleitner
BIOL 582 Lecture Set 2 Inferential Statistics, Hypotheses, and Resampling.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Micro array Data Analysis. Differential Gene Expression Analysis The Experiment Micro-array experiment measures gene expression in Rats (>5000 genes).
Statistical Decision Making. Almost all problems in statistics can be formulated as a problem of making a decision. That is given some data observed from.
Canadian Bioinformatics Workshops
Estimating the False Discovery Rate in Genome-wide Studies BMI/CS 576 Colin Dewey Fall 2008.
Gene Set Enrichment Analysis. GSEA: Key Features Ranks all genes on array based on their differential expression Identifies gene sets whose member genes.
Canadian Bioinformatics Workshops
Module 2: Analyzing gene lists: over-representation analysis
Differential Gene Expression
Statistical Testing with Genes
Canadian Bioinformatics Workshops
Functional enrichment of differentially expressed genes.
Statistical Testing with Genes
Presentation transcript:

Canadian Bioinformatics Workshops www.bioinformatics.ca

Module #: Title of Module 2

Module 2 Finding over-represented pathways in gene lists Quaid Morris Pathway and Network Analysis of –omics Data June 4-6, 2014 http://morrislab.med.utoronto.ca

Learning Objectives of Module 2 Be able to select the appropriate enrichment test for your data. Be able to determine the appropriate background gene list when running Fisher’s Exact Test (aka Hypergeometric test). Be able to determine when you need a multiple test correction. Be able to select whether to use a Bonferroni corrected P-value or a false discovery rate. Be able to explain, in plain language, how you calculate each correction. Understand how P-values can be computed by resampling.

Outline Introduction to enrichment analysis Hypergeometric Test, aka Fisher’s Exact Test GSEA enrichment analysis for ranked lists. Multiple test corrections: Bonferroni correction False Discovery Rate computation using Benjamini-Hochberg procedure

Types of enrichment analysis Gene list (e.g. expression change > 2-fold) Answers the question: Are any gene sets surprisingly enriched (or depleted) in my gene list? Statistical test: Fisher’s Exact Test (aka Hypergeometric test) Ranked list (e.g. by differential expression) Answers the question: Are any gene set ranked surprisingly high or low in my ranked list of genes? Statistical test: GSEA (+ others we won’t discuss)

Enrichment Test Microarray Experiment (gene expression table) Enrichment Table Spindle 0.00001 Apoptosis 0.00025 ENRICHMENT TEST Gene-set Databases

Gene list enrichment analysis Given: Gene list: e.g. RRP6, MRD1, RRP7, RRP43, RRP42 (yeast) Gene sets or annotations: e.g. Gene ontology, transcription factor binding sites in promoter Question: Are any of the gene annotations surprisingly enriched in the gene list? Details: Where do the gene lists come from? How to assess “surprisingly” (statistics) How to correct for repeating the tests

Two-class design for gene lists Genes Ranked by Differential Statistic Selection by Threshold Expression Matrix UP UP DOWN DOWN Class-1 Class-2 E.g.: Fold change Log (ratio) t-test Significance analysis of microarrays

Time-course design for gene lists Clusters E.g.: K-means K-medoids SOM Expression Matrix Each cluster is a separate gene list t1 t2 t3 … tn

Example gene list enrichment test (e.g UP-regulated) Microarray Experiment (gene expression table) Gene-set Databases Background (all genes on the array)

Example gene list enrichment test (e.g UP-regulated) Microarray Experiment (gene expression table) Gene-set Gene-set Databases Background (all genes on the array)

Enrichment Test The output of an enrichment test is a P-value Random samples of array genes The output of an enrichment test is a P-value The P-value assesses the probability that the overlap is at least as large as observed by random sampling the array genes.

Recipe for gene list enrichment test Step 1: Define your gene list and your background list, Step 2: Select your gene sets to test for enrichment, Step 3: Run enrichment tests and correct for multiple testing, if necessary, Step 4: Interpret your enrichments Step 5: Publish! ;) Consequtive basepairs

Why test enrichment in ranked lists? Possible problems with gene list test No “natural” value for the threshold Different results at different threshold settings Possible loss of statistical power due to thresholding No resolution between significant signals with different strengths Weak signals neglected

Example ranked list enrichment test Gene List Enrichment Table Gene-set p-value Spindle 0.0001 Apoptosis 0.025 GSEA Gene-set Databases

Recipe for gene list enrichment test Step 1: Rank your gene list, Step 2: Select your gene sets to test for enrichment, Step 3: Run enrichment tests and correct for multiple testing, if necessary, Step 4: Interpret your enrichments Step 5: Publish! ;) Consequtive basepairs

Outline of theory component Hypergeometric test for calculating enrichment P-values for gene lists GSEA for computing enrichment P-values for ranked lists Multiple test corrections: Bonferroni Benjamini-Hochberg FDR

The hypergeometric test a.k.a., Fisher’s exact test Gene list Null hypothesis: List is a random sample from population Alternative hypothesis: More black genes than expected RRP6 MRD1 RRP7 RRP43 RRP42 Background population: 500 black genes, 4500 red genes

The hypergeometric test a.k.a., Fisher’s exact test Gene list Null distribution RRP6 MRD1 RRP7 RRP43 RRP42 P-value Answer = 4.6 x 10-4 Background population: 500 black genes, 4500 red genes

2x2 contingency table for Fisher’s Exact Test Gene list In gene list Not in gene list In gene set 4 496 Not in gene set 1 4499 RRP6 MRD1 RRP7 RRP43 RRP42 e.g.: http://www.graphpad.com/quickcalcs/contingency1.cfm Background population: 500 black genes, 4500 red genes

Important details To test for under-enrichment of “black”, test for over-enrichment of “red”. Need to choose “background population” appropriately, e.g., if only portion of the total gene complement is queried (or available for annotation), only use that population as background. To test for enrichment of more than one independent types of annotation (red vs black and circle vs square), apply Fisher’s exact test separately for each type. ***More on this later***

Other enrichment tests UP Gene list Fisher’s Exact Test, Binomial and Chi-squared. ENRICHMENT TEST DOWN UP Ranked list (semi-quantitative) GSEA, Wilcoxon ranksum, Mann-Whitney U, Kolmogorov-Smirnov DOWN

GSEA: Method Steps Calculate the ES score Generate the ES distribution for the null hypothesis using permutations see permutation settings Calculate the empirical p-value Subramanian A, Tamayo P, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005 Oct 25;102(43)

GSEA: Method ES score calculation Where are the gene-set genes located in the ranked list? Is there distribution random, or is there an enrichment in either end?

GSEA: Method ES score calculation Every present gene (black vertical bar) gives a positive contribution, every absent gene (no vertical bar) gives a negative contribution ES score 1 1001 4001 Gene rank

GSEA: Method ES score calculation MAX (absolute value) running ES score --> Final ES Score

GSEA: Method ES score calculation High ES score <--> High local enrichment

GSEA: Method Empirical p-value estimation (for every gene-set) Generate null-hypothesis distribution from randomized data (see permutation settings) Distribution of ES from N permutations (e.g. 2000) Counts ES Score

GSEA: Method Estimate empirical p-value by comparing observed ES score to null-hypothesis distribution from randomized data (for every gene-set) Distribution of ES from N permutations (e.g. 2000) Counts Real ES score value ES Score

GSEA: Method Estimate empirical p-value by comparing observed ES score to null-hypothesis distribution from randomized data (for every gene-set) Distribution of ES from N permutations (e.g. 2,000) Randomized with ES ≥ real: 4 / 2,000 --> Empirical p-value = 0.002 Counts Real ES score value ES Score

More GSEA examples + - UP in treated Gene-set 1 (n=17) Ranked gene list UP in treated + + + + + Gene-set 1 (n=17) Enriched in treated + + Enrichment Score Gene-set 2 Not enriched Enrichment Score + How is the running sum calculated. We have 2 variables N and G. N represents the # of genes in our ranked list, let’s say 20000 genes and G is the number of genes in the gene-set. If the gene I is in the gene-set,….. If the gene I is not in the gene-set… The running sum of X is calculated until we reach the last gene in the ranked list Thus, if a gene is in the gene-set , the running sum increases a lot and if not the running sum decreases just a little bit (this is because in the 20000 gene list, we have much more genes that are not in the gene-set (containing) 50 genes than otherwise) We plot this running sum and the enrichment score corresponds to the maximum deviation from zero. We can have a positive enrichment score if the genes in the gene-set are at the top of the list (thus up-regulated in the treated sample versus the control) and a negative enrichment score if the genes are more at the bottom of the list, corresponding to down-regulated genes. - Gene-set 3 (n=22) Depleted in treated DOWN in treated + Enrichment Score

Multiple test corrections

How to win the P-value lottery, part 1 Random draws Expect a random draw with observed enrichment once every 1 / P-value draws … 7,834 draws later … Whenever you do multiple tests, need to correct the p-values Remind audience of p-value definition Motivating example: Do 20 tests at a p-value of 0.05, expect one false positive Background population: 500 black genes, 4500 red genes

How to win the P-value lottery, part 2 Keep the gene list the same, evaluate different annotations Observed draw Different annotation RRP6 MRD1 RRP7 RRP43 RRP42 RRP6 MRD1 RRP7 RRP43 RRP42 Use different annotations. “evaluate different annotations”

Simple P-value correction: Bonferroni If M = # of annotations tested: Corrected P-value = M x original P-value Corrected P-value is greater than or equal to the probability that one or more of the observed enrichments could be due to random draws. The jargon for this correction is “controlling for the Family-Wise Error Rate (FWER)”

Bonferroni correction caveats Bonferroni correction is very stringent and can “wash away” real enrichments leading to false negatives, Often one is willing to accept a less stringent condition, the “false discovery rate” (FDR), which leads to a gentler correction when there are real enrichments.

False discovery rate (FDR) FDR is the expected proportion of the observed enrichments due to random chance. Compare to Bonferroni correction which is a bound on the probability that any one of the observed enrichments could be due to random chance. Typically FDR corrections are calculated using the Benjamini-Hochberg procedure. FDR threshold is often called the “q-value”

Benjamini-Hochberg example Rank Category P-value Adjusted P-value FDR / Q-value 1 2 3 … 50 51 52 53 Transcriptional regulation Transcription factor Initiation of transcription … Nuclear localization RNAi activity Cytoplasmic localization Translation 0.001 0.01 0.02 … 0.04 0.05 0.06 0.07 Sort P-values of all tests in decreasing order

Benjamini-Hochberg example Rank Category P-value Adjusted P-value FDR / Q-value 1 2 3 … 50 51 52 53 Transcriptional regulation Transcription factor Initiation of transcription … Nuclear localization RNAi activity Cytoplasmic localization Translation 0.001 0.01 0.02 … 0.04 0.05 0.06 0.07 0.001 x 53/1 = 0.053 0.01 x 53/2 = 0.27 0.02 x 53/3 = 0.35 … 0.04 x 53/50 = 0.042 0.05 x 53/51 = 0.052 0.06 x 53/52 = 0.061 0.07 x 53/53 = 0.07 Adjusted P-value = P-value X [# of tests] / Rank

Benjamini-Hochberg example Rank Category P-value Adjusted P-value FDR / Q-value 1 2 3 … 50 51 52 53 Transcriptional regulation Transcription factor Initiation of transcription … Nuclear localization RNAi activity Cytoplasmic localization Translation 0.001 0.01 0.02 … 0.04 0.05 0.06 0.07 0.001 x 53/1 = 0.053 0.01 x 53/2 = 0.27 0.02 x 53/3 = 0.35 … 0.04 x 53/50 = 0.042 0.05 x 53/51 = 0.052 0.06 x 53/52 = 0.061 0.07 x 53/53 = 0.07 0.042 … 0.052 0.061 0.07 Q-value = minimum adjusted P-value at given rank or below

Benjamini-Hochberg example I Rank Category P-value Adjusted P-value 1 2 3 4 5 … 52 53 Transcriptional regulation Transcription factor Initiation of transcription Nuclear localization Chromatin modification … Cytoplasmic localization Translation 0.001 0.002 0.003 0.0031 0.005 … 0.97 0.99 0.001 x 53/1 = 0.053 0.002 x 53/2 = 0.053 0.003 x 53/3 = 0.053 0.0031 x 53/4 = 0.040 0.005 x 53/5 = 0.053 … 0.985 x 53/52 = 1.004 0.99 x 53/53 = 0.99 Adjusted P-value is “nominal” P-value times # of tests divided by the rank of the P-value in sorted list

Benjamini-Hochberg example II FDR / Q-value Rank Category P-value Adjusted P-value 1 2 3 4 5 … 52 53 Transcriptional regulation Transcription factor Initiation of transcription Nuclear localization Chromatin modification … Cytoplasmic localization Translation 0.001 0.002 0.003 0.0031 0.005 … 0.97 0.99 0.001 x 53/1 = 0.053 0.002 x 53/2 = 0.053 0.003 x 53/3 = 0.053 0.0031 x 53/4 = 0.040 0.005 x 53/5 = 0.053 … 0.985 x 53/52 = 1.004 0.99 x 53/53 = 0.99 0.040 0.053 … 0.99 Q-value (or FDR) corresponding to a nominal P-value is the smallest adjusted P-value assigned to P-values with the same or larger ranks.

Benjamini-Hochberg example III P-value threshold for FDR < 0.05 FDR / Q-value Rank Category P-value Adjusted P-value 1 2 3 4 5 … 52 53 Transcriptional regulation Transcription factor Initiation of transcription Nuclear localization Chromatin modification … Cytoplasmic localization Translation 0.001 0.002 0.003 0.0031 0.005 … 0.97 0.99 0.001 x 53/1 = 0.053 0.002 x 53/2 = 0.053 0.003 x 53/3 = 0.053 0.0031 x 53/4 = 0.040 0.005 x 53/5 = 0.053 … 0.985 x 53/52 = 1.004 0.99 x 53/53 = 0.99 0.040 0.053 … 0.99 Red: non-significant Green: significant at FDR < 0.05 P-value threshold is highest ranking P-value for which corresponding Q-value is below desired significance threshold

Reducing multiple test correction stringency The correction to the P-value threshold α depends on the # of tests that you do, so, no matter what, the more tests you do, the more sensitive the test needs to be Can control the stringency by reducing the number of tests: e.g. use GO slim; restrict testing to the appropriate GO annotations; or select only larger GO categories.

Summary Enrichment analysis: Statistical tests Gene list: Fisher’s Exact Test Ranked list: GSEA, also see Wilcoxon ranksum, Mann-Whitney U-test, Kolmogorov-Smirnov test Multiple test correction Bonferroni: stringent, controls probability of at least one false positive* FDR: more forgiving, controls expected proportion of false positives* -- typically uses Benjamini-Hochberg * Type 1 error, aka probability that observed enrichment if no association

We are on a Coffee Break & Networking Session