Canadian Bioinformatics Workshops

1 Canadian Bioinformatics Workshops www.bioinformatics.ca

3 Module 2 Finding over-represented pathways in gene lists http://morrislab.med.utoronto.ca

4 Module 2 bioinformatics.ca Enrichment Test
[Diagram: Microarray Experiment (gene expression table) and Gene-set Databases feed the ENRICHMENT TEST, which produces the Enrichment Table]
Enrichment Table: Spindle 0.00001; Apoptosis 0.00025

5 Enrichment Analysis
Given:
1. Gene list: e.g. RRP6, MRD1, RRP7, RRP43, RRP42 (yeast)
2. Gene sets or annotations: e.g. Gene Ontology, transcription factor binding sites in promoters
Question: Are any of the gene annotations surprisingly enriched in the gene list?
Details:
- Where do the gene lists come from?
- How to assess “surprisingly” (statistics)
- How to correct for repeating the tests

6 Two-class Design
[Diagram: Expression Matrix (Class-1, Class-2) --> Genes Ranked by Differential Statistic (UP / DOWN) --> Selection by Threshold (UP / DOWN)]
Differential statistic, e.g.:
- Fold change
- Log (ratio)
- t-test
- Significance analysis of microarrays
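The two-class workflow on this slide (rank genes by a differential statistic, then select by threshold) can be sketched in a few lines of Python. This is only an illustration: the expression values, the gene names, the helper functions and the threshold of 1.0 are hypothetical and not taken from the slides, and a real analysis would usually use a t-test or SAM rather than a bare log ratio.

```python
import math
from statistics import mean

# Hypothetical toy expression matrix: gene -> (class-1 values, class-2 values).
expression = {
    "geneA": ([8.0, 9.0, 10.0], [2.0, 2.5, 2.0]),
    "geneB": ([5.0, 5.5, 4.5], [5.0, 5.0, 5.0]),
    "geneC": ([1.0, 1.5, 1.0], [6.0, 7.0, 8.0]),
}

def log2_ratio(class1, class2):
    """Log2 of the ratio of class means (one possible differential statistic)."""
    return math.log2(mean(class1) / mean(class2))

def rank_genes(matrix):
    """Return (gene, score) pairs sorted from most UP to most DOWN."""
    scores = {gene: log2_ratio(c1, c2) for gene, (c1, c2) in matrix.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

def select_by_threshold(ranked, threshold=1.0):
    """Split a ranked list into UP and DOWN gene lists (threshold is an assumption)."""
    up = [gene for gene, score in ranked if score >= threshold]
    down = [gene for gene, score in ranked if score <= -threshold]
    return up, down

ranked = rank_genes(expression)
up, down = select_by_threshold(ranked)
print("Ranked:", [gene for gene, _ in ranked])
print("UP:", up, "DOWN:", down)
```

The UP (or DOWN) list is what a gene-list test such as Fisher’s exact test takes as input, while the full ranked list is what a ranked-list test such as GSEA uses.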

7 Time-course Design
[Diagram: Expression Matrix (t1, t2, t3, …, tn) --> Gene Clusters]
Clustering methods, e.g.:
- K-means
- K-medoids
- SOM

8 Enrichment Test
[Diagram: Gene-set Databases; Microarray Experiment (gene expression table) split into Gene list (e.g. UP) and Background genes (array genes not significant)]

9 Enrichment Test
[Diagram: as on slide 8, with a Gene-set drawn from the Gene-set Databases]

10 Enrichment Test
[Diagram: as on slide 9, highlighting the Overlap between gene list and gene-set]

11 Enrichment Test
[Diagram: Significant genes (e.g. UP); Background genes (array genes not significant); Overlap between gene list and gene-set; Random samples of array genes]
Is this overlap larger than expected from random sampling of the array genes?

12 Fisher’s exact test (a.k.a. the hypergeometric test)
Gene list: RRP6, MRD1, RRP7, RRP43, RRP42
Background population: 500 black genes, 4500 red genes
Null hypothesis: List is a random sample from population
Alternative hypothesis: More black genes than expected

13 Fisher’s exact test (a.k.a. the hypergeometric test)
Gene list: RRP6, MRD1, RRP7, RRP43, RRP42
Background population: 500 black genes, 4500 red genes
[Plot: null distribution, with the P-value marked]
Answer = 4.6 x 10^-4
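The one-sided P-value on this slide is the upper tail of a hypergeometric distribution. The sketch below assumes that 4 of the 5 genes in the list are “black”; the transcript does not state the count (it was shown only in the figure), but this assumption gives roughly the 4.6 x 10^-4 reported on the slide. The function name is illustrative.

```python
from math import comb

def hypergeom_upper_tail(k, list_size, n_black, n_total):
    """P(X >= k): chance of at least k black genes when drawing
    list_size genes without replacement from the background population."""
    n_red = n_total - n_black
    upper = min(list_size, n_black)
    favourable = sum(
        comb(n_black, i) * comb(n_red, list_size - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(n_total, list_size)

# Background: 500 black + 4500 red genes; gene list of 5 genes.
# Assumption: 4 of the 5 list genes are black.
p_value = hypergeom_upper_tail(k=4, list_size=5, n_black=500, n_total=5000)
print(f"{p_value:.1e}")  # 4.6e-04
```

Only the standard library is used here; statistical packages provide the same test directly, typically taking a 2 x 2 contingency table as input.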

14 Important details
- To test for under-enrichment of “black”, test for over-enrichment of “red”.
- Need to choose the “background population” appropriately, e.g., if only a portion of the total gene complement is queried (or available for annotation), only use that population as background.
- To test for enrichment of more than one independent type of annotation (red vs black and circle vs square), apply Fisher’s exact test separately for each type.
***More on this later***

15 Beyond Fisher’s Exact Test
[Diagram: the ENRICHMENT TEST takes either a gene list or a ranked list (UP to DOWN) as input]
- Gene list: Fisher’s Test (Binomial and …)
- Ranked list (semi-quantitative): e.g. GSEA, WMW test, KS test

16 Beyond Fisher’s Exact Test
Possible problems with Fisher’s Exact Test:
- No “natural” value for the threshold
- Different results at different threshold settings
- Possible loss of statistical power due to thresholding
  - No resolution between significant signals with different strengths
  - Weak signals neglected
Solution: enrichment tests based on ranked lists

17 GSEA Enrichment Test
[Diagram: Ranked Gene List and Gene-set Databases --> GSEA --> Enrichment Table]
Gene-set    p-value   FDR
Spindle     0.0001    0.01
Apoptosis   0.025     0.09

18 GSEA Enrichment Test
[Diagram: Ranked Gene List and Gene-set Databases --> GSEA --> Enrichment Table]
Gene-set    p-value   FDR
Spindle     0.0001    0.01
Apoptosis   0.025     0.09
- The p-value depends only on a single gene-set
- The FDR corrects for multiple testing

19 GSEA: Method
Steps:
1. Calculate the enrichment score (ES)
2. Generate the ES distribution for the null hypothesis using permutations (see permutation settings)
3. Calculate the empirical p-value
4. Calculate the FDR
Subramanian A, Tamayo P, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005 Oct 25;102(43)

21 GSEA: Method
ES score calculation
- Where are the gene-set genes located in the ranked list?
- Is their distribution random, or is there an enrichment at either end?

22 GSEA: Method
ES score calculation
Every present gene (black vertical bar) gives a positive contribution; every absent gene (no vertical bar) gives a negative contribution.
[Plot: running ES score (y-axis) against gene rank (x-axis)]

23 GSEA: Method
ES score calculation
MAX (absolute value) of the running ES score --> Final ES score

24 GSEA: Method
ES score calculation
A high ES score indicates high local enrichment.
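The running-sum idea from the ES slides can be sketched as follows. This is a simplified, unweighted version: every gene-set member adds 1 / (number of hits) and every other gene subtracts 1 / (number of misses). GSEA itself can additionally weight each hit by the gene's ranking statistic, which this sketch deliberately leaves out. Function names and the toy gene identifiers are illustrative.

```python
def running_enrichment(ranked_genes, gene_set):
    """Running sum over the ranked list: step up at gene-set members,
    step down at all other genes (unweighted simplification)."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must overlap the ranked list only partially")
    running, profile = 0.0, []
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        profile.append(running)
    return profile

def enrichment_score(ranked_genes, gene_set):
    """Final ES: the running-sum value with the largest absolute value."""
    return max(running_enrichment(ranked_genes, gene_set), key=abs)

# Toy ranked list of ten genes, most UP first (hypothetical identifiers).
ranked = [f"g{i}" for i in range(1, 11)]
print(enrichment_score(ranked, {"g1", "g2"}))    # gene set at the top
print(enrichment_score(ranked, {"g9", "g10"}))   # gene set at the bottom
```

A gene set concentrated at the top of the list gives a large positive ES, one concentrated at the bottom gives a large negative ES, and a set scattered through the list stays close to zero.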

25 GSEA: Method
Empirical p-value estimation (for every gene-set)
1. Generate the null-hypothesis distribution from randomized data (see permutation settings)
[Histogram: distribution of ES from N permutations (e.g. 2000); x-axis: ES score, y-axis: counts]

26 GSEA: Method
Estimate the empirical p-value by comparing the observed ES score to the null-hypothesis distribution from randomized data (for every gene-set)
[Histogram: distribution of ES from N permutations (e.g. 2000), with the real ES score value marked; x-axis: ES score, y-axis: counts]

27 GSEA: Method
Estimate the empirical p-value by comparing the observed ES score to the null-hypothesis distribution from randomized data (for every gene-set)
[Histogram: distribution of ES from N permutations (e.g. 2000), with the real ES score value marked; x-axis: ES score, y-axis: counts]
Randomized with ES ≥ real: 4 / 2000 --> Empirical p-value = 0.002
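The counting step on this slide amounts to a one-line calculation. The sketch below assumes a positive observed ES and a list of ES values already obtained from permutations; the null values are hypothetical placeholders chosen so that 4 of 2000 reach the observed score, as in the slide's example.

```python
def empirical_p_value(observed_es, null_es):
    """Fraction of permutation ES values at least as large as the observed ES
    (one-sided, for a positive observed score)."""
    n_extreme = sum(1 for es in null_es if es >= observed_es)
    return n_extreme / len(null_es)

# Hypothetical null distribution from N = 2000 permutations:
# 1996 values below the observed score and 4 at or above it.
null_es = [0.1] * 1996 + [0.80, 0.85, 0.90, 0.95]
print(empirical_p_value(0.80, null_es))  # 0.002
```

With N permutations the smallest non-zero p-value this estimate can report is 1 / N, which is one reason the number of permutations matters.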

28 Using GSEA
Installation
Launch Desktop Application from: http://www.broadinstitute.org/gsea/msigdb/downloads.jsp
Notes:
- if you have sufficient RAM (*), go for the 1 GB option
- running GSEA will take some time (2-5 hrs depending on the system and the memory setting)
- you need an internet connection to run GSEA
(*) WIN: check using CTRL+ALT+DEL / Task Manager
    MAC: check using Applications/Utilities/Activity Monitor

29 KS-test vs GSEA
- KS-test p-values can be directly computed without permutations (so faster)
- KS-test is older, used outside genomics
- GSEA has a visual appeal
- KS-test is not valid with small gene sets
- GSEA is a modified KS-test

30 Multiple test corrections

31 How to win the P-value lottery, part 1
Background population: 500 black genes, 4500 red genes
Random draws … 7,834 draws later …
Expect a random draw with observed enrichment once every 1 / P-value draws
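The “once every 1 / P-value draws” statement can be made concrete with a short calculation. The sketch below assumes independent random draws and uses the P-value of 4.6 x 10^-4 from the Fisher’s exact test example; the function names are illustrative.

```python
def expected_draws(p_value):
    """Average number of random draws between two chance 'enrichments'."""
    return 1.0 / p_value

def prob_at_least_one(p_value, n_draws):
    """Chance that at least one of n independent random draws
    looks as enriched as the observed list."""
    return 1.0 - (1.0 - p_value) ** n_draws

p = 4.6e-4
print(round(expected_draws(p)))      # about 2174 draws on average
print(prob_at_least_one(p, 7834))    # close to 1 after 7,834 draws
```

The same arithmetic applies when the gene list is fixed and many annotations are tested instead, which is why a correction for multiple testing is needed.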

32 How to win the P-value lottery, part 2
Keep the gene list the same, evaluate different annotations
Observed draw: RRP6, MRD1, RRP7, RRP43, RRP42
Different annotations: RRP6, MRD1, RRP7, RRP43, RRP42

33 Simple P-value correction: Bonferroni
If M = # of annotations tested: Corrected P-value = M x original P-value
Corrected P-value is greater than or equal to the probability that any single one of the observed enrichments could be due to random draws.
The jargon for this correction is “controlling for the Family-Wise Error Rate (FWER)”
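A minimal sketch of the Bonferroni correction described on this slide follows. Capping the corrected value at 1 is a common convention added here, not something the slide states, and the example P-values are hypothetical.

```python
def bonferroni(p_values):
    """Corrected P-value = M x original P-value, capped at 1.0 (the cap is an
    added convention so that the result stays a valid probability)."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]

# Hypothetical P-values for M = 3 annotations tested on the same gene list.
print(bonferroni([0.001, 0.02, 0.5]))
```

An annotation is then called significant if its corrected P-value is below the chosen threshold, which is equivalent to comparing the original P-value with threshold / M.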

34 Bonferroni correction caveats
- Bonferroni correction is very stringent and can “wash away” real enrichments.
- Often users are willing to accept a less stringent condition, the “false discovery rate” (FDR), which leads to a gentler correction when there are real enrichments.

35 False discovery rate (FDR)
- FDR is the expected proportion of the observed enrichments due to random chance.
- Compare to Bonferroni correction, which is a bound on the probability that any one of the observed enrichments could be due to random chance.
- Typically FDR corrections are calculated using the Benjamini-Hochberg procedure.
- FDR threshold is often called the “q-value”

36 Benjamini-Hochberg example
Sort P-values of all tests in increasing order
Rank  Category                     P-value  Adjusted P-value  FDR / Q-value
1     Transcriptional regulation   0.001
2     Transcription factor         0.01
3     Initiation of transcription  0.02
…
50    Nuclear localization         0.04
51    RNAi activity                0.055
52    Cytoplasmic localization     0.06
53    Translation                  0.07

37 Benjamini-Hochberg example
Adjusted P-value = P-value x [# of tests] / Rank
Rank  Category                     P-value  Adjusted P-value        FDR / Q-value
1     Transcriptional regulation   0.001    0.001 x 53/1 = 0.053
2     Transcription factor         0.01     0.01 x 53/2 = 0.27
3     Initiation of transcription  0.02     0.02 x 53/3 = 0.35
…
50    Nuclear localization         0.04     0.04 x 53/50 = 0.042
51    RNAi activity                0.055    0.055 x 53/51 = 0.057
52    Cytoplasmic localization     0.06     0.06 x 53/52 = 0.061
53    Translation                  0.07     0.07 x 53/53 = 0.07

38 Benjamini-Hochberg example
Q-value = minimum adjusted P-value at the given rank or any rank below it in the table (i.e. any larger P-value)
Rank  Category                     P-value  Adjusted P-value        FDR / Q-value
1     Transcriptional regulation   0.001    0.001 x 53/1 = 0.053    0.042
2     Transcription factor         0.01     0.01 x 53/2 = 0.27      0.042
3     Initiation of transcription  0.02     0.02 x 53/3 = 0.35      0.042
…
50    Nuclear localization         0.04     0.04 x 53/50 = 0.042    0.042
51    RNAi activity                0.055    0.055 x 53/51 = 0.057   0.057
52    Cytoplasmic localization     0.06     0.06 x 53/52 = 0.061    0.061
53    Translation                  0.07     0.07 x 53/53 = 0.07     0.07

39 Benjamini-Hochberg example
The P-value threshold is the largest P-value (the one furthest down the ranking) whose corresponding Q-value is below the desired significance threshold
Rank  Category                     P-value  Adjusted P-value        FDR / Q-value  FDR < 0.05?
1     Transcriptional regulation   0.001    0.001 x 53/1 = 0.053    0.042          Y
2     Transcription factor         0.01     0.01 x 53/2 = 0.27      0.042          Y
3     Initiation of transcription  0.02     0.02 x 53/3 = 0.35      0.042          Y
…
50    Nuclear localization         0.04     0.04 x 53/50 = 0.042    0.042          Y  <-- P-value threshold for FDR < 0.05
51    RNAi activity                0.055    0.055 x 53/51 = 0.057   0.057          N
52    Cytoplasmic localization     0.06     0.06 x 53/52 = 0.061    0.061          N
53    Translation                  0.07     0.07 x 53/53 = 0.07     0.07           N
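The three steps of the Benjamini-Hochberg example (sort, adjust by [# of tests] / rank, take the running minimum from the bottom of the table upwards) can be sketched as below. The function name and the four example P-values are illustrative and are not the 53 values of the slide, most of which are elided there.

```python
def benjamini_hochberg(p_values):
    """Return Q-values in the same order as the input P-values."""
    m = len(p_values)
    # Indices of the tests, smallest P-value first (rank 1 = smallest).
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value (rank m) up to the smallest (rank 1),
    # keeping the minimum adjusted P-value seen so far.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        adjusted = p_values[i] * m / rank
        running_min = min(running_min, adjusted)
        q_values[i] = running_min
    return q_values

# Hypothetical P-values for four annotations.
p_values = [0.01, 0.04, 0.03, 0.005]
q_values = benjamini_hochberg(p_values)
significant = [q < 0.05 for q in q_values]
print(q_values)
print(significant)
```

Taking the running minimum is what makes the Q-values non-decreasing with rank, so a single P-value threshold separates the accepted from the rejected annotations.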

40 Reducing multiple test correction stringency
- The correction to the P-value threshold depends on the # of tests that you do, so, no matter what, the more tests you do, the more sensitive the test needs to be
- Can control the stringency by reducing the number of tests: e.g. use GO slim; restrict testing to the appropriate GO annotations; or select only larger GO categories.

41 Summary
Enrichment analysis:
- Statistical tests
  - Gene list: Fisher’s Exact Test
  - Gene rankings: GSEA, KS-test
- Multiple test correction
  - Bonferroni: stringent, controls probability of at least one false positive*
  - FDR: more forgiving, controls expected proportion of false positives* (typically uses Benjamini-Hochberg)
* False positive = Type 1 error, i.e. an enrichment is observed although there is no association

42 Lab Time
Part 1
- Try out DAVID: use the yeast gene list and Module 2 gene list
- Follow Lab 2 protocol
- http://david.abcc.ncifcrf.gov/
Part 2
- Continue with the Cytoscape protocol
- http://opentutorials.rbvi.ucsf.edu/index.php/Tutorial:Introduction_to_Cytoscape
Part 3
- Try out enrichment map: load the plugin from the plugin manager
- Load DAVID results, or load the GSEA enrichment analysis file EM_EstrogenMCF7_TestData.zip (unzip), available at http://baderlab.org/Software/EnrichmentMap

43 Upload your gene list
If DAVID doesn’t recognize your genes, it can try to detect the correct identifiers to use

44 Step 1: Gene ID mapping results

45 Run the enrichment analysis

46 Run the enrichment analysis

47 We are on a Coffee Break & Networking Session

