GSEA Overview -- Workflow

Presentation transcript:

GSEA Overview -- Workflow
GSEA is a computational method that determines whether an a priori defined set of genes shows statistically significant, concordant differences between two biological states (e.g. phenotypes).

Three Main Components in GSEA
1. Algorithm
2. Software implementation (Broad Institute)
3. Database of gene sets:
   o Molecular Signatures Database (MSigDB, at the Broad Institute), containing collections of gene sets of interest
   o Utilities mapping chip features to genes (e.g., Illumina or Affymetrix probe set IDs to HUGO gene symbols)

GSEA: Compares Gene List with a number of Gene Sets
Start with a Gene List (L) ranked by t-statistic (e.g. Tumor vs. Normal) and a Gene Set (S) (e.g. Metastasis).
[Figure: running sum plotted along L; bands mark the locations of S genes in L; labels ES<0 and ES>0.]
Genes listed in the figure: ALG3, CKAP4, CPLX2, CXCL1, DAD1, DNER, ECH1, EZH2, GNAI2, GNAS, HNRPA3, HNRPUL1, HSPCB, IER3, MAPK8, METAP2, MRPS22, MYC, MYCN, NFKB1, PSMD2, PTTG1, RXRA, RXRB, SLC16A9, SNRPF, STAT1, TFAP2A, TMSB4X, TP53, TUBA1, TUBA2, TUBA3D, TUBB, UBE1
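The ranking step on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Broad Institute implementation: it assumes Welch's t-statistic as the ranking metric, and the function names (`t_statistic`, `rank_genes`), the sample labels "T" and "N", and the expression values are all hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def t_statistic(a, b):
    """Welch t-statistic for two lists of expression values (assumed metric)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def rank_genes(expr, labels, group_a="T", group_b="N"):
    """Score every gene and return (ranked gene names, scores).

    The list runs from genes most up in group_a to genes most up in group_b.
    """
    scores = {}
    for gene, values in expr.items():
        a = [v for v, lab in zip(values, labels) if lab == group_a]
        b = [v for v, lab in zip(values, labels) if lab == group_b]
        scores[gene] = t_statistic(a, b)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked, scores


# Toy data: three tumor (T) and three normal (N) samples, made-up values.
labels = ["T", "T", "T", "N", "N", "N"]
expr = {
    "MYC": [9.0, 8.5, 9.5, 5.0, 5.5, 4.5],
    "TP53": [4.0, 4.5, 3.5, 8.0, 7.5, 8.5],
    "TUBB": [6.0, 6.2, 5.8, 6.1, 5.9, 6.0],
}
ranked, scores = rank_genes(expr, labels)
print(ranked)
```

The ranked list L produced this way is the input that each gene set S is compared against.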

Enrichment Score (ES) Calculation
Start with a ranked list (L) of genes that are in (Hit) or not in (Miss) a gene set (S), using fold change (FC) as an example metric.
Σ = sum of fold changes for genes in gene set (S) (e.g., 100)
N = no. of genes in the array (e.g., 1020)
N_H = no. of genes in the gene set (S) (e.g., 20)
Contribution to running sum for ES:
Hits (genes in L that are in S): +|FC| / Σ
Misses (genes in L that are not in S): -1/(N - N_H)
ES(S) = value of maximum deviation from 0 of the running sum
[Figure: ranked list (L) ordered by FC and marked Hit, Hit, Miss, Hit, Hit, Miss, ..., with the running sum plotted along L.]
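The running sum described on this slide can be written out as a short Python sketch. It assumes that the normalizing sum is taken over absolute fold changes of the hits, so that the hit increments add up to 1; the function name `enrichment_score` and the toy gene names and fold changes are hypothetical. It does not handle an empty gene set or a set that covers the whole list.

```python
def enrichment_score(ranked_genes, metric, gene_set):
    """Return (ES, running-sum curve) for one gene set.

    Hits add |FC| / sum(|FC| over hits); misses subtract 1 / (N - N_H).
    ES is the running-sum value with the largest deviation from 0.
    """
    hits = [g for g in ranked_genes if g in gene_set]
    n, n_h = len(ranked_genes), len(hits)
    total = sum(abs(metric[g]) for g in hits)
    running, es, curve = 0.0, 0.0, []
    for g in ranked_genes:
        if g in gene_set:
            running += abs(metric[g]) / total   # hit
        else:
            running -= 1.0 / (n - n_h)          # miss
        curve.append(running)
        if abs(running) > abs(es):
            es = running
    return es, curve


# Toy ranked list with made-up fold changes.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
fc = {"g1": 4.0, "g2": 3.0, "g3": 2.0, "g4": -1.0, "g5": -2.0, "g6": -3.0}
es, curve = enrichment_score(ranked, fc, {"g1", "g2", "g4"})
print(es)     # 0.875, reached after the second hit
```

Because the hit increments sum to +1 and the miss decrements sum to -1, the curve always returns to 0 at the end of the list, which is why only the maximum deviation carries information.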

A positive ES gene set (gene list is a comparison between p53 mutant and WT)
[Figure: running enrichment score with ES(S) marked; locations of genes in S; ranking metric values with their zero crossing and + / - sides; phenotype labels p53 WT and p53 MUT.]

A negative ES gene set (gene list is a comparison between p53 mutant and WT)
[Figure: running enrichment score with ES(S) marked; locations of genes in S; ranking metric values with their zero crossing and + / - sides; phenotype labels p53 WT and p53 MUT.]

2 Ways of Testing the Significance of ES
1. Phenotype permutation: randomly shuffle phenotype labels
Example samples: T1 to T7 and N1 to N7.
Shuffle the labels 1000 x and recompute the ES each time: ES(S, π_1), ES(S, π_2), ES(S, π_3), ..., ES(S, π_1000).
[Figure: histogram of the 1000 ES(S, π) scores, with the observed ES(S) marked.]
The empirical, nominal p-value for each ES(S) is then calculated relative to the null distribution for ES(S):
p = fraction of ES(S, π) values ≥ ES(S)
Need >= 7 samples/phenotype
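A minimal Python sketch of phenotype permutation follows. Several things here are assumptions rather than part of the slide: the ranking metric is a plain difference of class means (a stand-in for the t-statistic), the tail is mirrored to "≤" when the observed ES is negative, the helper names are hypothetical, and the expression matrix is synthetic.

```python
import random
from statistics import mean


def rank_metric(expr, labels):
    """Difference of class means (T minus N) per gene; assumed stand-in metric."""
    out = {}
    for gene, vals in expr.items():
        t = [v for v, lab in zip(vals, labels) if lab == "T"]
        n = [v for v, lab in zip(vals, labels) if lab == "N"]
        out[gene] = mean(t) - mean(n)
    return out


def es_from_metric(metric, gene_set):
    """Running-sum ES over genes sorted by decreasing metric."""
    ranked = sorted(metric, key=metric.get, reverse=True)
    total = sum(abs(metric[g]) for g in ranked if g in gene_set)
    n_miss = sum(1 for g in ranked if g not in gene_set)
    if total == 0 or n_miss == 0:
        return 0.0
    running = es = 0.0
    for g in ranked:
        running += abs(metric[g]) / total if g in gene_set else -1.0 / n_miss
        if abs(running) > abs(es):
            es = running
    return es


def phenotype_permutation_p(expr, labels, gene_set, n_perm=1000, seed=0):
    """Shuffle sample labels, recompute ES each time, return (ES, null, p)."""
    rng = random.Random(seed)
    observed = es_from_metric(rank_metric(expr, labels), gene_set)
    null = []
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)
        null.append(es_from_metric(rank_metric(expr, shuffled), gene_set))
    if observed >= 0:   # slide's rule: fraction of null values >= ES(S)
        extreme = sum(1 for es in null if es >= observed - 1e-12)
    else:               # mirrored tail for a negative ES (assumption)
        extreme = sum(1 for es in null if es <= observed + 1e-12)
    return observed, null, extreme / n_perm


# Synthetic data: 7 T and 7 N samples, 20 genes, first 5 genes shifted up in T.
data_rng = random.Random(1)
labels = ["T"] * 7 + ["N"] * 7
gene_set = {f"g{i}" for i in range(5)}
expr = {
    f"g{i}": [data_rng.gauss(3.0 if (i < 5 and lab == "T") else 0.0, 1.0)
              for lab in labels]
    for i in range(20)
}
observed, null, p = phenotype_permutation_p(expr, labels, gene_set, n_perm=200)
print(observed, p)
```

Shuffling labels rather than genes keeps the gene-to-gene correlation structure intact in every permutation, which is why this variant needs enough samples per phenotype to generate many distinct labelings.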

2 Ways of Testing the Significance of ES
2. Gene set permutation: randomly select genes for gene set
Use when n < 7 samples/phenotype (example samples: T1 to T4 and N1 to N4).
[Figure: randomly selected gene sets shown as letter strings (ACBGDXMPQ, KYWLFHG, IP, CUKTVZWRS), each giving one of ES(S, π_1), ES(S, π_2), ES(S, π_3), ..., ES(S, π_1000); histogram of the 1000 ES(S, π) scores, with the observed ES(S) marked.]
The empirical, nominal p-value for each ES(S) is then calculated relative to the null distribution for ES(S):
p = fraction of ES(S, π) values ≥ ES(S)
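Gene set permutation can be sketched the same way. Here the ranked list stays fixed and only the membership of the gene set is randomized; drawing random sets of the same size as the original is an assumption, as are the function names and the toy ranking metric.

```python
import random


def running_sum_es(ranked, metric, gene_set):
    """Running-sum ES for a fixed ranked list and one gene set."""
    total = sum(abs(metric[g]) for g in ranked if g in gene_set)
    n_miss = sum(1 for g in ranked if g not in gene_set)
    if total == 0 or n_miss == 0:
        return 0.0
    running = es = 0.0
    for g in ranked:
        running += abs(metric[g]) / total if g in gene_set else -1.0 / n_miss
        if abs(running) > abs(es):
            es = running
    return es


def gene_set_permutation_p(ranked, metric, gene_set, n_perm=1000, seed=0):
    """Draw random gene sets of the same size and return (ES, null, p)."""
    rng = random.Random(seed)
    observed = running_sum_es(ranked, metric, gene_set)
    size = sum(1 for g in ranked if g in gene_set)
    null = [running_sum_es(ranked, metric, set(rng.sample(ranked, size)))
            for _ in range(n_perm)]
    if observed >= 0:   # slide's rule: fraction of null values >= ES(S)
        extreme = sum(1 for es in null if es >= observed - 1e-12)
    else:               # mirrored tail for a negative ES (assumption)
        extreme = sum(1 for es in null if es <= observed + 1e-12)
    return observed, null, extreme / n_perm


# Toy ranked list: 50 genes with a made-up, steadily decreasing metric.
ranked = [f"g{i}" for i in range(50)]
metric = {g: (25 - i) * 0.1 - 0.05 for i, g in enumerate(ranked)}
top_set = {"g0", "g1", "g2", "g3", "g4"}
observed, null, p = gene_set_permutation_p(ranked, metric, top_set)
print(observed, p)
```

Because the sample labels are never touched, this variant works with very few samples, at the cost of ignoring correlations between genes in the same set.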

How normalized enrichment scores (NES) are calculated from ES
(using the NES helps normalize out the effect of different gene set sizes)
NES(S, π_k) = ES(S, π_k) / mean{ES(S, π) values with the same sign as ES(S, π_k)}
For each permutation π and gene set S, compute NES(S, π) to use in computing the FDR.
[Figure: histogram of ES(S, π) scores (ES(S, π_1), ES(S, π_2), ES(S, π_3)) rescaled to a histogram of NES(S, π) scores; labels NES*, NES(S, π) ≥ NES*, FDR q-value (<0.05).]
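The normalization on this slide, and one way the normalized scores could feed an FDR estimate, can be sketched as follows. Two points are assumptions: the mean is taken over absolute values so that a negative ES keeps its sign, and the q-value for a positive NES* is estimated as the fraction of pooled non-negative null NES values at or above NES*, divided by the fraction of non-negative observed NES values at or above NES*. All names and numbers are hypothetical.

```python
from statistics import mean


def normalize(es, null_es):
    """Divide ES by the mean magnitude of the null ES values sharing its sign."""
    same_sign = [abs(x) for x in null_es if (x >= 0) == (es >= 0)]
    return es / mean(same_sign) if same_sign else 0.0


def nes_table(observed, nulls):
    """observed: {set: ES(S)}; nulls: {set: [ES(S, pi_k), ...]}."""
    obs_nes = {s: normalize(es, nulls[s]) for s, es in observed.items()}
    null_nes = {s: [normalize(x, vals) for x in vals] for s, vals in nulls.items()}
    return obs_nes, null_nes


def fdr_q_positive(nes_star, obs_nes, null_nes):
    """Estimated FDR q-value for a positive threshold NES* (assumed definition)."""
    pooled = [x for vals in null_nes.values() for x in vals if x >= 0]
    obs_pos = [x for x in obs_nes.values() if x >= 0]
    if not pooled or not obs_pos:
        return 1.0
    null_frac = sum(1 for x in pooled if x >= nes_star) / len(pooled)
    obs_frac = sum(1 for x in obs_pos if x >= nes_star) / len(obs_pos)
    return 1.0 if obs_frac == 0 else min(1.0, null_frac / obs_frac)


# Made-up scores for two gene sets with five permutations each.
observed = {"A": 0.8, "B": 0.25}
nulls = {
    "A": [0.2, 0.4, -0.3, 0.6, -0.5],
    "B": [0.1, 0.3, -0.2, 0.2, -0.4],
}
obs_nes, null_nes = nes_table(observed, nulls)
q_a = fdr_q_positive(obs_nes["A"], obs_nes, null_nes)
q_b = fdr_q_positive(obs_nes["B"], obs_nes, null_nes)
print(obs_nes, q_a, q_b)
```

In this toy case set A has NES 2.0 and set B has NES 1.25; no pooled null value reaches 2.0, so the estimated q-value for A is 0, while two of the six pooled positive null values exceed 1.25, giving about 0.33 for B.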

MSigDB