Functional Enrichment and Pathway Analysis – I Daniele Merico PhD, Molecular and Cellular Biology Post-doctoral Research Fellow, CCBR, U. of T.

Outline of these lectures
Goal: identifying functional “themes” and “patterns” in microarray data
Lesson 1: Gene-set Enrichment Analysis
– Data sources
– Statistical methods
– Visualization
Lesson 2: Networks and Pathways
– Networks: data sources and visualization
– Pathways

PART 1 Introduction How do we relate microarray expression data to biological function?

Analysis Workflow
1. Define the experimental design
2. Collect the biological samples
3. Generate the expression data
4. Identify the Differential Genes
5. Identify the Functional Groups

From differential genes to biological functions
– How do my data relate to known biological functions?
– Are there specific functions that are characterized by gene expression changes?


Identification of Functional Groups
– GENE SETS (e.g. Spindle: Gene.1, Gene.2, Gene.3; P53 signaling: Gene.2, Gene.4, Gene.5): score the set depending on the gene expression of its member genes. (This lecture)
– NETWORKS: just visual, or identify modules satisfying some joint gene expression and topology requirement. (Next week's lecture)
– PATHWAYS: just visual, or score the pathways exploiting gene expression and topology. (Next week's lecture)

PART 2 Gene-set Enrichment Analysis What is gene-set enrichment analysis? How does it help interpret microarray data?

What’s Gene-set Enrichment Analysis?
Break down cellular function into gene sets
- Every set of genes is associated with a specific cellular function, process, component or pathway
Example gene sets:
– Nuclear Pore: Gene.AAA, Gene.ABA, Gene.ABC
– Cell Cycle: Gene.CC1, Gene.CC2, Gene.CC3, Gene.CC4, Gene.CC5
– Ribosome: Gene.RP1, Gene.RP2, Gene.RP3, Gene.RP4
– P53 signaling: Gene.CC1, Gene.CK1, Gene.PPP

What’s Gene-set Enrichment Analysis?
Microarray data can be related to gene sets in order to mine its functional meaning
- Which gene-sets best summarize the gene expression patterns?
Each gene set in the example (Nuclear Pore, Cell Cycle, Ribosome, P53 signaling) receives a call: UP, DOWN, or NOT SIGNIFICANT. This is the meaning of significant enrichment.
We will see the “statistical” definition of enrichment in PART 4

PART 3 Gene-set Enrichment: Data What data sources are available for gene-set enrichment analysis?

Gene-set Data Sources
Break down cellular function into gene sets (e.g. Nuclear Pore, Cell Cycle, Ribosome, P53 signaling)
– Where can I get these gene-sets?
– How were the gene-sets compiled?
– How are they structured?

Gene Ontology (GO)
Gene Ontology is:
– hierarchically structured: functional categories are organized hierarchically, i.e. as a system of inter-related sets with increasing specificity of scope (parent-child relations)
– a controlled vocabulary: functional categories are defined by experts, and must then be used consistently for annotation
– for gene product function annotation: gene products (e.g. proteins) are annotated using GO functional categories (“terms”)
– general for all species

Gene Ontology: Example
Terms are organized hierarchically
– Terms at the top are more general, terms at the bottom are narrower in scope
– If a protein is annotated as Spindle, the annotation should also be automatically inferred for all ancestors of Spindle (up-propagation)

Gene Ontology: Example
[Figure: a PARENT term and its CHILD term in the GO graph, with the corresponding gene-sets; genes shown: ABB1, ACAP3, TRAC1, LUC2, POF5, ZUMM, C5A75, DUCZ]
The set corresponding to the CHILD is a subset of the one corresponding to the PARENT

Gene Ontology: Partitions
GO has three independent partitions, which are not interconnected:
– Molecular Function: describes biochemical activities, in-vitro binding specificities, etc. Examples: Ligase Activity, Kinase Activity, DNA Binding
– Cellular Component: describes parts of the cell. Examples: Mitochondrion, Spindle Microtubule
– Biological Process: describes processes at the intra-cellular and organism level. Examples: DNA Replication, Apoptosis, Development
[Figure: first-level children of MOLECULAR FUNCTION, CELLULAR COMPONENT and BIOLOGICAL PROCESS]

Gene Ontology Levels Every partition has several levels… ROOT LEVEL-1 LEVEL-2 LEVEL-N

Gene Ontology Levels
However, terms at the same level don’t necessarily have the same degree of granularity (i.e. specificity of scope)
Example: under BIOLOGICAL PROCESS, the terms SIGNALING, IMMUNE SYSTEM PROCESS and PIGMENTATION have different granularity

Gene Ontology Annotations
How are genes annotated with GO terms?
Human curators go through the literature, mining it for gene functions
- Different genomic databases take part in this effort
- Evidence Codes are used to keep track of the type of evidence for annotation
- IEA annotations are directly imported from databases, without human curation
Important Note: primary annotations are not propagated using the ontology; therefore, when you download GO gene-sets, always make sure that up-propagation was done

Gene Ontology Evidence Codes
– ISS: Inferred from Sequence/Structural Similarity
– IDA: Inferred from Direct Assay
– IPI: Inferred from Physical Interaction
– IMP: Inferred from Mutant Phenotype
– IGI: Inferred from Genetic Interaction
– IEP: Inferred from Expression Pattern
– TAS: Traceable Author Statement
– NAS: Non-traceable Author Statement
– IC: Inferred by Curator
– ND: No Data available
– IEA: Inferred from Electronic Annotation
More at:

Gene Ontology Evidence Codes
How should I use evidence codes?
– Quality Filter for Gene-set Enrichment
  - Sometimes IEA (Electronic Annotations) are considered less reliable, and are not used for analysis
  - However, this should be evaluated very carefully and cannot be generalized
– Gene Browsing
  - If you are interested in the function of a specific gene, you can check whether multiple lines of evidence are available

Annotation Inheritance
There are primary and inherited annotations
– Primary Annotations: originally defined by curators
– Inherited Annotations: back-propagated along the hierarchy
Always check if the gene ontology annotation resource you are using includes inherited annotations!
Example:
– Primary Annotation: Spindle
– Inherited Annotations: Microtubule Cytoskeleton, Cytoskeletal Part, Cytoskeleton, Intracellular Organelle Part, …
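Annotation inheritance can be sketched in a few lines of Python. This is only an illustration: the child-to-parent edges in PARENTS reuse the term names from the Spindle example but are assumed here, not read from GO, and GeneX is a hypothetical gene.

```python
# Hypothetical child -> parents map. Term names follow the Spindle example,
# but the exact edges are assumed for illustration and not taken from GO.
PARENTS = {
    "Spindle": ["Microtubule Cytoskeleton"],
    "Microtubule Cytoskeleton": ["Cytoskeletal Part"],
    "Cytoskeletal Part": ["Cytoskeleton", "Intracellular Organelle Part"],
    "Cytoskeleton": [],
    "Intracellular Organelle Part": [],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def up_propagate(primary, parents):
    """Map each gene to its primary terms plus all inherited (ancestor) terms."""
    full = {}
    for gene, terms in primary.items():
        inherited = set(terms)
        for term in terms:
            inherited |= ancestors(term, parents)
        full[gene] = inherited
    return full


primary_annotations = {"GeneX": {"Spindle"}}  # hypothetical gene
full_annotations = up_propagate(primary_annotations, PARENTS)
print(sorted(full_annotations["GeneX"]))
```

A resource that ships only primary annotations would list GeneX under Spindle alone; after up-propagation it also appears in the gene-sets of every ancestor term.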

Gene Ontology: Multi-function Besides hierarchical term organization, genes can be multi-functional, i.e. annotated by many independent terms – In the following slide we see an excerpt of p53 (the “Warden of Genome”) annotations, as reported by the NCBI database Entrez-Gene

Gene Ontology: Statistics
Terms:
– 29,922 Total Terms
– 8,688 Molecular Function
– 2,689 Cellular Component
– 18,545 Biological Process
Annotated Genes (Entrez-Gene):
– 17,482 Human
– 18,028 Mouse

Exploring Gene Ontology: QuickGO
[Screenshots of the QuickGO browser; callouts: New search, Essential Data, Term in the GO graph]

Gene-sets: Beyond Gene Ontology
There are many other sources and types of gene-sets:
- Pathways (e.g. KEGG)
- Protein Families / Domains (e.g. PFAM)
- Predicted Targets of Regulators (e.g. MSigDB-c3)
  - miRNA, Transcription Factors
- Protein-protein Interaction Modules
- Gene Expression
  - Up/down after treatment or in relation to disease (e.g. MSigDB-c2)
  - Co-expression across many conditions (e.g. MSigDB-c4)
- Genotype-phenotype association (e.g. DiseaseHub)
- Genomic position (e.g. MSigDB-c1)

Pathways and GO Biol. Process
How do pathways and processes differ?
– From a purely biological perspective, the question is philosophical (still worth speculating…)
– From a bioinformatics perspective:
  - A gene is annotated for a GO Biological Process if the curators deem that it contributes (significantly) to the process (which is at the cellular or organ level), according to several lines of evidence
  - Pathways include the “wiring” of genes/gene products, hence they rely on a more intensive curation process
  - Some pathways include large ubiquitous actors (such as the proteasome) that may confound enrichment analysis, whereas these are usually absent from GO processes

A pathway example: the MAPK cascade in KEGG

Major Gene-set Resources A-Z
– Bioconductor
  - GO: GO.db + org.Xx.eg.db (org.Xx.egGO2ALLEGS)
  - KEGG: KEGG.db + org.Xx.eg.db (org.Xx.egPATH)
  - PFAM: PFAMEDE + org.Xx.eg.db (org.Xx.egPFAM)
  - Note: Xx has to be replaced with the species id {Hs, Mm, Rn, etc…}
– DiseaseHub
  - Phenotype-genotype (OMIM, GAD, HGMD, PharmGKB, CGP, GWAS)
– MSigDB
  - GO (*no IEA), Pathways (KEGG, Biocarta, STKE, GenMAPP, PharmGKB, GEArray), Predicted Targets (miRNA: ?, TF: Transfac), Gene Expression, Genomic Positions
– PathwayCommons
  - Pathways: Reactome, NCI, Cell map
– WhichGenes
  - GO, Pathways (KEGG, Biocarta, Reactome), Genomic Positions, Regulators (miRNA: TargetScan, miRBase), Phenotype-genotype (geneCards Disease, CancerGenes)

Exploring MSigDB (1-6)
[Screenshots of the MSigDB web interface; callouts by slide]
– (2) Alzheimer
– (3) Select this gene-set
– (5) I now want to see how the gene-set I was interested in overlaps with other gene-sets in the collection (I selected only a few types)
– (6) We will see how this p-value is computed and what it means in the next part (enrichment methods)

Gene-set Resources
Tips to navigate the resource ocean / 1:
– Start your analysis using only a few, reliable sources (e.g. GO, KEGG)
  - GO also has a very large gene coverage
– After the first-pass analysis, expand your gene-set collection to the types you are interested in
– Don’t try everything together from the beginning
– Remember quality and clarity!
  - Target predictions may be unreliable
  - Gene expression-derived sets are often hard to interpret

Gene-set Resources
Tips to navigate the resource ocean / 2:
– If you are confident with R, start from Bioconductor, and supplement the missing pathways by shopping around
  - GO: Bioconductor
  - Pathways: Pathway Commons
  - Phenotype-genotype: DiseaseHub
  - Gene Expression: MSigDB
  - Useful scripts available at:
– If you are not confident with R, and you are a GSEA user, use MSigDB and Pathway Commons
  - From both resources you can download GMT files (GMT is the format used by GSEA)
  - Remember that GO gene-sets in MSigDB do not have IEA-backed annotations
– Both Bioconductor and MSigDB incorporate GO inherited annotations (back-propagated)

Summary of PART 3
Gene-set Data Sources
– Gene Ontology, a hierarchically structured controlled vocabulary for gene function annotation, is the main source of gene-sets
– Other valuable sources are available, such as pathway databases
In the next part we will see how to use gene-sets for enrichment analysis…

Now, take a…

And ready to dive again!

PART 4 Gene-set Enrichment: Methods What statistical methods can I use to score gene-sets for enrichment?

Enrichment Test
Inputs:
– Microarray Experiment (gene expression table): experimental data
– Gene-set Databases: a priori knowledge + existing experimental data
  - e.g. Apoptosis: FADD, TRADD, CYTC1, BAX, BAXL, CASP9, CASP10, …
  - e.g. Spindle: SPP1, SPP2, CCCP, MTC1, …
Both inputs feed the ENRICHMENT TEST, which produces the Enrichment Table, used for Interpretation & Hypotheses.
How is the enrichment test performed?

Two-class Design
Expression Matrix (Class-1 vs Class-2) --> Genes Ranked by Differential Statistic --> Selection by Threshold (UP / DOWN)
Differential statistic, e.g.:
- Fold change
- Log (ratio)
- t-test
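For the two-class design, the ranking and thresholding steps might look like the following sketch. The expression values, the choice of a Welch t statistic, and the cut-off of 4.0 are all assumptions made for illustration; a real analysis would typically use a dedicated package and a multiple-testing-aware threshold.

```python
import math
from statistics import mean, stdev


def t_statistic(class1, class2):
    """Welch t statistic comparing two lists of expression values."""
    var1, var2 = stdev(class1) ** 2, stdev(class2) ** 2
    standard_error = math.sqrt(var1 / len(class1) + var2 / len(class2))
    return (mean(class1) - mean(class2)) / standard_error


# Hypothetical log2 expression values: gene -> (Class-1 replicates, Class-2 replicates)
expression = {
    "Gene.1": ([8.0, 8.2, 7.9], [5.0, 5.1, 4.8]),
    "Gene.2": ([6.0, 6.1, 5.9], [6.0, 5.9, 6.2]),
    "Gene.3": ([3.0, 3.2, 2.9], [6.1, 6.0, 6.3]),
}

# Score every gene, then rank from most UP to most DOWN in Class-1 vs Class-2.
scores = {gene: t_statistic(c1, c2) for gene, (c1, c2) in expression.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

THRESHOLD = 4.0  # arbitrary cut-off, for illustration only
up = [gene for gene in ranked if scores[gene] >= THRESHOLD]
down = [gene for gene in ranked if scores[gene] <= -THRESHOLD]

for gene in ranked:
    print(gene, round(scores[gene], 2))
print("UP:", up, "DOWN:", down)
```

The `ranked` list is what a whole-distribution method consumes, while `up` and `down` are the thresholded lists a Fisher-style test would use.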

Time-course Design
Expression Matrix (t1, t2, t3, …, tn) --> Gene Clusters
E.g.:
- K-means
- K-medoids
- SOM

Other Designs
Expression Matrix --> Significant Genes
E.g.:
- ANOVA
- Linear Model

Enrichment Test
The array genes from the Microarray Experiment (gene expression table) are split into:
– Significant genes (e.g. UP)
– Background genes (array genes not significant)
A gene-set from the Gene-set Databases is then compared with the significant genes:
– Overlap between significant genes and gene-set
Is this overlap larger than expected by randomly sampling the array genes?
Statistical Model: Fisher’s Exact Test
Fisher’s Exact Test does not require actually performing the random sampling; it is based on a theoretical null-hypothesis distribution (Hypergeometric Distribution)

Fisher’s Exact Test For Gene-set Enrichment
[Figure: 2×2 table with cells a, b, c, d --> Enrichment P-value]
a, b, c, d are the sizes of the four subsets (each subset has a different color)
MEMO:
– P-value ~ 0 --> significant
– P-value ~ 1 --> not significant
R: help(fisher.test)
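As a rough illustration of where the enrichment p-value comes from, the one-sided (over-representation) test can be written directly from the hypergeometric distribution. The assignment of a, b, c, d to cells below is an assumption, since the figure's layout is not recoverable from the transcript, and the counts in the example are invented.

```python
from math import comb


def enrichment_p_value(a, b, c, d):
    """One-sided Fisher's exact test for over-representation.

    Assumed cell layout (not taken from the slide):
      a = significant genes in the gene-set
      b = significant genes not in the gene-set
      c = background genes in the gene-set
      d = background genes not in the gene-set
    Returns P(overlap >= a) under the hypergeometric null.
    """
    n_significant = a + b
    n_in_set = a + c
    n_total = a + b + c + d
    largest_overlap = min(n_significant, n_in_set)
    # Sum the hypergeometric probabilities of every overlap at least as large as a.
    tail = sum(
        comb(n_in_set, k) * comb(n_total - n_in_set, n_significant - k)
        for k in range(a, largest_overlap + 1)
    )
    return tail / comb(n_total, n_significant)


# Invented counts: 12 of 100 significant genes fall in a 40-gene set on a 2000-gene array.
p = enrichment_p_value(a=12, b=88, c=28, d=1872)
print(f"enrichment p-value: {p:.3g}")
```

R's fisher.test with alternative = "greater" computes the same one-sided quantity; by default it reports a two-sided p-value.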

Fisher’s Exact Test For Gene-set Overlap We can also use Fisher’s Exact Test to evaluate the overlap between gene-sets from databases Going back to MSigDB… Now we know where these p-values come from!

Web Resources for Fisher’s Exact Test
– ConceptGen
  Note: free account required
– DAVID
  Note: a thorough description of how to use it is given in this paper: Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4(1): PMID:

Beyond Fisher’s Test
ENRICHMENT TEST, two families of methods:
– Threshold-dependent (e.g. Fisher’s Test): uses only the genes selected as UP or DOWN
– Whole-distribution (e.g. GSEA): uses the whole ranked gene list, from UP to DOWN

Beyond Fisher’s Test
Whole-distribution methods have been shown to be more stable and statistically powerful. Threshold-dependent methods suffer from:
– No “natural” value for the threshold
– Different results at different threshold settings
– Loss of information due to thresholding
  - No resolution between significant signals with different strengths
  - Weak signals neglected
--> Use whole-distribution methods whenever possible

GSEA Enrichment Test / 1
The Expression Matrix is converted into a Ranked Gene List using:
– Two-class comparison (Class-1 vs Class-2)
  - Fold change
  - Log (ratio)
  - t-test
  - SAM
– Correlation to phenotype (Quantitative Phenotype)
  - Pearson correlation

GSEA Enrichment Test / 2
Ranked Gene List + Gene-set Databases --> GSEA --> Enrichment Table (columns: Gene-set, p-value, FDR; example rows: Spindle, Apoptosis)
– The p-value depends only on the single gene-set performance
– The FDR depends on the performance of all gene-sets

GSEA: Method
Steps
1. Calculate the ES score
2. Generate the ES distribution for the null hypothesis using permutations (see permutation settings)
3. Calculate the empirical p-value
4. Calculate the FDR
Subramanian A, Tamayo P, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A Oct 25;102(43)

GSEA: Method
ES score calculation
– Where are the gene-set genes located in the ranked list? Is their distribution random, or is there an enrichment at either end?
– Every present gene (black vertical bar) gives a positive contribution, and every absent gene (no vertical bar) gives a negative contribution, to the running ES score
– MAX running ES score --> Final ES Score
– High ES score <--> High local enrichment
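The running-sum idea can be sketched as below. This is a simplified, unweighted variant chosen for clarity: every hit adds the same increment and every miss subtracts the same decrement, whereas the published GSEA method weights hits by the ranking statistic. Gene names are hypothetical.

```python
def running_enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (simplified sketch).

    Walks down the ranked list, stepping up for genes in the set and down
    for genes outside it; the final ES is the largest deviation from zero.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_misses = len(ranked_genes) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("gene set must cover some, but not all, ranked genes")
    running = 0.0
    profile = []
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_misses
        profile.append(running)
    enrichment_score = max(profile, key=abs)
    return enrichment_score, profile


ranked_list = ["g1", "g2", "g3", "g4", "g5", "g6"]  # most UP first
es_top, profile_top = running_enrichment_score(ranked_list, {"g1", "g2"})
es_bottom, _ = running_enrichment_score(ranked_list, {"g5", "g6"})
print(es_top, es_bottom)
```

A set concentrated at the top of the list yields a large positive score, and one concentrated at the bottom yields a large negative score.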

GSEA: Method
Empirical p-value estimation (for every gene-set)
1. Generate the null-hypothesis distribution from randomized data (see permutation settings)
   [Histogram: distribution of ES from N permutations (e.g. 2000); x-axis: ES Score, y-axis: Number of instances]
2. Estimate the empirical p-value by locating the real ES score value in this distribution
   Randomized with ES ≥ real: 4 / 2000 --> Empirical p-value = 0.002
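The counting step behind the empirical p-value is short enough to write out. The sketch below assumes a positive observed ES (for a negative ES one would count permutations with ES at or below the observed value) and uses made-up null scores that reproduce the 4-in-2000 example.

```python
def empirical_p_value(observed_es, null_scores):
    """Fraction of permutation scores at least as large as the observed ES."""
    if not null_scores:
        raise ValueError("need at least one permutation score")
    at_least_as_extreme = sum(1 for score in null_scores if score >= observed_es)
    return at_least_as_extreme / len(null_scores)


# Made-up null distribution: 2000 permutations, 4 of which reach the observed ES.
null_distribution = [0.9] * 4 + [0.1] * 1996
p_value = empirical_p_value(0.8, null_distribution)
print(p_value)
```

The resolution of such a p-value is limited by the number of permutations: with 2000 permutations the smallest non-zero value is 1/2000.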

GSEA Settings: Permutation
Permutation settings have important implications, which we will not discuss in detail
Practical suggestions:
– When biological replicates are very similar within classes and classes are well separated --> gene permutation
– When biological replicates tend to be dissimilar, or stratified according to hidden experimental factors --> use other whole-distribution enrichment methods of self-contained type (e.g. SAM-GS)

GSEA Settings: Gene-set Filter
Gene-sets for enrichment analysis are usually filtered by size
– Large gene-sets are undesired if they are derived from Gene Ontology or other functional resources, as they usually correspond to uninformative concepts (e.g. Regulation of Biopolymer Catabolism)
– Small gene-sets are undesired as their statistics are quite noisy, and they may decrease the FDR of other sets
– See the Using GSEA section for the specific values of the size filtering settings
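A size filter of this kind can be sketched as follows. The function name, the example sets and the bounds of 3 and 6 are hypothetical; the sketch also assumes that size is counted after restricting each set to genes measured on the array.

```python
def filter_by_size(gene_sets, array_genes, min_size, max_size):
    """Keep gene-sets whose number of measured genes lies within [min_size, max_size]."""
    measured = set(array_genes)
    kept = {}
    for name, genes in gene_sets.items():
        present = set(genes) & measured  # ignore genes absent from the array
        if min_size <= len(present) <= max_size:
            kept[name] = present
    return kept


# Hypothetical gene-sets and array content, for illustration only.
gene_sets = {
    "tiny_set": ["g1", "g2"],
    "medium_set": ["g1", "g2", "g3", "g4"],
    "huge_set": ["g%d" % i for i in range(1, 11)],
}
array_genes = ["g%d" % i for i in range(1, 9)]  # g1 .. g8

kept_sets = filter_by_size(gene_sets, array_genes, min_size=3, max_size=6)
print(sorted(kept_sets))
```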

Using GSEA
Installation
Launch Desktop Application from:
Notes:
– if you have sufficient RAM (*), go for the 1Gb option
– running GSEA will take some time (2-5 hrs depending on the system and the memory setting)
– you need an internet connection to run GSEA
(*) WIN: check using CTRL+ALT+DEL / Task Manager; MAC: check using Applications/Utilities/Activity Monitor

Using GSEA
Data Format
There are three data files you will need:
– Gene-set (.GMT)
– Gene Expression Table (.txt)
– Gene Expression Phenotypes (.CLS)
The format requirements follow. More on GSEA data formats:

Using GSEA
Data Format: gene-set file (.GMT)
Syntax: <Name> [\tab] <Description> [\tab] <Gene 1> [\tab] <Gene 2> …
Notes:
– Either use the gene-set ID for the Name (e.g. GO ID) and the gene-set full name for the Description
– Or use the gene-set full name for the Name and the source database for the Description
Example:
regulation of DNA recombination GO:
transition metal ion transport GO:
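Reading a GMT file takes only a few lines. The parser below is a minimal sketch that assumes one gene-set per line with tab-separated name, description and gene IDs; the set names and IDs in the example are placeholders rather than real database entries.

```python
def parse_gmt(text):
    """Parse GMT-formatted text into {name: {"description": ..., "genes": [...]}}."""
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, description, *genes = line.split("\t")
        gene_sets[name] = {
            "description": description,
            "genes": [gene for gene in genes if gene],  # drop empty trailing fields
        }
    return gene_sets


# Placeholder content: names, descriptions and gene IDs are invented.
example_gmt = (
    "SET_A\tfirst hypothetical set\t1001\t1002\t1003\n"
    "SET_B\tsecond hypothetical set\t1002\t2001\n"
)
parsed = parse_gmt(example_gmt)
for set_name, record in parsed.items():
    print(set_name, record["description"], len(record["genes"]))
```

Whichever ID type is used here must match the IDs in the gene expression table.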

Using GSEA
Data Format: gene expression table file (.txt)
Syntax: table with columns <Name> [\tab] <Description> [\tab] <Sample 1> [\tab] <Sample 2> …
Notes:
– Use the gene ID for the Name (e.g. EntrezGene ID) and the gene symbol and/or full name for the Description
– I recommend using EntrezGene IDs, for a number of reasons
– Gene IDs must be consistent between the GMT file and this file
Example:

Using GSEA
Data Format: expression phenotypes file (.CLS)
Line 1: 9 3 1 (Number of samples, Number of classes, Always 1)
Line 2: # Tg-A Tg-B WT (Class Labels)
Line 3: Tg-A Tg-A Tg-A Tg-B Tg-B Tg-B WT WT WT (Phenotype labels for all samples in the gene expression table)
Use space as separator
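A small helper can build the three CLS lines from a list of per-sample labels, which avoids miscounting samples or classes by hand. The function name is hypothetical, and the sketch assumes the categorical CLS layout described above, with classes listed in order of first appearance.

```python
def make_cls(sample_labels):
    """Build the text of a categorical .CLS file from per-sample class labels."""
    # Unique class names, in order of first appearance among the samples.
    classes = list(dict.fromkeys(sample_labels))
    header = f"{len(sample_labels)} {len(classes)} 1"
    lines = [header, "# " + " ".join(classes), " ".join(sample_labels)]
    return "\n".join(lines) + "\n"


labels = ["Tg-A"] * 3 + ["Tg-B"] * 3 + ["WT"] * 3
cls_text = make_cls(labels)
print(cls_text, end="")
```

The order of labels must follow the order of the sample columns in the gene expression table.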

Using GSEA Load the data


Using GSEA
Run the analysis – Parameter setting / 1
– Load gene expression table here
– Load gene-set (.GMT) file here
– Load phenotype file (.CLS) here
– 2000 (number of permutations)
– gene-set (permutation type)
– If your gene expression table has probe IDs already matching the .GMT file, set this to FALSE.
– If your gene expression table has probe IDs already matching the .GMT file, you don’t need this.

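Using GSEA
Run the analysis – Parameter setting / 2
– Differential statistic: use t-test (or signal-to-noise) if you have at least 3 replicates.
– Gene-set size filter, lower bound: 10 is usually good. Keep between 7-8 and [value missing in source].
– Gene-set size filter, upper bound: [value missing in source] is usually good. Keep between 500 and 800.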

Using GSEA
GSEA Pre-ranked
– If you wish to use a statistic for differential expression other than those offered by GSEA, you can use the Pre-ranked mode
More on GSEA pre-ranked data format: #RNK:_Ranked_list_file_format_.28.2A.rnk.29

Summary of PART 4
Methods for Gene-set Enrichment
– Fisher’s Exact Test can be used for any given set of experimental genes
– When possible, use GSEA to achieve greater power
– Both GSEA and Fisher’s Exact Test require scoring genes for significance or differential expression; how this is done depends on the microarray design

Now, take a…

And ready to dive again!

PART 5 Gene-set Enrichment: Visualization How to use enrichment analysis to functionally map cellular activity. Or, everything finally coming together.

Gene-set Enrichment: Redundancy Problem
Many redundant gene-sets
– Gene Ontology has a very large number of gene-sets, often with slight differences
– Different pathway databases have different yet overlapping definitions of pathways
– Globally, it is useful to grasp the overlap relations between enriched gene-sets
--> we need a visualization framework going beyond the enrichment table
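One simple way to quantify overlap between enriched gene-sets is a pairwise similarity score. The sketch below uses the Jaccard index, which is an assumption made for illustration rather than the measure used in the lecture; the gene memberships and the 0.5 cut-off are invented.

```python
from itertools import combinations


def jaccard(set_a, set_b):
    """Jaccard index: shared genes divided by all genes in either set."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Invented memberships for three enriched terms.
enriched = {
    "adaptive immune response": {"G1", "G2", "G3", "G4"},
    "lymphocyte mediated immunity": {"G2", "G3", "G4", "G5"},
    "heart development": {"G8", "G9"},
}

# Score every pair of gene-sets once.
pair_scores = {
    (first, second): jaccard(enriched[first], enriched[second])
    for first, second in combinations(sorted(enriched), 2)
}
redundant_pairs = [pair for pair, score in pair_scores.items() if score >= 0.5]
print(redundant_pairs)
```

Pairs above the cut-off are candidates for being drawn as connected nodes, so that groups of near-duplicate terms appear as clusters instead of separate table rows.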

GO.id GO.name p.value covercover.rat Deg.mdn Deg.iqr GO: taxis 2.18E GO: chemotaxis 2.18E GO: adaptive immune response based on somatic recombination 7.10E GO: adaptive immune response 7.10E GO: leukocyte mediated immunity GO: B cell mediated immunity GO: myeloid cell differentiation GO: immune effector process GO: regulation of phagocytosis GO: positive regulation of phagocytosis GO: lymphocyte mediated immunity GO: growth factor binding GO: protein polymerization GO: endoplasmic reticulum membrane GO: immunoglobulin mediated immune response GO: heart development GO: response to bacterium GO: regulation of endocytosis GO: acute inflammatory response GO: positive regulation of endocytosis GO: myeloid leukocyte activation GO: amino acid biosynthetic process GO: regulation of inflammatory response GO: activation of immune response GO: positive regulation of immune system process GO: positive regulation of immune response GO: antigen processing and presentation GO: regulation of immune system process GO: regulation of immune response GO: negative regulation of enzyme activity GO: phagocytosis GO: myeloid leukocyte differentiation GO: humoral immune response GO: lymphocyte activation GO: leukocyte chemotaxis GO: negative regulation of protein kinase activity GO: negative regulation of transferase activity GO: transforming growth factor beta receptor signaling pathw GO: insulin-like growth factor binding GO: T cell activation GO: humoral immune response mediated by circulating immunogl GO: cytosolic ribosome (sensu Eukaryota) GO: protein amino acid N-linked glycosylation GO: positive regulation of multicellular organismal process GO: chemokine receptor binding GO: chemokine activity GO: Wnt receptor signaling pathway

Terms called out from the enrichment table above: adaptive immune response based on somatic recombination; adaptive immune response; leukocyte mediated immunity; B cell mediated immunity; myeloid cell differentiation; immune effector process; regulation of phagocytosis; positive regulation of phagocytosis; lymphocyte mediated immunity

Gene-set Enrichment: Redundancy Problem
How to handle the redundancy problem?
– Statistical solutions: correct for inter-redundancy and prioritize the most enriched gene-sets. These don't always work well and are not available for all tests --> not discussed here
– Visualization solution: visualize gene-set overlap as a network, using Enrichment Map (Cytoscape plugin)
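To make the "overlap as a network" idea concrete, here is a minimal sketch in Python. It is not the Enrichment Map implementation. It assumes the overlap coefficient is defined as |A ∩ B| / min(|A|, |B|) and links two gene-sets when the coefficient reaches a cut-off. The function names, the toy gene symbols and the 0.5 cut-off are illustrative assumptions.

```python
from itertools import combinations


def overlap_coefficient(a, b):
    """Return |A ∩ B| / min(|A|, |B|), or 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def overlap_edges(gene_sets, cutoff=0.5):
    """Return (name1, name2, coefficient) for each pair at or above the cutoff."""
    edges = []
    named = sorted(gene_sets.items(), key=lambda kv: kv[0])
    for (n1, g1), (n2, g2) in combinations(named, 2):
        coef = overlap_coefficient(g1, g2)
        if coef >= cutoff:
            edges.append((n1, n2, coef))
    return edges


# Toy gene-sets with hypothetical gene symbols, for illustration only.
gene_sets = {
    "taxis": {"G1", "G2", "G3", "G4"},
    "chemotaxis": {"G1", "G2", "G3"},
    "heart development": {"G7", "G8", "G9"},
}

edges = overlap_edges(gene_sets, cutoff=0.5)
for n1, n2, coef in edges:
    print(f"{n1} -- {n2}: {coef:.2f}")
```

In this toy case the two redundant terms end up connected, while the unrelated one stays isolated, which is the grouping effect the network view relies on.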

Enrichment Map

Enrichment Significance Class A (e.g. UP) Class B (e.g. DOWN)

Enrichment Map A B

Application Example
Estrogen treatment of Breast Cancer Cells
Overall Design:
- 2 classes (treated, untreated)
- 3 time points
                   12 hrs   24 hrs   48 hrs
Estrogen-treated   3        3        3
Untreated          3        3        3
We will start off by analyzing only the 24 hours time point, which has the maximal induction, although it is functionally similar to the 12 hours time point

Clusters were manually identified and tagged; they represent highly inter-related gene-sets

Condition Comparison
Enrichment Map can be used to compare enrichments
Use cases:
– Different experiments
– Different condition comparisons within the same experiment
Example: same data-set (Estrogen treatment)
                   12 hrs   24 hrs   48 hrs
Estrogen-treated   3        3        3
Untreated          3        3        3
Now we can analyze the 12 and 24 hours time points together
Notice that we are always comparing the treated to the untreated

Heat-map Feature
Heat-maps can be used to explore gene expression patterns
– Microarray data are typically normalized by row for heat-map visualization:
i. Subtract the mean
ii. Divide by the standard deviation
– This setting is available in Enrichment Map
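The two normalization steps above can be sketched in Python as follows. This is an illustration, not the code used by Enrichment Map. It assumes the sample standard deviation (n - 1 denominator) and maps constant rows to zeros to avoid division by zero; both choices, and the function name, are assumptions.

```python
from statistics import mean, stdev


def row_normalize(matrix):
    """Z-score each row: subtract the row mean, divide by the row SD."""
    normalized = []
    for row in matrix:
        m = mean(row)
        s = stdev(row)  # sample SD (assumption); needs at least 2 values per row
        if s == 0:
            # Constant row: no variation to display, so return zeros.
            normalized.append([0.0 for _ in row])
        else:
            normalized.append([(x - m) / s for x in row])
    return normalized


# Toy expression matrix: rows are genes, columns are samples.
expression = [
    [2.0, 4.0, 6.0],
    [10.0, 10.0, 10.0],
]

z = row_normalize(expression)
for row in z:
    print(row)
```

After this step every gene is on the same scale, so the heat-map colors show the pattern across samples and not the absolute expression level.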

Down Up

Gene Ontology Restructured
Gene Ontology is hierarchical, and terms are highly redundant / inter-related / inter-dependent
Enrichment Maps are not hierarchical, yet they neatly group redundant / inter-related / inter-dependent terms

Enrichment Map: How-to Installation
1. Install Cytoscape
2. Download the Enrichment Map plugin
3. Copy the plugin into the Cytoscape plugin folder
win: C:\Program Files\Cytoscape\plugins
mac: Applications/Cytoscape/plugins

Enrichment Map: How-to Load Data
– Open Cytoscape, load the Enrichment Map plugin from the menu: plugins/Enrichment Map/Load Enrichment Results
1. Format: GSEA
– Use the generic format if you have generated enrichment results outside GSEA; follow the manual for formatting instructions
2. Load the gene-set file (GMT)
3. Load the expression matrix (tab-sep txt)
4. This is optional
5. Change the settings as follows:
– Set the p-value cut-off to
– Set the FDR q-value cut-off to 0.05 (5%)
– Select the overlap coefficient
More at:
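For readers who have not opened a gene-set (GMT) file before, the sketch below shows how such a file could be read in Python. It assumes the usual GMT layout: one gene-set per line, tab-separated, with the set name first, a description second, and the member genes after that. The function name and the toy content are hypothetical.

```python
import io


def read_gmt(handle):
    """Parse GMT lines into a dict mapping gene-set name to a set of genes."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}
    return gene_sets


# Toy GMT content (hypothetical names), read from memory instead of a file.
toy_gmt = io.StringIO(
    "SET_A\tfirst toy set\tG1\tG2\tG3\n"
    "SET_B\tsecond toy set\tG2\tG4\n"
)

gene_sets = read_gmt(toy_gmt)
for name, genes in gene_sets.items():
    print(name, sorted(genes))
```

With a real file, the same function can be given an open file handle, for example `open(path)` with a path of your choice.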

Enrichment Map: How-to Browse results – Enrichment Map is a Cytoscape plugin – We will fully learn how to use Cytoscape in the next lesson – In this lesson, we will just see essential functionalities

Nodes can be dragged and dropped, or deleted

Use this panel to move the view of the network around

Heat-map view: click on nodes to access it. Normalization setting: Row Normalize Data

These parameters can be tuned to include/exclude gene-sets from the map, depending on their enrichment scores
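As a rough illustration of what tuning these cut-offs does, the sketch below keeps only the gene-sets that pass both a p-value and an FDR q-value threshold. It is a simplified model, not the plugin's code. The record layout, the function name and the toy values are assumptions, and the p-value cut-off of 0.005 is a placeholder chosen only for the example.

```python
def filter_gene_sets(results, p_cutoff, q_cutoff=0.05):
    """Keep results whose p-value and FDR q-value are both within the cut-offs."""
    return [r for r in results if r["p"] <= p_cutoff and r["q"] <= q_cutoff]


# Toy enrichment results (hypothetical values).
results = [
    {"name": "SET_A", "p": 0.001, "q": 0.01},
    {"name": "SET_B", "p": 0.004, "q": 0.20},
    {"name": "SET_C", "p": 0.030, "q": 0.04},
]

# 0.005 is an illustrative placeholder, not a recommended setting.
kept = filter_gene_sets(results, p_cutoff=0.005)
print([r["name"] for r in kept])
```

Loosening either threshold adds nodes to the map; tightening it removes them.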

Rerun the layout from: Layout/Cytoscape Layouts/Force Directed Layout/Weighted

Summary of PART 5: Visualization of Gene-set Enrichment
– Gene-set enrichment is valuable to summarize the functional landscape of cellular activity (in our case, gene expression)
– Gene-sets are highly redundant; organizing them as a network greatly facilitates navigation and interpretation
Software: Enrichment Map

Further Readings
Enrichment Analysis (Methods):
Nam D, Kim SY. Gene-set approach for expression pattern analysis. Brief Bioinform May;9(3): PMID:
Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y. Gene-set analysis and reduction. Brief Bioinform Jan;10(1): PMID:
Enrichment Map:
Isserlin R, Merico D, Alikhani-Koupaei R, Gramolini A, Bader GD, Emili A. Pathway analysis of dilated cardiomyopathy using global proteomic profiling and enrichment maps. Proteomics Feb 1. PMID:

Assignment Rules
– Forum discussion:
Of course, you are free to discuss general topics on the forums
Please don't discuss assignment results until I've received them all
You can discuss results of optional assignments on the forum at any time, if you wish
– Send me the following material:
GSEA input files (zipped)
GSEA output files (zipped)
Cytoscape Session
Any ppt or doc elaborating on what you did and answering the questions (please, be concise!)

Assignment: Estrogen Treatment Data
– Run GSEA
Phenotypes: 12 and 24 hrs X treated vs untreated
Differential statistic: t-test
– Explore results using Enrichment Map
Can you reproduce the view in the lesson slides?
What can you infer about the effect of estrogen on the cellular gene expression program?
Use the heat-maps to inspect the differences between 12 and 24 hours: what do you notice? What are the implications for the comparison design?

Assignment: Estrogen Treatment Data: Source
– The original microarray data are available on GEO
– The raw .CEL data were processed using rma in R/Bioconductor
– The rma gene expression matrix and the gene-set (GMT) file are also available at:

Optional Assignments / 1
Do these assignments if you have time and wish to explore more
– Run GSEA with ratio-of-classes
Are the results globally similar? What differences do you notice in the Enrichment Map?
– Make a gene-set (GMT) file with GO and KEGG using R/Bioconductor
Are the enriched KEGG pathways insightful?
– Run Enrichment Map with different values of the overlap coefficient (e.g. 0.4, 0.6)
In our experience, 0.5 is the optimal value for large maps (> 200 gene-sets)
Which setting do you like best? Why?

Optional Assignments / 2
Do these assignments if you have time and wish to explore more
1. Compute the t-test p-value in R, select the top (a) 750, (b) 2000 up- and down-regulated genes
2. Run the enrichment analysis in ConceptGen
3. Visualize the enrichment as a network in ConceptGen
– Can you recognize functional clusters?
– Are there similarities with the Enrichment Map view?
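The assignment asks for this step in R. Purely as an illustration of the logic of step 1, here is a Python sketch that ranks genes by a pooled-variance (Student) t statistic and picks the top up- and down-regulated genes. It assumes every gene has the same number of treated and untreated samples, so the degrees of freedom are identical across genes and ranking by |t| gives the same order as ranking by the two-sided p-value. The function names and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(treated, untreated):
    """Student t statistic with pooled variance (treated minus untreated)."""
    n1, n2 = len(treated), len(untreated)
    pooled_var = (
        (n1 - 1) * variance(treated) + (n2 - 1) * variance(untreated)
    ) / (n1 + n2 - 2)
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    if se == 0:
        return 0.0  # no variation in either group: treated here as "no evidence"
    return (mean(treated) - mean(untreated)) / se


def top_genes(expression, n):
    """Return the n most up-regulated and n most down-regulated genes."""
    scores = {gene: pooled_t(t, u) for gene, (t, u) in expression.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    up = [g for g in ranked if scores[g] > 0][:n]
    down = [g for g in reversed(ranked) if scores[g] < 0][:n]
    return up, down


# Toy data: gene -> (treated values, untreated values), 3 replicates each.
expression = {
    "G1": ([9.0, 9.5, 10.0], [5.0, 5.5, 6.0]),
    "G2": ([5.0, 5.2, 5.1], [5.1, 5.0, 5.2]),
    "G3": ([3.0, 3.5, 4.0], [8.0, 8.5, 9.0]),
}

up, down = top_genes(expression, n=1)
print("up:", up)
print("down:", down)
```

For the assignment itself, `n` would be 750 or 2000, and the resulting gene lists would be the input for ConceptGen.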

At least for this lesson…