Functional Enrichment and Pathway Analysis – I Daniele Merico PhD, Molecular and Cellular Biology Post-doctoral Research Fellow, CCBR, U. of T.
Outline of these lectures Goal Identifying functional “themes” and “patterns” in microarray data Lesson 1: Gene-set Enrichment Analysis Data sources Statistical methods Visualization Lesson 2: Networks and Pathways Networks: data sources and visualization Pathways
PART 1 Introduction How do we relate microarray expression data to biological function?
Analysis Workflow Define the experimental design --> Collect the biological samples --> Generate the expression data --> Identify the Differential Genes --> Identify the Functional Groups
From differential genes to biological functions How do my data relate to known biological functions? Are there specific functions that are characterized by gene expression changes?
Identification of Functional Groups GENE SETS (e.g. Spindle: Gene.1, Gene.2, Gene.3; P53 signaling: Gene.2, Gene.4, Gene.5): Score the set depending on the gene expression of its member genes NETWORKS: Just visual, or identify modules satisfying some joint gene expression and topology requirement PATHWAYS: Just visual, or score the pathways exploiting gene expression and topology
Identification of Functional Groups GENE SETS (e.g. Spindle: Gene.1, Gene.2, Gene.3; P53 signaling: Gene.2, Gene.4, Gene.5): this lecture NETWORKS and PATHWAYS: next week's lecture
PART 2 Gene-set Enrichment Analysis What is gene-set enrichment analysis? How does it help interpret microarray data?
What’s Gene-set Enrichment Analysis? Break down cellular function into gene sets - Every set of genes is associated with a specific cellular function, process, component or pathway Example gene sets: Nuclear Pore (Gene.AAA, Gene.ABA, Gene.ABC); Cell Cycle (Gene.CC1, Gene.CC2, Gene.CC3, Gene.CC4, Gene.CC5); Ribosome (Gene.RP1, Gene.RP2, Gene.RP3, Gene.RP4); P53 signaling (Gene.CC1, Gene.CK1, Gene.PPP)
What’s Gene-set Enrichment Analysis? Microarray data can be related to gene sets in order to mine their functional meaning - Which gene-sets best summarize the gene expression patterns? Example gene sets: Nuclear Pore (Gene.AAA, Gene.ABA, Gene.ABC); Cell Cycle (Gene.CC1, Gene.CC2, Gene.CC3, Gene.CC4, Gene.CC5); Ribosome (Gene.RP1, Gene.RP2, Gene.RP3, Gene.RP4); P53 signaling (Gene.CC1, Gene.CK1, Gene.PPP) Enrichment labels shown on the slide: NOT SIGNIFICANT, NOT SIGNIFICANT, UP, DOWN
What’s Gene-set Enrichment Analysis? Microarray data can be related to gene sets in order to mine their functional meaning - Which gene-sets best summarize the gene expression patterns? This is the meaning of significant enrichment We will see the statistical definition of enrichment in PART 4
PART 3 Gene-set Enrichment: Data What data sources are available for gene-set enrichment analysis?
Gene-set Data Sources Break down cellular function into gene sets Example gene sets: Nuclear Pore (Gene.AAA, Gene.ABA, Gene.ABC); Cell Cycle (Gene.CC1, Gene.CC2, Gene.CC3, Gene.CC4, Gene.CC5); Ribosome (Gene.RP1, Gene.RP2, Gene.RP3, Gene.RP4); P53 signaling (Gene.CC1, Gene.CK1, Gene.PPP) Where can I get these gene-sets? How were the gene-sets compiled? How are they structured?
Gene Ontology (GO) Gene Ontology is: – hierarchically structured: functional categories are organized hierarchically, i.e. as a system of inter-related sets with increasing specificity of scope (parent-child relations) – a controlled vocabulary: functional categories are defined by experts, and then must be used consistently for annotation – for gene product function annotation: gene products (i.e. proteins) are annotated using GO functional categories (“terms”) – general for all species
Gene Ontology: Example Terms are organized hierarchically – Terms at the top are more general, terms at the bottom are narrower in scope – If a protein is annotated as Spindle, the annotation should also be automatically inferred for all ancestors of Spindle (up-propagation)
Gene Ontology: Example CHILD PARENT ABB1 ACAP3 TRAC1 LUC2 POF5 Gene Ontology and the corresponding gene-sets ZUMM C5A75 DUCZ Gene Gene-set
Gene Ontology: Example CHILD PARENT ABB1 ACAP3 TRAC1 LUC2 POF5 Gene Ontology and the corresponding gene-sets ZUMM C5A75 The set corresponding to the CHILD is a subset of the one corresponding to the PARENT DUCZ
Gene Ontology: Partitions GO has three independent partitions, which are not interconnected: – Molecular Function Describes biochemical activities, in-vitro binding specificities, etc… Example: Ligase Activity, Kinase Activity, DNA Binding – Cellular Component Describes parts of the cell Example: Mitochondrion, Spindle Microtubule – Biological Process Describes processes at the intra-cellular and organism level Example: DNA Replication, Apoptosis, Development
MOLECULAR FUNCTION CELLULAR COMPONENT BIOLOGICAL PROCESS Ligase Activity Mitochondrion DNA Replication
Gene Ontology: Partitions MOLECULAR FUNCTION CELLULAR COMPONENT BIOLOGICAL PROCESS First-level children (list)
Gene Ontology Levels Every partition has several levels… ROOT LEVEL-1 LEVEL-2 LEVEL-N
Gene Ontology Levels However, terms at the same level don’t necessarily have the same degree of granularity (i.e. specificity of scope) BIOLOGICAL PROCESS: SIGNALING, IMMUNE SYSTEM PROCESS, PIGMENTATION Different granularity!!!
Gene Ontology Annotations How are genes annotated with GO terms? Human curators go through the literature, mining for gene functions - Different genomic databases take part in this effort - Evidence Codes are used to keep track of the type of evidence for annotation - IEA annotations are directly imported from databases, without human curation Important Note: Primary annotations are not propagated using the ontology; therefore, when you download GO gene-sets, always make sure that up-propagation was done
Gene Ontology Evidence Codes ISS: Inferred from Sequence/Structural Similarity IDA: Inferred from Direct Assay IPI: Inferred from Physical Interaction IMP: Inferred from Mutant Phenotype IGI: Inferred from Genetic Interaction IEP: Inferred from Expression Pattern TAS: Traceable Author Statement NAS: Non-traceable Author Statement IC: Inferred by Curator ND: No Data available IEA: Inferred from electronic annotation More at:
Gene Ontology Evidence Codes How should I use evidence codes? – Quality Filter for Gene-set Enrichment Sometimes IEA (Electronic Annotations) are considered less reliable, and are not used for analysis However, this should be evaluated very carefully and cannot be generalized – Gene Browsing If you are interested in the function of a specific gene, you can check if multiple evidences are available
Annotation Inheritance There are primary and inherited annotations – Primary Annotations Originally defined by curators – Inherited Annotations Back-propagated along the hierarchy Always check if the gene ontology annotation resource you are using includes inherited annotations!
Annotation Inheritance Primary Annotation: Spindle
Annotation Inheritance Inherited Annotations: Microtubule Cytoskeleton Cytoskeletal Part Cytoskeleton Intracellular Organelle Part …
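The inheritance rule above can be sketched in a few lines of Python. This is only an illustration: the parent links in `PARENTS` are a hypothetical toy hierarchy loosely based on the Spindle example (the real GO graph is larger and shaped differently), and `GeneA` / `GeneB` are made-up gene names.

```python
# Hypothetical toy hierarchy: child term -> list of parent terms.
PARENTS = {
    "spindle": ["microtubule cytoskeleton"],
    "microtubule cytoskeleton": ["cytoskeletal part"],
    "cytoskeletal part": ["cytoskeleton", "intracellular organelle part"],
    "cytoskeleton": [],
    "intracellular organelle part": [],
}


def ancestors(term, parents):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(primary, parents):
    """Build term -> gene-set, adding inherited annotations to primary ones."""
    gene_sets = {}
    for gene, terms in primary.items():
        for term in terms:
            # The gene belongs to its primary term and to all ancestors.
            for t in {term} | ancestors(term, parents):
                gene_sets.setdefault(t, set()).add(gene)
    return gene_sets


# Made-up primary annotations (gene -> terms assigned by curators).
primary = {"GeneA": ["spindle"], "GeneB": ["cytoskeleton"]}
gene_sets = propagate(primary, PARENTS)
```

After propagation the set for a child term is a subset of the set for each of its ancestors, which is the property to check for in any downloaded GO gene-set collection.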
Gene Ontology: Multi-function Besides hierarchical term organization, genes can be multi-functional, i.e. annotated by many independent terms – In the following slide we see an excerpt of the annotations of p53 (the “Guardian of the Genome”), as reported by the NCBI database Entrez-Gene
Gene Ontology: Statistics Terms: 29,922 Total (8,688 Molecular Function; 2,689 Cellular Component; 18,545 Biological Process) Annotated Genes (Entrez-Gene): 17,482 Human; 18,028 Mouse
Exploring Gene Ontology: QuickGO
Exploring Gene Ontology: QuickGO New search Essential Data Term in the GO graph
Gene-sets: Beyond Gene Ontology There are many other sources and types of gene-sets: -Pathways (e.g. KEGG) -Protein Families / Domains (e.g. PFAM) -Predicted Targets of Regulators (e.g. MSigDB-c3) -miRNA, Transcription Factors -Protein-protein Interaction Modules -Gene Expression -Up/down after treatment or in relation to disease (e.g. MSigDB-c2) -Co-expression across many conditions (e.g. MSigDB-c4) -Genotype-phenotype association (e.g. DiseaseHub) -Genomic position (e.g. MSigDB-c1)
Pathways and GO Biol. Process How do pathways and processes differ? – From a purely biological perspective, the question is philosophical (still worth speculating…) – From a bioinformatics perspective: A gene is annotated for a GO Biological Process if the curators deem that it (significantly) contributes to the process (which is at the cellular or organ level), according to several lines of evidence Pathways include the “wiring” of genes/gene products, hence they rely on a more intensive curation process Some pathways include large ubiquitous actors (such as the proteasome) that may confound enrichment analysis, whereas these are usually absent from GO processes
A pathway example: the MAPK cascade in KEGG (
Major Gene-set Resources A-Z
Bioconductor – GO: GO.db + org.Xx.eg.db (org.Xx.egGO2ALLEGS) – KEGG: KEGG.db + org.Xx.eg.db (org.Xx.egPATH) – PFAM: PFAMEDE + org.Xx.eg.db (org.Xx.egPFAM) – Note: Xx has to be replaced with the species id {Hs, Mm, Rn, etc…}
DiseaseHub – Phenotype-genotype (OMIM, GAD, HGMD, PharmGKB, CGP, GWAS)
MSigDB – GO (*no IEA), Pathways (KEGG, Biocarta, STKE, GenMAPP, PharmGKB, GEArray), Predicted Targets (miRNA: ?, TF: Transfac), Gene Expression, Genomic Positions
PathwayCommons – Pathways: Reactome, NCI, Cell map
WhichGenes – GO, Pathways (KEGG, Biocarta, Reactome), Genomic Positions, Regulators (miRNA: TargetScan, miRBase), Phenotype-genotype (geneCards Disease, CancerGenes)
Exploring MSigDB (1)
Exploring MSigDB (2) Alzheimer
Exploring MSigDB (3) Select this gene-set
Exploring MSigDB (4)
Exploring MSigDB (5) I now want to see how the gene-set I was interested in overlaps with other gene-sets in the collection (I selected only a few types)
Exploring MSigDB (6) We will see how this p-value is computed and what it means in the next part (enrichment methods)
Gene-set Resources Tips to navigate the resource ocean / 1: – Start your analysis using only a few, reliable sources (e.g. GO, KEGG) GO also has a very large gene coverage – After the first-pass analysis, expand your gene-set collection to types you are interested in – Don’t try everything together from the beginning – Remember quality and clarity! Target predictions may be unreliable Gene expression-derived sets are often hard to interpret
Gene-set Resources Tips to navigate the resource ocean / 2: – If you are confident with R, start from Bioconductor, and supplement the missing pathways shopping around GO: Bioconductor Pathways: Pathway Commons Phenotype-genotype: DiseaseHub Gene Expression: MSigDB Useful scripts available at:
Gene-set Resources Tips to navigate the resource ocean / 2: – If you are not confident with R, and you are a GSEA user, use MSigDB and Pathway Commons From both resources you can download GMT files (GMT is the format used by GSEA) Remember that GO gene-sets in MSigDB do not have IEA-backed annotations – Both Bioconductor and MSigDB incorporate GO inherited annotations (back-propagated)
Summary of PART 3 Gene-set Data Sources – Gene Ontology, a hierarchically structured controlled vocabulary for gene function annotation, is the main source of gene-sets – Other valuable sources are available, such as pathway databases In the next part we will see how to use gene-sets for enrichment analysis…
Now, take a…
And ready to dive again!
PART 4 Gene-set Enrichment: Methods What statistical methods can I use to score gene-sets for enrichment?
Enrichment Test Microarray Experiment (gene expression table; experimental data) + Gene-set Databases (e.g. Spindle, Apoptosis; a priori knowledge + existing experimental data) --> ENRICHMENT TEST --> Enrichment Table --> Interpretation & Hypotheses
Enrichment Test Gene-sets: Apoptosis (FADD, TRADD, CYTC1, BAX, BAXL, CASP9, CASP10, …); Spindle (SPP1, SPP2, CCCP, MTC1, …) Microarray Experiment (gene expression table) + Gene-sets --> Enrichment Table
Enrichment Test How do we get from the Microarray Experiment (gene expression table) to the ENRICHMENT TEST?
Two-class Design Expression Matrix (Class-1, Class-2) --> Genes Ranked by Differential Statistic (e.g. fold change, log(ratio), t-test), from UP to DOWN --> Selection by Threshold (UP, DOWN)
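As a rough illustration of the ranking step in a two-class design, the sketch below scores each gene with a Welch t statistic and sorts genes from most UP to most DOWN. The choice of the Welch variant, the function names and the toy expression values are all assumptions made for this example; the slides only say "t-test". The sketch also assumes every gene has non-zero variance in at least one class.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch t statistic for two groups of replicate measurements."""
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error


def rank_genes(expression, n_class1):
    """Rank genes from most UP to most DOWN in class 1 versus class 2.

    `expression` maps gene ID -> list of values, class-1 samples first.
    """
    scores = {
        gene: welch_t(values[:n_class1], values[n_class1:])
        for gene, values in expression.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up expression values: 3 class-1 samples followed by 3 class-2 samples.
expression = {
    "GeneUp": [8.0, 9.0, 10.0, 2.0, 3.0, 4.0],
    "GeneFlat": [5.0, 6.0, 7.0, 5.5, 6.0, 6.5],
    "GeneDown": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
}
ranked = rank_genes(expression, n_class1=3)
```

A threshold on the score (or on its p-value) then gives the UP and DOWN gene lists, while the full ranked list is what whole-distribution methods consume.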
Time-course Design Expression Matrix (t1, t2, t3, …, tn) --> Gene Clusters (e.g. K-means, K-medoids, SOM)
Other Designs Expression Matrix Significant Genes E.g.: - ANOVA - Linear Model
Enrichment Test Gene-set Databases Microarray Experiment (gene expression table) Gene-set Significant genes (e.g UP) Overlap between significant genes and gene-set Background genes (array genes not significant)
Enrichment Test Significant genes (e.g UP) Overlap between significant genes and gene-set Background genes (array genes not significant) Is this overlap larger than expected by random sampling the array genes? Random sample of array genes
Enrichment Test Significant genes (e.g. UP) Overlap between significant genes and gene-set Background genes (array genes not significant) Is this overlap larger than expected by random sampling the array genes? Statistical Model: Fisher’s Exact Test Fisher’s Exact Test does not require actually performing the random sampling; it is based on a theoretical null-hypothesis distribution (Hypergeometric Distribution)
Fisher’s Exact Test For Gene-set Enrichment 2x2 table (a b / c d) --> Enrichment P-value a, b, c, d are the sizes of the four subsets (each subset has a different color) MEMO: P-value ~ 0 --> significant; P-value ~ 1 --> not significant © by Black Box Inc. R: help(fisher.test)
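To open the black box a little, here is a minimal sketch of the one-sided (over-representation) test computed directly from the hypergeometric distribution. The assignment of a, b, c, d to the four subsets is an assumption, since the slide only shows them by color: a = significant and in the gene-set, b = significant and not in the gene-set, c = not significant and in the gene-set, d = not significant and not in the gene-set. The function name is hypothetical.

```python
from math import comb


def enrichment_p_value(a, b, c, d):
    """One-sided Fisher's exact test p-value for over-representation.

    Probability of seeing an overlap of at least `a` genes when drawing
    (a + b) genes at random from the (a + b + c + d) array genes, of which
    (a + c) belong to the gene-set.
    """
    n_significant = a + b
    n_in_set = a + c
    total = a + b + c + d
    largest_overlap = min(n_significant, n_in_set)
    tail = sum(
        comb(n_in_set, k) * comb(total - n_in_set, n_significant - k)
        for k in range(a, largest_overlap + 1)
    )
    return tail / comb(total, n_significant)


# Toy counts: 3 of 4 significant genes fall in a gene-set of size 4,
# on an array of 8 genes.
p = enrichment_p_value(3, 1, 1, 3)
```

In R, the same one-sided test should correspond to calling `fisher.test` on the 2x2 matrix with `alternative = "greater"`.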
Fisher’s Exact Test For Gene-set Overlap We can also use Fisher’s Exact Test to evaluate the overlap between gene-sets from databases Going back to MSigDB… Now we know where these p-values come from!
Web Resources for Fisher’s Exact Test ConceptGen Note: free account required DAVID Note: a thorough description of how to use it is given in this paper: Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4(1): PMID:
Beyond Fisher’s Test Genes ranked from UP to DOWN --> ENRICHMENT TEST: either threshold-dependent (e.g. Fisher’s Test) or whole-distribution (e.g. GSEA)
Beyond Fisher’s Test Whole-distribution methods have been shown to be more stable and statistically powerful than threshold-dependent methods, which suffer from: – No “natural” value for the threshold – Different results at different threshold settings – Loss of information due to thresholding: no resolution between significant signals with different strengths; weak signals neglected --> Use whole-distribution methods whenever possible
GSEA Enrichment Test / 1 Two-class comparison: Expression Matrix (Class-1, Class-2) --> Ranked Gene List, using e.g. fold change, log(ratio), t-test, SAM Correlation to phenotype: Expression Matrix + Quantitative Phenotype --> Ranked Gene List, using e.g. Pearson correlation
GSEA Enrichment Test / 2 Ranked Gene List + Gene-set Databases (e.g. Spindle, Apoptosis) --> GSEA --> Enrichment Table (Gene-set, p-value, FDR) The p-value depends only on the single gene-set performance The FDR depends on the performance of all gene-sets
GSEA: Method Steps 1. Calculate the ES score 2. Generate the ES distribution for the null hypothesis using permutations (see permutation settings) 3. Calculate the empirical p-value 4. Calculate the FDR Subramanian A, Tamayo P, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A Oct 25;102(43)
GSEA: Method ES score calculation Where are the gene-set genes located in the ranked list? Is their distribution random, or is there an enrichment at either end?
GSEA: Method ES score calculation Every present gene (black vertical bar) gives a positive contribution, every absent gene (no vertical bar) gives a negative contribution to the running ES score
GSEA: Method ES score calculation MAX running ES score --> Final ES Score
GSEA: Method ES score calculation High ES score High local enrichment
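The running-sum idea can be sketched as follows. This is a simplified, unweighted variant written for illustration only: every hit adds 1/(number of hits) and every miss subtracts 1/(number of misses), whereas GSEA as published additionally weights hits by the ranking statistic. The final score is taken here as the largest deviation from zero, so that enrichment at the bottom of the list gives a negative score. Function and gene names are hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score over a ranked gene list."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_misses = len(ranked_genes) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("gene-set must cover some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Present gene: positive contribution; absent gene: negative one.
        running += 1.0 / n_hits if is_hit else -1.0 / n_misses
        if abs(running) > abs(best):
            best = running
    return best


ranked_genes = ["g1", "g2", "g3", "g4", "g5", "g6"]  # most UP ... most DOWN
es_top = enrichment_score(ranked_genes, {"g1", "g2"})
es_bottom = enrichment_score(ranked_genes, {"g5", "g6"})
```

A gene-set concentrated at the top of the list reaches a high positive score, one concentrated at the bottom a strongly negative score, and one scattered at random stays close to zero.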
GSEA: Method Empirical p-value estimation (for every gene-set) 1.Generate null-hypothesis distribution from randomized data (see permutation settings) Distribution of ES from N permutations (e.g. 2000) Number of instances ES Score
GSEA: Method Empirical p-value estimation (for every gene-set) 1. Generate null-hypothesis distribution from randomized data (see permutation settings) 2. Estimate empirical p-value Real ES score value Distribution of ES from N permutations (e.g. 2000) Randomized with ES ≥ real: 4 / 2000 --> Empirical p-value = 0.002
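The arithmetic of the empirical p-value is simple enough to write out. The sketch below assumes the null ES values have already been produced by some permutation scheme; the numbers are made up so that 4 of 2000 randomized scores reach the real one. It is the plain fraction described above, not a reproduction of every detail of the GSEA implementation.

```python
def empirical_p_value(real_es, null_es):
    """Fraction of permutation scores at least as large as the real score."""
    at_least_as_large = sum(1 for es in null_es if es >= real_es)
    return at_least_as_large / len(null_es)


# Made-up null distribution: 2000 permutations, 4 of which reach the real ES.
null_es = [0.1] * 1996 + [0.9] * 4
p_value = empirical_p_value(0.8, null_es)
```

Note that the smallest non-zero p-value that can be resolved is 1/N, which is one reason to use a few thousand permutations.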
GSEA Settings: Permutation Permutation settings have important implications which we will not discuss in detail Practical suggestions: – When biological replicates are very similar within classes and classes are well separated --> gene permutation – When biological replicates tend to be dissimilar, or stratified according to hidden experimental factors --> use other whole-distribution enrichment methods of self-contained type (e.g. SAM-GS)
GSEA Settings: Gene-set Filter Gene-sets for enrichment analysis are usually filtered by size – Large gene-sets are undesired, if they are derived from Gene Ontology or other functional resources, as they usually correspond to uninformative concepts (e.g. Regulation of Biopolymer Catabolism) – Small gene-sets are undesired as their statistics are quite noisy, and they may decrease the FDR of other sets – See Using GSEA section for the specific value of size filtering settings
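A size filter is easy to apply to any gene-set collection before running an enrichment test. In the sketch below the function name and the default bounds (10 and 500) are illustrative assumptions, not prescribed settings, and the gene-sets are made up.

```python
def filter_by_size(gene_sets, min_size=10, max_size=500):
    """Keep only gene-sets whose size lies within [min_size, max_size]."""
    return {
        name: genes
        for name, genes in gene_sets.items()
        if min_size <= len(genes) <= max_size
    }


# Made-up collection with one set that is too small and one too large.
gene_sets = {
    "tiny_set": {f"g{i}" for i in range(3)},
    "medium_set": {f"g{i}" for i in range(50)},
    "huge_set": {f"g{i}" for i in range(2000)},
}
kept = filter_by_size(gene_sets)
```

It usually makes sense to count only the genes that are actually present on the array when measuring the size of each set.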
Using GSEA Installation Launch Desktop Application from: Notes: – if you have sufficient RAM (*), go for the 1 GB option – running GSEA will take some time (2-5 hrs depending on the system and the memory setting) – you need an internet connection to run GSEA (*) WIN: check using CTRL+ALT+DEL/Task Manager MAC: check using Applications/Utilities/Activity Monitor
Using GSEA Data Format There are three data files you will need: – Gene-set (.GMT) – Gene Expression Table (.txt) – Gene Expression Phenotypes (.CLS) The format requirements follow. More on GSEA data formats:
Using GSEA Data Format: gene-set file (.GMT) Syntax (one gene-set per line): Name [\tab] Description [\tab] Gene.1 [\tab] Gene.2 [\tab] … Notes: Either use the gene-set ID for the Name (e.g. GO ID) and the gene-set full name for the Description Or use the gene-set full name for the Name and the source database for the Description Example:
regulation of DNA recombination [\tab] GO:
transition metal ion transport [\tab] GO:
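Because GMT is a plain tab-separated format, it can be read with a few lines of code. The parser below is a minimal sketch: the function name, the gene-set names and the gene IDs in the example are all made up, and lines with fewer than two fields are not handled.

```python
def parse_gmt(text):
    """Parse GMT text into {name: (description, [gene IDs])}."""
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        name, description = fields[0], fields[1]
        genes = [gene for gene in fields[2:] if gene]  # drop empty fields
        gene_sets[name] = (description, genes)
    return gene_sets


# Made-up two-line GMT content: name, description, then the gene IDs.
example = (
    "SET_A\thypothetical description A\t1001\t1002\t1003\n"
    "SET_B\thypothetical description B\t1002\t2001\n"
)
gene_sets = parse_gmt(example)
```

The gene IDs read from the GMT file must use the same identifier system as the expression table, otherwise no gene will match.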
Using GSEA Data Format: gene expression table file (.txt) Syntax (table rows): Name [\tab] Description [\tab] Sample.1 [\tab] Sample.2 [\tab] … Notes: Use the gene ID for the Name (e.g. EntrezGene ID) and the gene symbol and/or full name for the Description I recommend using EntrezGene IDs, for a number of reasons Gene IDs must be consistent between the GMT and this file Example:
Using GSEA Data Format: expression phenotypes file (.CLS) Use space as separator
9 3 1 (Number of samples, Number of classes, Always 1)
# Tg-A Tg-B WT (Class Labels)
Tg-A Tg-A Tg-A Tg-B Tg-B Tg-B WT WT WT (Phenotype labels for all samples in the gene expression table)
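The three lines of a .CLS file can be generated from the list of per-sample labels. The helper below is a hypothetical sketch (its name is not part of GSEA); it takes the class names in order of first appearance and assumes the labels are listed in the same order as the sample columns of the expression table.

```python
def make_cls(labels):
    """Build the text of a .CLS file from per-sample phenotype labels."""
    classes = list(dict.fromkeys(labels))  # unique, in order of appearance
    header = f"{len(labels)} {len(classes)} 1"
    return "\n".join([header, "# " + " ".join(classes), " ".join(labels)]) + "\n"


labels = ["Tg-A"] * 3 + ["Tg-B"] * 3 + ["WT"] * 3
cls_text = make_cls(labels)
```

Writing `cls_text` to a file with the .cls extension gives the phenotype file for the nine-sample example above.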
Using GSEA Load the data
Using GSEA Run the analysis – Parameter setting / 1 – Load gene expression table here – Load gene-set (.GMT) file here – Number of permutations: 2000 – Load phenotype file (.CLS) here – Permutation type: gene-set – If your gene expression table has probe IDs already matching the .GMT file, set this to FALSE – If your gene expression table has probe IDs already matching the .GMT file, you don’t need this
Using GSEA Run the analysis – Parameter setting / 2 Differential statistic. Use t-test (or signal-to-noise) if you have at least 3 replicates. 10 is usually good. Keep between 7-8 and is usually good. Keep between 500 and 800.
Using GSEA GSEA Pre-ranked – If you wish to use a statistic for differential expression other than those offered by GSEA, you can use the Pre-ranked mode More on GSEA pre-ranked data format: #RNK:_Ranked_list_file_format_.28.2A.rnk.29
Summary of PART 4 Methods for Gene-set Enrichment – Fisher’s Exact Test can be used for any given set of experimental genes – When possible, use GSEA to achieve greater power – Both GSEA and Fisher’s Exact Test require scoring genes for significance/differential expression; how this is done depends on the microarray design
Now, take a…
And ready to dive again!
PART 5 Gene-set Enrichment: Visualization How to use enrichment analysis to functionally map cellular activity. Or, everything finally coming together.
Gene-set Enrichment: Redundancy Problem Many redundant gene-sets – Gene Ontology has a very large number of gene-sets, often with slight differences – Different pathway databases have different yet overlapping definitions of pathways – Globally, it is useful to grasp the overlap relations between enriched gene-sets --> we need a visualization framework going beyond the enrichment table
Example enrichment table (columns: GO.id, GO.name, p.value, cover, cover.rat, Deg.mdn, Deg.iqr; most values were lost in extraction)
GO: taxis 2.18E
GO: chemotaxis 2.18E
GO: adaptive immune response based on somatic recombination 7.10E
GO: adaptive immune response 7.10E
GO: leukocyte mediated immunity
GO: B cell mediated immunity
GO: myeloid cell differentiation
GO: immune effector process
GO: regulation of phagocytosis
GO: positive regulation of phagocytosis
GO: lymphocyte mediated immunity
GO: growth factor binding
GO: protein polymerization
GO: endoplasmic reticulum membrane
GO: immunoglobulin mediated immune response
GO: heart development
GO: response to bacterium
GO: regulation of endocytosis
GO: acute inflammatory response
GO: positive regulation of endocytosis
GO: myeloid leukocyte activation
GO: amino acid biosynthetic process
GO: regulation of inflammatory response
GO: activation of immune response
GO: positive regulation of immune system process
GO: positive regulation of immune response
GO: antigen processing and presentation
GO: regulation of immune system process
GO: regulation of immune response
GO: negative regulation of enzyme activity
GO: phagocytosis
GO: myeloid leukocyte differentiation
GO: humoral immune response
GO: lymphocyte activation
GO: leukocyte chemotaxis
GO: negative regulation of protein kinase activity
GO: negative regulation of transferase activity
GO: transforming growth factor beta receptor signaling pathw
GO: insulin-like growth factor binding
GO: T cell activation
GO: humoral immune response mediated by circulating immunogl
GO: cytosolic ribosome (sensu Eukaryota)
GO: protein amino acid N-linked glycosylation
GO: positive regulation of multicellular organismal process
GO: chemokine receptor binding
GO: chemokine activity
GO: Wnt receptor signaling pathway
Terms called out on the slide: adaptive immune response based on somatic recombination; adaptive immune response; leukocyte mediated immunity; B cell mediated immunity; myeloid cell differentiation; immune effector process; regulation of phagocytosis; positive regulation of phagocytosis; lymphocyte mediated immunity
Gene-set Enrichment: Redundancy Problem How to handle the redundancy problem? – Statistical solutions: Correct for inter-redundancy and prioritize the most enriched gene-sets Don’t always work well, not available for all tests --> not discussed here – Visualization solution: visualize gene-set overlap as a network Enrichment Map (Cytoscape plugin)
Enrichment Map
Enrichment Significance Class A (e.g. UP) Class B (e.g. DOWN)
Enrichment Map A B
Application Example Estrogen treatment of Breast Cancer Cells Overall Design: - 2 classes (treated, untreated) - 3 time points (12 hrs, 24 hrs, 48 hrs) - 3 samples per class at each time point We will start off by analyzing only the 24 hours time point, which has the maximal induction, although it is functionally similar to the 12 hours time-point
Clusters were manually identified and tagged; they represent highly inter-related gene-sets
Condition Comparison Enrichment Map can be used to compare enrichments Use cases: – Different experiments – Different condition comparisons within the same experiment Example: same data-set (Estrogen treatment; 3 Estrogen-treated and 3 Untreated samples at each of 12 hrs, 24 hrs, 48 hrs) Now we can analyze together the 12 and 24 hours time-points Notice that we are always comparing the treated to the untreated
Heat-map Feature Heat-maps can be used to explore gene expression patterns – Microarray data are typically normalized by-row for heat-map visualization i.Subtract the mean ii.Divide by the standard deviation – This setting is available in Enrichment Map
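The two normalization steps amount to a per-gene z-score. The sketch below is an illustration with made-up values; it uses the sample standard deviation (n - 1 denominator), which is an assumption, since the slide does not say which estimator Enrichment Map uses.

```python
from statistics import mean, stdev


def row_normalize(row):
    """Subtract the row mean, then divide by the row standard deviation."""
    row_mean = mean(row)
    row_sd = stdev(row)  # sample standard deviation (n - 1 denominator)
    if row_sd == 0:
        return [0.0 for _ in row]  # flat gene: no variation to display
    return [(value - row_mean) / row_sd for value in row]


# Made-up expression matrix: one list of sample values per gene.
matrix = {"GeneA": [1.0, 2.0, 3.0], "GeneB": [100.0, 200.0, 300.0]}
normalized = {gene: row_normalize(row) for gene, row in matrix.items()}
```

After normalization GeneA and GeneB show the same pattern even though their absolute expression levels differ by two orders of magnitude, which is what makes the heat-map colors comparable across genes.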
Down Up
Gene Ontology Restructured Gene Ontology is hierarchical, and terms are highly redundant / inter-related / inter-dependent Enrichment Maps are not hierarchical, yet they neatly group redundant / inter-related / inter-dependent terms
Enrichment Map How-to Installation 1. Install Cytoscape 2. Download Enrichment Map plugin 3. Copy the plugin into the Cytoscape plugin folder win C:\Program Files\Cytoscape\plugins mac Applications/Cytoscape/plugins
Enrichment Map: How-to Load Data – Open Cytoscape, load the Enrichment Map plugin from the menu: plugins/ Enrichment Map/Load Enrichment Results 1.Format: GSEA – Use the generic if you have generated enrichment results outside GSEA; follow the manual for formatting instructions 2.Load the gene-set file (GMT) 3.Load the expression matrix (tab-sep txt) 4.This is optional 5.Change the settings as follows: – Set the p-value cut-off to – Set the FDR q-value cut-off to 0.05 (5%) – Select the overlap coefficient More at:
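The overlap coefficient selected in the settings above measures how much two gene-sets share, relative to the smaller of the two. The sketch below shows the calculation and how a cut-off turns it into network edges; the gene-sets are made up (gene names borrowed from the Gene Ontology example, with a hypothetical assignment to sets) and the 0.5 cut-off is just an example value.

```python
from itertools import combinations


def overlap_coefficient(set_a, set_b):
    """Size of the intersection divided by the size of the smaller set."""
    smaller = min(len(set_a), len(set_b))
    if smaller == 0:
        return 0.0
    return len(set_a & set_b) / smaller


# Made-up gene-sets; which genes belong to which set is hypothetical.
gene_sets = {
    "parent": {"ABB1", "ACAP3", "TRAC1", "LUC2", "POF5"},
    "child": {"ABB1", "ACAP3"},
    "other": {"LUC2", "ZUMM", "DUCZ"},
}

cutoff = 0.5
edges = [
    (name_a, name_b)
    for name_a, name_b in combinations(gene_sets, 2)
    if overlap_coefficient(gene_sets[name_a], gene_sets[name_b]) >= cutoff
]
```

A child set fully contained in its parent gets a coefficient of 1.0, so parent-child GO terms end up connected in the map, while sets sharing only a small fraction of the smaller set stay unconnected.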
Enrichment Map: How-to Browse results – Enrichment Map is a Cytoscape plugin – We will fully learn how to use Cytoscape in the next lesson – In this lesson, we will just see essential functionalities
Nodes can be dragged and dropped, or deleted
Use this panel to move the view of the network around
Heat-map view Click on nodes to access Normalization setting: Row Normalize Data
These parameters can be tuned to include/exclude gene-sets from the map, depending on their enrichment scores
Rerun the layout from: Layout/Cytoscape Layouts/ Force Directed Layout/ Weighted
Summary of PART 5 Visualization of Gene-set Enrichment – Gene-set enrichment is valuable to summarize the functional landscape of cellular activity (in our case, gene expression) – Gene-sets are highly redundant; organizing them as a network greatly facilitates navigation and interpretation Software: Enrichment Map
Further Readings Enrichment Analysis (Methods): Nam D, Kim SY. Gene-set approach for expression pattern analysis. Brief Bioinform May;9(3): PMID: Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y. Gene-set analysis and reduction. Brief Bioinform Jan;10(1): PMID: Enrichment Map: Isserlin R, Merico D, Alikhani-Koupaei R, Gramolini A, Bader GD, Emili A. Pathway analysis of dilated cardiomyopathy using global proteomic profiling and enrichment maps. Proteomics Feb 1. PMID:
Assignment Rules – Forum discussion: Of course, you are free to discuss general topics on the forums Please don’t discuss assignment results until I’ve received them all You can discuss results of optional assignments on the forum any time, if you wish – Send me the following material: GSEA input files (zipped) GSEA output files (zipped) Cytoscape Session Any ppt or doc elaborating on what you did and answering question (please, be concise!)
Assignment Estrogen Treatment Data – Run GSEA Phenotypes: 12 and 24 hrs X treated vs untreated Differential statistic: t-test – Explore results using Enrichment Map Can you reproduce the view in the lesson slides? What can you infer about estrogen effect on the cellular gene expression program? Use the heat-maps to inspect the differences between 12 and 24 hours: what do you notice? What are the implications for the comparison design?
Assignment Estrogen Treatment Data: Source – The original microarray data are available on GEO – The raw.CEL data were processed using rma in R/Bioconductor – The rma gene expression matrix and the gene-set (GMT) file are also available at:
Optional Assignments / 1 Do these assignments if you have time and you wish to explore more – Run GSEA with ratio-of-classes Are the results globally similar? What differences do you notice in the Enrichment Map? – Make a gene-set (GMT) file with GO and KEGG using R/Bioconductor Are the enriched KEGG pathways insightful? – Run Enrichment Map with different values of the overlap coefficient (e.g. 0.4, 0.6) In our experience, 0.5 is the optimal value for large maps (> 200 gs) Which setting do you like the best? Why?
Optional Assignments / 2 Do these assignments if you have time and you wish to explore more 1. Compute the t-test p-value in R, select the top (a) 750, (b) 2000 up- and down-regulated genes 2. Run the enrichment analysis in ConceptGen 3. Visualize the enrichment as a network in ConceptGen – Can you recognize functional clusters? – Are there similarities with the Enrichment Map view?
At least for this lesson…