Functional Enrichment and Pathway Analysis – I
Daniele Merico, PhD, Molecular and Cellular Biology
Post-doctoral Research Fellow, CCBR, U. of T.



Outline of these lectures
Goal: identifying functional “themes” and “patterns” in microarray data
Lesson 1: Gene-set Enrichment Analysis
– Data sources
– Statistical methods
– Visualization
Lesson 2: Networks and Pathways
– Networks: data sources and visualization
– Pathways

PART 1 Introduction How do we relate microarray expression data to biological function?

Analysis Workflow
1. Define the experimental design
2. Collect the biological samples
3. Generate the expression data
4. Identify the Differential Genes
5. Identify the Functional Groups

From differential genes to biological functions
– How do my data relate to known biological functions?
– Are there specific functions that are characterized by gene expression changes?


Identification of Functional Groups
– GENE SETS (e.g. Spindle: Gene.1, Gene.2, Gene.3; P53 signaling: Gene.2, Gene.4, Gene.5): score the set depending on the gene expression of its member genes
– NETWORKS: just visual, or identify modules satisfying some joint gene expression and topology requirement
– PATHWAYS: just visual, or score the pathways exploiting gene expression and topology

Identification of Functional Groups
– GENE SETS: this lecture
– NETWORKS and PATHWAYS: next week's lecture

PART 2 Gene-set Enrichment Analysis What is gene-set enrichment analysis? How does it help interpret microarray data?

What’s Gene-set Enrichment Analysis?
Break down cellular function into gene sets
- Every set of genes is associated with a specific cellular function, process, component or pathway
Example gene-sets:
- Nuclear Pore: Gene.AAA, Gene.ABA, Gene.ABC
- Cell Cycle: Gene.CC1, Gene.CC2, Gene.CC3, Gene.CC4, Gene.CC5
- Ribosome: Gene.RP1, Gene.RP2, Gene.RP3, Gene.RP4
- P53 signaling: Gene.CC1, Gene.CK1, Gene.PPP

What’s Gene-set Enrichment Analysis?
Microarray data can be related to gene sets in order to mine their functional meaning
- Which gene-sets best summarize the gene expression patterns?
- In the figure, each of the example gene-sets is labelled with its result: UP, DOWN or NOT SIGNIFICANT
This is the meaning of significant enrichment
We will see the “statistical” definition of enrichment in PART 4

PART 3 Gene-set Enrichment: Data What data sources are available for gene-set enrichment analysis?

Gene-set Data Sources
Break down cellular function into gene sets (e.g. Nuclear Pore, Cell Cycle, Ribosome, P53 signaling)
– Where can I get these gene-sets?
– How were the gene-sets compiled?
– How are they structured?

Gene Ontology (GO)
Gene Ontology is:
– hierarchically structured: functional categories are organized hierarchically, i.e. as a system of inter-related sets with increasing specificity of scope (parent-child relations)
– a controlled vocabulary: functional categories are defined by experts, and must then be used consistently for annotation
– meant for gene product function annotation: gene products (e.g. proteins) are annotated using GO functional categories (“terms”)
– general for all species

Gene Ontology: Example
Terms are organized hierarchically
– Terms at the top are more general, terms at the bottom are narrower in scope
– If a protein is annotated as Spindle, the annotation should also be automatically inferred for all ancestors of Spindle (up-propagation)

Gene Ontology: Example
[Figure: excerpt of the GO graph showing a PARENT term and its CHILD term, together with the corresponding gene-sets; example genes: ABB1, ACAP3, TRAC1, LUC2, POF5, ZUMM, C5A75, DUCZ]
The set corresponding to the CHILD is a subset of the one corresponding to the PARENT

Gene Ontology: Partitions
GO has three independent partitions, which are not interconnected:
– Molecular Function: describes biochemical activities, in-vitro binding specificities, etc. Example: Ligase Activity, Kinase Activity, DNA Binding
– Cellular Component: describes parts of the cell. Example: Mitochondrion, Spindle Microtubule
– Biological Process: describes processes at the intra-cellular and organism level. Example: DNA Replication, Apoptosis, Development

Gene Ontology: Partitions
[Figure: the three partitions with one example term each (MOLECULAR FUNCTION: Ligase Activity; CELLULAR COMPONENT: Mitochondrion; BIOLOGICAL PROCESS: DNA Replication), followed by the list of their first-level children]

Gene Ontology Levels Every partition has several levels… ROOT LEVEL-1 LEVEL-2 LEVEL-N

Gene Ontology Levels
However, terms at the same level don’t necessarily have the same degree of granularity (i.e. specificity of scope)
Example: SIGNALING, IMMUNE SYSTEM PROCESS and PIGMENTATION are all direct children of BIOLOGICAL PROCESS, yet they have very different granularity

Gene Ontology Annotations
How are genes annotated with GO terms?
Human curators go through the literature, mining it for gene functions
- Different genomic databases take part in this effort
- Evidence Codes are used to keep track of the type of evidence behind each annotation
- IEA annotations are directly imported from databases, without human curation
Important Note: primary annotations are not propagated along the ontology; therefore, when you download GO gene-sets, always make sure that up-propagation was done

Gene Ontology Evidence Codes
- ISS: Inferred from Sequence/Structural Similarity
- IDA: Inferred from Direct Assay
- IPI: Inferred from Physical Interaction
- IMP: Inferred from Mutant Phenotype
- IGI: Inferred from Genetic Interaction
- IEP: Inferred from Expression Pattern
- TAS: Traceable Author Statement
- NAS: Non-traceable Author Statement
- IC: Inferred by Curator
- ND: No Data available
- IEA: Inferred from Electronic Annotation
More at: http://www.geneontology.org/GO.evidence.shtml

Gene Ontology Evidence Codes
How should I use evidence codes?
– Quality Filter for Gene-set Enrichment
  - Sometimes IEA (Electronic Annotations) are considered less reliable, and are not used for analysis
  - However, this should be evaluated very carefully and cannot be generalized
– Gene Browsing
  - If you are interested in the function of a specific gene, you can check whether multiple lines of evidence are available

Annotation Inheritance
There are primary and inherited annotations
– Primary Annotations: originally defined by curators
– Inherited Annotations: back-propagated along the hierarchy
Always check if the Gene Ontology annotation resource you are using includes inherited annotations!

Annotation Inheritance Primary Annotation: Spindle

Annotation Inheritance
Inherited Annotations: Microtubule Cytoskeleton; Cytoskeletal Part; Cytoskeleton; Intracellular Organelle Part; …
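The up-propagation step can be sketched in a few lines of Python. This is only an illustration: the child-to-parent map below is a hypothetical simplification built from the term names on the Spindle slides (it is not the real GO graph), and the function names `ancestors` and `propagate` are made up for this example.

```python
from collections import defaultdict

# Hypothetical child -> parents map (assumption, not the real GO structure).
PARENTS = {
    "spindle": ["microtubule cytoskeleton"],
    "microtubule cytoskeleton": ["cytoskeletal part"],
    "cytoskeletal part": ["cytoskeleton", "intracellular organelle part"],
    "cytoskeleton": [],
    "intracellular organelle part": [],
}


def ancestors(term, parents):
    """Return every ancestor of a term (the term itself is excluded)."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(primary, parents):
    """Build term -> gene-set, adding each gene to all ancestors of its primary terms."""
    gene_sets = defaultdict(set)
    for gene, terms in primary.items():
        for term in terms:
            gene_sets[term].add(gene)
            for ancestor in ancestors(term, parents):
                gene_sets[ancestor].add(gene)
    return dict(gene_sets)


# Primary annotations as a curator might record them (made-up genes).
primary_annotations = {"Gene.1": ["spindle"], "Gene.2": ["cytoskeleton"]}
gene_sets = propagate(primary_annotations, PARENTS)
```

After propagation, the gene-set of a child term is always a subset of the gene-set of each of its ancestors, which is the property to check for in any downloaded GO gene-set collection.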

Gene Ontology: Multi-function
Besides the hierarchical organization of terms, genes can be multi-functional, i.e. annotated by many independent terms
– In the following slide we see an excerpt of the annotations of p53 (the “Guardian of the Genome”), as reported by the NCBI database Entrez-Gene

http://www.ncbi.nlm.nih.gov/gene/7157

Gene Ontology: Statistics (http://www.geneontology.org/GO.downloads.ontology.shtml)
Terms:
- 29,922 Total Terms
- 8,688 Molecular Function
- 2,689 Cellular Component
- 18,545 Biological Process
Annotated Genes (Entrez-Gene):
- 17,482 Human
- 18,028 Mouse

Exploring Gene Ontology: QuickGO http://www.ebi.ac.uk/QuickGO/

Exploring Gene Ontology: QuickGO New search Essential Data Term in the GO graph

Gene-sets: Beyond Gene Ontology
There are many other sources and types of gene-sets:
- Pathways (e.g. KEGG)
- Protein Families / Domains (e.g. PFAM)
- Predicted Targets of Regulators (e.g. MSigDB-c3)
  - miRNA, Transcription Factors
- Protein-protein Interaction Modules
- Gene Expression
  - Up/down after treatment or in relation to disease (e.g. MSigDB-c2)
  - Co-expression across many conditions (e.g. MSigDB-c4)
- Genotype-phenotype association (e.g. DiseaseHub)
- Genomic position (e.g. MSigDB-c1)

Pathways and GO Biol. Process
How do pathways and processes differ?
– From a purely biological perspective, the question is philosophical (still worth speculating…)
– From a bioinformatics perspective:
  - A gene is annotated for a GO Biological Process if the curators deem that it (significantly) contributes to the process (which is at the cellular or organ level), according to several lines of evidence
  - Pathways include the “wiring” of genes/gene products, hence they rely on a more intensive curation process
  - Some pathways include large ubiquitous actors (such as the proteasome) that may confound enrichment analysis, whereas these are usually absent from GO processes

A pathway example: the MAPK cascade in KEGG (http://www.genome.jp/kegg/pathway/hsa/hsa04010.html)

Major Gene-set Resources A-Z
Bioconductor
– GO: GO.db + org.Xx.eg.db (org.Xx.egGO2ALLEGS)
– KEGG: KEGG.db + org.Xx.eg.db (org.Xx.egPATH)
– PFAM: PFAMEDE + org.Xx.eg.db (org.Xx.egPFAM)
– Note: Xx has to be replaced with the species id {Hs, Mm, Rn, etc…}
DiseaseHub (http://zldev.ccbr.utoronto.ca/~ddong/diseaseHub/)
– Phenotype-genotype (OMIM, GAD, HGMD, PharmGKB, CGP, GWAS)
MSigDB (http://broad.harvard.edu/gsea/msigdb/index.jsp)
– GO (*no IEA), Pathways (KEGG, Biocarta, STKE, GenMAPP, PharmGKB, GEArray), Predicted Targets (miRNA: ?, TF: Transfac), Gene Expression, Genomic Positions
PathwayCommons (http://www.pathwaycommons.org/pc-snapshot/gsea/by_species/)
– Pathways: Reactome, NCI, Cell map
WhichGenes (www.whichgenes.org)
– GO, Pathways (KEGG, Biocarta, Reactome), Genomic Positions, Regulators (miRNA: TargetScan, miRBase), Phenotype-genotype (geneCards Disease, CancerGenes)

Exploring MSigDB (1) http://broad.harvard.edu/gsea/msigdb/index.jsp

Exploring MSigDB (2) Alzheimer

Exploring MSigDB (3) Select this gene-set

Exploring MSigDB (4)

Exploring MSigDB (5) I now want to see how the gene-set I was interested in overlaps with other gene-sets in the collection (I selected only a few types)

Exploring MSigDB (6) We will see how this p-value is computed and what it means in the next part (enrichment methods)

Gene-set Resources
Tips to navigate the resource ocean / 1:
– Start your analysis using only a few, reliable sources (e.g. GO, KEGG)
  - GO also has a very large gene coverage
– After the first-pass analysis, expand your gene-set collection to the types you are interested in
– Don’t try everything together from the beginning
– Remember quality and clarity!
  - Target predictions may be unreliable
  - Gene expression-derived sets are often hard to interpret

Gene-set Resources
Tips to navigate the resource ocean / 2:
– If you are confident with R, start from Bioconductor, and supplement the missing resources by shopping around
  - GO: Bioconductor
  - Pathways: Pathway Commons
  - Phenotype-genotype: DiseaseHub
  - Gene Expression: MSigDB
Useful scripts available at:
http://baderlab.org/DanieleMerico/Code/Bioc2GMT
http://baderlab.org/DanieleMerico/Code/Read_GMT

Gene-set Resources
Tips to navigate the resource ocean / 2:
– If you are not confident with R, and you are a GSEA user, use MSigDB and Pathway Commons
  - From both resources you can download GMT files (GMT is the format used by GSEA)
  - Remember that GO gene-sets in MSigDB do not have IEA-backed annotations
– Both Bioconductor and MSigDB incorporate GO inherited annotations (back-propagated)

Summary of PART 3
Gene-set Data Sources
– Gene Ontology, a hierarchically structured controlled vocabulary for gene function annotation, is the main source of gene-sets
– Other valuable sources are available, such as pathway databases
In the next part we will see how to use gene-sets for enrichment analysis…

Now, take a…

And ready to dive again!

PART 4 Gene-set Enrichment: Methods What statistical methods can I use to score gene-sets for enrichment?

Enrichment Test
Microarray Experiment (gene expression table) + Gene-set Databases --> ENRICHMENT TEST --> Enrichment Table
– Microarray Experiment: experimental data
– Gene-set Databases: a priori knowledge + existing experimental data
– Enrichment Table: interpretation & hypotheses
Example Enrichment Table (gene-set, p-value):
- Spindle: 0.00001
- Apoptosis: 0.00025
Example gene-sets:
- Apoptosis: FADD, TRADD, CYTC1, BAX, BAXL, CASP9, CASP10, …
- Spindle: SPP1, SPP2, CCCP, MTC1, …
How is the enrichment test performed?

Two-class Design
Expression Matrix (Class-1, Class-2) --> Genes Ranked by Differential Statistic --> Selection by Threshold (UP, DOWN)
Differential statistic, e.g.: - Fold change - Log (ratio) - t-test

Time-course Design
Expression Matrix (t1, t2, t3, …, tn) --> Gene Clusters
Clustering method, e.g.: - K-means - K-medoids - SOM

Other Designs
Expression Matrix --> Significant Genes
Statistical method, e.g.: - ANOVA - Linear Model
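For the two-class design, the ranking and thresholding steps can be sketched as follows. This is a minimal illustration under stated assumptions: the expression values are made up, the differential statistic is the log2 ratio of class means (one of the options listed above), and the threshold of 1.0 is an arbitrary choice for the example.

```python
import math
from statistics import mean


def log2_ratio(class1_values, class2_values):
    """Log2 ratio of the class means (class 2 over class 1)."""
    return math.log2(mean(class2_values) / mean(class1_values))


# Hypothetical expression matrix: gene -> (class-1 replicates, class-2 replicates).
expression = {
    "Gene.1": ([100.0, 110.0, 90.0], [400.0, 420.0, 380.0]),
    "Gene.2": ([200.0, 210.0, 190.0], [205.0, 195.0, 200.0]),
    "Gene.3": ([800.0, 820.0, 780.0], [100.0, 90.0, 110.0]),
}

# Score every gene, then rank from most up-regulated to most down-regulated.
scores = {gene: log2_ratio(c1, c2) for gene, (c1, c2) in expression.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

# Selection by threshold (the cut-off is an assumption for this example).
threshold = 1.0
up = [gene for gene in ranked if scores[gene] >= threshold]
down = [gene for gene in ranked if scores[gene] <= -threshold]
```

The `up` and `down` lists are what a threshold-dependent test takes as input, while the full `ranked` list is what a whole-distribution method uses.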

Enrichment Test
Inputs: Microarray Experiment (gene expression table) and Gene-set Databases
– Significant genes (e.g. UP)
– Background genes (array genes that are not significant)
– Gene-set
– Overlap between significant genes and gene-set
Is this overlap larger than expected by randomly sampling the array genes?
Statistical Model: Fisher’s Exact Test
Fisher’s Exact Test does not require actually performing the random sampling; it is based on a theoretical null-hypothesis distribution (Hypergeometric Distribution)
http://en.wikipedia.org/wiki/Fisher's_exact_test

Fisher’s Exact Test For Gene-set Enrichment
Contingency table:
a  b
c  d
a, b, c, d are the sizes of the four subsets (each subset has a different color in the figure)
a, b, c, d --> [© by Black Box Inc.] --> Enrichment P-value
MEMO: P-value ~ 0 --> significant; P-value ~ 1 --> not significant
R: help(fisher.test)
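To open the black box a little, here is a sketch of the one-sided (enrichment) p-value computed directly from the hypergeometric distribution, using only the Python standard library. The function and parameter names are hypothetical, and the inputs are expressed as counts (overlap, number of significant genes, gene-set size, number of array genes) rather than as the a, b, c, d cells of the slide.

```python
from math import comb


def enrichment_p_value(overlap, n_significant, set_size, n_array):
    """One-sided hypergeometric p-value: P(overlap >= observed).

    overlap       -- significant genes that are also in the gene-set
    n_significant -- number of significant genes
    set_size      -- number of gene-set genes present on the array
    n_array       -- total number of array genes
    """
    max_overlap = min(n_significant, set_size)
    total = comb(n_array, n_significant)
    # Sum the probabilities of the observed overlap and of all larger ones.
    tail = sum(
        comb(set_size, i) * comb(n_array - set_size, n_significant - i)
        for i in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy example: 10 array genes, a 3-gene set, 3 significant genes, all 3 in the set.
p_full_overlap = enrichment_p_value(3, 3, 3, 10)
# With no overlap required, every outcome counts, so the p-value is 1.
p_no_overlap = enrichment_p_value(0, 3, 3, 10)
```

In the toy example only one of the 120 possible draws of 3 genes out of 10 reproduces the full overlap, so the p-value is 1/120. In R, `fisher.test` with `alternative = "greater"` on the corresponding 2x2 table should give the same one-sided result.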

Fisher’s Exact Test For Gene-set Overlap We can also use Fisher’s Exact Test to evaluate the overlap between gene-sets from databases Going back to MSigDB… Now we know where these p-values come from!

Web Resources for Fisher’s Exact Test
ConceptGen
– http://conceptgen.ncibi.org/core/conceptGen/index.jsp
– Note: free account required
DAVID
– http://david.abcc.ncifcrf.gov/summary.jsp
– Note: a thorough description of how to use it is given in this paper: Huang da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4(1):44-57. PMID: 19131956

Beyond Fisher’s Test
Two families of ENRICHMENT TEST:
– Threshold-dependent (e.g. Fisher’s Test): uses only the genes selected as UP or DOWN
– Whole-distribution (e.g. GSEA): uses the whole ranked gene list, from UP to DOWN

Beyond Fisher’s Test
Whole-distribution methods have been shown to be more stable and statistically powerful, because threshold-dependent methods suffer from:
– No “natural” value for the threshold
– Different results at different threshold settings
– Loss of information due to thresholding
  - No resolution between significant signals with different strengths
  - Weak signals neglected
--> Use whole-distribution methods whenever possible

GSEA Enrichment Test / 1
The Ranked Gene List is obtained from the Expression Matrix by either:
– Two-class comparison (Class-1 vs. Class-2), e.g.: - Fold change - Log (ratio) - t-test - SAM
– Correlation to a Quantitative Phenotype, e.g.: - Pearson correlation

GSEA Enrichment Test / 2
Ranked Gene List + Gene-set Databases --> GSEA --> Enrichment Table
Gene-set	p-value	FDR
Spindle	0.0001	0.01
Apoptosis	0.025	0.09
– The p-value depends only on the performance of the single gene-set
– The FDR depends on the performance of all gene-sets

GSEA: Method
Steps
1. Calculate the ES score
2. Generate the ES distribution for the null hypothesis using permutations (see permutation settings)
3. Calculate the empirical p-value
4. Calculate the FDR
Subramanian A, Tamayo P, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A. 2005 Oct 25;102(43)

GSEA: Method ES score calculation Where are the gene-set genes located in the ranked list? Is their distribution random, or is there an enrichment at either end?

GSEA: Method ES score calculation Every present gene (black vertical bar) gives a positive contribution, every absent gene (no vertical bar) gives a negative contribution to the running ES score

GSEA: Method ES score calculation MAX running ES score --> Final ES Score

GSEA: Method ES score calculation High ES score <--> high local enrichment
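The running score can be sketched as follows. Note the assumptions: this is a simplified, unweighted variant in which every gene-set member adds the same positive step and every other gene subtracts the same negative step (the published GSEA method weights the positive steps by the ranking statistic), and the final score is taken as the maximum deviation of the running score from zero, so that enrichment at the bottom of the list yields a negative ES. The function name is hypothetical.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score of a gene-set over a ranked list."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_misses = len(ranked_genes) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("the gene-set must cover some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Present genes push the running score up, absent genes pull it down.
        running += 1.0 / n_hits if is_hit else -1.0 / n_misses
        if abs(running) > abs(best):
            best = running
    return best


ranked_list = ["A", "B", "C", "D", "E", "F"]  # made-up genes, most UP first
es_top = enrichment_score(ranked_list, {"A", "B"})     # set concentrated at the top
es_bottom = enrichment_score(ranked_list, {"E", "F"})  # set concentrated at the bottom
```

With these step sizes the running score always returns to zero at the end of the list, so only the position of the gene-set members along the ranking determines the final ES.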

GSEA: Method
Empirical p-value estimation (for every gene-set)
1. Generate the null-hypothesis distribution from randomized data (see permutation settings)
   [Histogram: distribution of ES from N permutations (e.g. 2000); x-axis: ES Score; y-axis: Number of instances]
2. Estimate the empirical p-value by comparing the real ES score value with this distribution
   Randomized with ES ≥ real: 4 / 2000 --> Empirical p-value = 0.002
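The second step is just a counting exercise, as the following sketch shows. The null ES values here are placeholders chosen to reproduce the 4 / 2000 example above; in a real analysis they would come from the permutations. The function name is hypothetical.

```python
def empirical_p_value(real_es, null_es_values):
    """Fraction of randomized ES values at least as large as the real one."""
    at_least_as_large = sum(1 for es in null_es_values if es >= real_es)
    return at_least_as_large / len(null_es_values)


# Placeholder null distribution: 4 of the 2000 randomized scores reach the real ES.
null_distribution = [0.9] * 4 + [0.1] * 1996
p_value = empirical_p_value(0.8, null_distribution)
```

One practical consequence: the smallest non-zero p-value that can be resolved is 1 / N, so the number of permutations sets the resolution of the test.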

GSEA Settings: Permutation
Permutation settings have important implications, which we will not discuss in detail
Practical suggestions:
– When biological replicates are very similar within classes and classes are well separated --> gene permutation
– When biological replicates tend to be dissimilar, or stratified according to hidden experimental factors --> use other whole-distribution enrichment methods of self-contained type (e.g. SAM-GS)

GSEA Settings: Gene-set Filter
Gene-sets for enrichment analysis are usually filtered by size
– Large gene-sets are undesired, if they are derived from Gene Ontology or other functional resources, as they usually correspond to uninformative concepts (e.g. Regulation of Biopolymer Catabolism)
– Small gene-sets are undesired, as their statistics are quite noisy and they may negatively affect the FDR of other sets
– See the Using GSEA section for the specific values of the size filtering settings
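A size filter of this kind can be sketched as below. Two assumptions are made for the example: the size is counted only over genes present on the array, and the default bounds (10 and 600) are placeholder values, not settings prescribed by any tool. The function name is hypothetical.

```python
def filter_gene_sets(gene_sets, array_genes, min_size=10, max_size=600):
    """Keep gene-sets whose size, restricted to array genes, lies within the bounds."""
    array_genes = set(array_genes)
    kept = {}
    for name, genes in gene_sets.items():
        on_array = set(genes) & array_genes
        if min_size <= len(on_array) <= max_size:
            kept[name] = on_array
    return kept


# Made-up collection: integer IDs stand in for gene identifiers.
array = set(range(700))
collection = {
    "too small": {1, 2},
    "informative": set(range(20)),
    "too large": set(range(1000)),
}
filtered = filter_gene_sets(collection, array)
```

Only the 20-gene set survives here: the 2-gene set is too noisy, and the large one still has 700 genes on the array.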

Using GSEA
Installation
Launch the Desktop Application from: http://www.broadinstitute.org/gsea/msigdb/downloads.jsp
Notes:
– if you have sufficient RAM (*), go for the 1 GB option
– running GSEA will take some time (2-5 hrs depending on the system and the memory setting)
– you need an internet connection to run GSEA
(*) WIN: check using CTRL+ALT+DEL / Task Manager; MAC: check using Applications/Utilities/Activity Monitor

Using GSEA
Data Format
There are three data files you will need:
– Gene-set (.GMT)
– Gene Expression Table (.txt)
– Gene Expression Phenotypes (.CLS)
The format requirements follow.
More on GSEA data formats: http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats

Using GSEA
Data Format: gene-set file (.GMT)
Syntax (one gene-set per line): <Name> [\tab] <Description> [\tab] <Gene 1> [\tab] <Gene 2> …
Notes:
– Either use the gene-set ID for the Name (e.g. GO ID) and the gene-set full name for the Description
– Or use the gene-set full name for the Name and the source database for the Description
Example:
regulation of DNA recombination GO:0000018 6046413458
transition metal ion transport GO:0000041 475538540
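Reading a GMT file takes only a few lines. The sketch below parses GMT-formatted lines into a dictionary; the function name `parse_gmt` and the gene IDs in the example are hypothetical, and the example lines are built in memory rather than read from a file.

```python
def parse_gmt(lines):
    """Parse GMT lines into {name: {"description": str, "genes": [str, ...]}}."""
    gene_sets = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip empty lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError("a GMT line needs at least a name and a description")
        name, description, genes = fields[0], fields[1], fields[2:]
        gene_sets[name] = {
            "description": description,
            "genes": [gene for gene in genes if gene],  # drop empty trailing fields
        }
    return gene_sets


# Made-up gene IDs, for illustration only.
example_lines = [
    "GO:0000018\tregulation of DNA recombination\t1001\t1002\t1003\n",
    "GO:0000041\ttransition metal ion transport\t2001\t2002\n",
]
parsed = parse_gmt(example_lines)
```

To read from disk, the same function can be given an open file handle, e.g. `with open(path) as handle: parsed = parse_gmt(handle)`, where `path` points to your own GMT file.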

Using GSEA
Data Format: gene expression table file (.txt)
Syntax (tab-separated table): <Name> [\tab] <Description> [\tab] <Sample 1> [\tab] <Sample 2> …
Notes:
– Use the gene ID for the Name (e.g. EntrezGene ID) and the gene symbol and/or full name for the Description
– I recommend using EntrezGene IDs, for a number of reasons
– Gene IDs must be consistent between the GMT and this file
Example: [table shown as an image]

Using GSEA
Data Format: expression phenotypes file (.CLS)
Line 1: 9 3 1
– Number of samples, number of classes, always 1
Line 2: # Tg-A Tg-B WT
– Class labels
Line 3: Tg-A Tg-A Tg-A Tg-B Tg-B Tg-B WT WT WT
– Phenotype labels for all samples in the gene expression table
Use space as separator
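A quick consistency check of a CLS file can save a failed run. The sketch below parses the three lines described above and verifies that the declared counts match the labels; the function name `parse_cls` is hypothetical, and the example text reproduces the slide.

```python
def parse_cls(text):
    """Parse a categorical CLS file; return (class_names, sample_labels)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) != 3:
        raise ValueError("expected exactly three non-empty lines")

    n_samples, n_classes, _always_one = (int(x) for x in lines[0].split())
    class_names = lines[1].lstrip("#").split()
    sample_labels = lines[2].split()

    # The header must agree with what follows.
    if len(class_names) != n_classes:
        raise ValueError("number of class labels does not match the header")
    if len(sample_labels) != n_samples:
        raise ValueError("number of sample labels does not match the header")
    return class_names, sample_labels


cls_text = """9 3 1
# Tg-A Tg-B WT
Tg-A Tg-A Tg-A Tg-B Tg-B Tg-B WT WT WT
"""
class_names, sample_labels = parse_cls(cls_text)
```

The order of the sample labels has to match the order of the sample columns in the gene expression table.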

Using GSEA Load the data

Using GSEA
Run the analysis – Parameter setting / 1
[Screenshot callouts]
– Load gene expression table here
– Load gene-set (.GMT) file here
– Load phenotype file (.CLS) here
– Values set in the screenshot: 2000; gene-set
– If your gene expression table has probe IDs already matching the .GMT file, set this to FALSE
– If your gene expression table has probe IDs already matching the .GMT file, you don’t need this

Using GSEA
Run the analysis – Parameter setting / 2
– Differential statistic: use t-test (or signal-to-noise) if you have at least 3 replicates
– Minimum gene-set size: 10 is usually good; keep between 7-8 and 15
– Maximum gene-set size: 600 is usually good; keep between 500 and 800

Using GSEA
GSEA Pre-ranked
– If you wish to use a statistic for differential expression other than those offered by GSEA, you can do so using the Pre-ranked mode
More on GSEA pre-ranked data format: http://www.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#RNK:_Ranked_list_file_format_.28.2A.rnk.29

Summary of PART 4
Methods for Gene-set Enrichment
– Fisher’s Exact Test can be used for any given set of experimental genes
– When possible, use GSEA to achieve greater power
– Both GSEA and Fisher’s Exact Test require genes to be scored for significance/differentiality; how this is done depends on the microarray design

Now, take a…

And ready to dive again!

PART 5 Gene-set Enrichment: Visualization How to use enrichment analysis to functionally map cellular activity. Or, everything finally coming together.

Gene-set Enrichment: Redundancy Problem
Many redundant gene-sets
– Gene Ontology has a very large number of gene-sets, often with slight differences
– Different pathway databases have different yet overlapping definitions of pathways
– Globally, it is useful to grasp the overlap relations between enriched gene-sets
--> we need a visualization framework going beyond the enrichment table

GO.id	GO.name	p.value	cover	cover.rat	Deg.mdn	Deg.iqr
GO:0042330	taxis	2.18E-06	23	0.056930693	54.94499375	9.139238998
GO:0006935	chemotaxis	2.18E-06	23	0.060209424	54.94499375	9.139238998
GO:0002460	adaptive immune response based on somatic recombination	7.10E-05	25	0.111111111	57.32306955	16.97054864
GO:0002250	adaptive immune response	7.10E-05	25	0.111111111	57.32306955	16.97054864
GO:0002443	leukocyte mediated immunity	0.000419328	23	0.097046414	58.27890582	15.58333739
GO:0019724	B cell mediated immunity	0.000683758	20	0.114285714	57.84161096	15.03496347
GO:0030099	myeloid cell differentiation	0.000691589	24	0.089219331	62.22171598	10.35284833
GO:0002252	immune effector process	0.000775626	31	0.090116279	58.27890582	23.86214773
GO:0050764	regulation of phagocytosis	0.000792138	8	0.2	53.54786293	5.742849971
GO:0050766	positive regulation of phagocytosis	0.000792138	8	0.216216216	53.54786293	5.742849971
GO:0002449	lymphocyte mediated immunity	0.00087216	22	0.101851852	57.84161096	16.13171132
GO:0019838	growth factor binding	0.000913285	15	0.068181818	83.0405088	10.58734852
GO:0051258	protein polymerization	0.00108876	17	0.080952381	57.97543252	17.31639968
GO:0005789	endoplasmic reticulum membrane	0.001178198	18	0.036072144	64.02284752	12.05209158
GO:0016064	immunoglobulin mediated immune response	0.001444464	19	0.113095238	58.27890582	15.58333739
GO:0007507	heart development	0.001991562	26	0.052313883	84.02538284	18.60761304
GO:0009617	response to bacterium	0.002552999	10	0.027173913	52.75249873	23.23104637
GO:0030100	regulation of endocytosis	0.002658555	11	0.099099099	56.38041132	16.02486889
GO:0002526	acute inflammatory response	0.002660742	24	0.103004292	57.80098769	24.94311116
GO:0045807	positive regulation of endocytosis	0.002903401	9	0.147540984	54.94499375	6.769909171
GO:0002274	myeloid leukocyte activation	0.002969661	7	0.077777778	54.94499375	16.07042339
GO:0008652	amino acid biosynthetic process	0.003502921	7	0.017241379	45.19797271	31.18248579
GO:0050727	regulation of inflammatory response	0.004999055	7	0.084337349	54.94499375	7.737346076
GO:0002253	activation of immune response	0.00500146	23	0.116161616	60.29679989	18.41103376
GO:0002684	positive regulation of immune system process	0.006581245	27	0.111570248	60.29679989	22.05051447
GO:0050778	positive regulation of immune response	0.006581245	27	0.113924051	60.29679989	22.05051447
GO:0019882	antigen processing and presentation	0.007244488	7	0.029661017	54.94499375	16.58797889
GO:0002682	regulation of immune system process	0.007252134	29	0.099656357	61.05645008	22.65935206
GO:0050776	regulation of immune response	0.007252134	29	0.102112676	61.05645008	22.65935206
GO:0043086	negative regulation of enzyme activity	0.008017022	9	0.040723982	53.28031076	17.48904224
GO:0006909	phagocytosis	0.008106069	10	0.080645161	55.66270253	12.47536747
GO:0002573	myeloid leukocyte differentiation	0.008174948	10	0.092592593	62.86577216	9.401887596
GO:0006959	humoral immune response	0.008396095	16	0.044568245	55.05654091	18.94209565
GO:0046649	lymphocyte activation	0.009044401	29	0.059917355	61.92213317	21.03553355
GO:0030595	leukocyte chemotaxis	0.009707319	7	0.101449275	56.33116709	6.945510559
GO:0006469	negative regulation of protein kinase activity	0.010782155	7	0.046357616	52.22863516	12.58524145
GO:0051348	negative regulation of transferase activity	0.010782155	7	0.04516129	52.22863516	12.58524145
GO:0007179	transforming growth factor beta receptor signaling pathw	0.012630825	13	0.071038251	83.49440788	12.63256309
GO:0005520	insulin-like growth factor binding	0.012950071	9	0.097826087	81.41963394	7.528247832
GO:0042110	T cell activation	0.013410548	20	0.064516129	59.77891783	26.06174863
GO:0002455	humoral immune response mediated by circulating immunogl	0.016780163	10	0.125	54.70766244	14.2572143
GO:0005830	cytosolic ribosome (sensu Eukaryota)	0.016907351	8	0.01843318	61.68933284	7.814673781
GO:0006487	protein amino acid N-linked glycosylation	0.01791078	7	0.044585987	56.50635337	6.780726553
GO:0051240	positive regulation of multicellular organismal process	0.017931228	31	0.096573209	62.2953212	23.86214773
GO:0042379	chemokine receptor binding	0.018849666	12	0.095238095	55.13915015	19.08254406
GO:0008009	chemokine activity	0.018849666	12	0.096774194	55.13915015	19.08254406
GO:0016055	Wnt receptor signaling pathway	0.020088086	18	0.04400978	85.47935979	20.92435897

Gene-sets highlighted in the enrichment table above: adaptive immune response based on somatic recombination; adaptive immune response; leukocyte mediated immunity; B cell mediated immunity; myeloid cell differentiation; immune effector process; regulation of phagocytosis; positive regulation of phagocytosis; lymphocyte mediated immunity

Gene-set Enrichment: Redundancy Problem
How to handle the redundancy problem?
– Statistical solutions: correct for inter-redundancy and prioritize the most enriched gene-sets. These don't always work well and are not available for all tests, so they are not discussed here.
– Visualization solution: visualize gene-set overlap as a network. Enrichment Map (Cytoscape plugin): http://baderlab.org/Software/EnrichmentMap

Enrichment Map

Enrichment Significance Class A (e.g. UP) Class B (e.g. DOWN)

Enrichment Map A B

Application Example: Estrogen treatment of Breast Cancer Cells
Overall Design:
– 2 classes (treated, untreated)
– 3 time points
                  12 hrs  24 hrs  48 hrs
Estrogen-treated  3       3       3
Untreated         3       3       3
We will start by analyzing only the 24-hour time point, which has the maximal induction, although it is functionally similar to the 12-hour time point

Clusters were manually identified and tagged; they represent highly inter-related gene-sets

Condition Comparison
Enrichment Map can be used to compare enrichments
Use cases:
– Different experiments
– Different condition comparisons within the same experiment
Example: same data-set (Estrogen treatment)
                  12 hrs  24 hrs  48 hrs
Estrogen-treated  3       3       3
Untreated         3       3       3
Now we can analyze the 12- and 24-hour time points together. Notice that we are always comparing the treated to the untreated

Heat-map Feature
Heat-maps can be used to explore gene expression patterns
– Microarray data are typically normalized by row for heat-map visualization:
  i. Subtract the mean
  ii. Divide by the standard deviation
– This setting is available in Enrichment Map
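The two row-normalization steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `row_normalize` and the toy matrix are made up, and the sample standard deviation (n - 1 denominator) is an assumption, since the slides do not say which estimator Enrichment Map uses.

```python
from statistics import mean, stdev


def row_normalize(matrix):
    """Z-score each row (gene) across its columns (samples)."""
    normalized = []
    for row in matrix:
        m = mean(row)
        s = stdev(row)  # assumption: sample standard deviation (n - 1)
        if s == 0:
            # A flat row carries no pattern; map it to zeros instead of dividing by 0.
            normalized.append([0.0 for _ in row])
        else:
            normalized.append([(x - m) / s for x in row])
    return normalized


# Toy expression matrix (made-up values): rows are genes, columns are samples.
expression = [
    [8.0, 8.5, 9.0, 11.0, 11.5, 12.0],
    [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
]
normalized = row_normalize(expression)
```

After this step every non-flat row has mean 0 and standard deviation 1, so genes with very different absolute intensities can share one color scale in the heat-map.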

Down Up

Gene Ontology Restructured Gene Ontology is hierarchical, and terms are highly redundant / inter- related / inter-dependent Enrichment Maps are not hierarchical, yet they neatly group redundant / inter-related / inter- dependent terms

Enrichment Map: How-to Installation
1. Install Cytoscape: http://www.cytoscape.org/download.php?file=cyto2_6_3
2. Download the Enrichment Map plugin: http://baderlab.org/Software/EnrichmentMap#Plugin_Download
3. Copy the plugin into the Cytoscape plugin folder
   win: C:\Program Files\Cytoscape\plugins
   mac: Applications/Cytoscape/plugins

Enrichment Map: How-to Load Data
– Open Cytoscape, load the Enrichment Map plugin from the menu: plugins/Enrichment Map/Load Enrichment Results
1. Format: GSEA
   – Use the generic format if you have generated enrichment results outside GSEA; follow the manual for formatting instructions
2. Load the gene-set file (GMT)
3. Load the expression matrix (tab-separated txt)
4. This is optional
5. Change the settings as follows:
   – Set the p-value cut-off to 0.001
   – Set the FDR q-value cut-off to 0.05 (5%)
   – Select the overlap coefficient
More at: http://baderlab.org/Software/EnrichmentMap/UserManual
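To make the settings in step 5 concrete, here is a minimal Python sketch of what the cut-offs and the overlap coefficient do conceptually. It is not Enrichment Map's own code: the gene-set names, gene identifiers, and the 0.5 edge threshold are hypothetical, and whether the plugin uses strict or non-strict comparisons is an assumption. The overlap coefficient itself is the standard definition, the size of the intersection divided by the size of the smaller set.

```python
from itertools import combinations


def overlap_coefficient(a, b):
    """Return |A ∩ B| / min(|A|, |B|) for two gene-sets."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


# Hypothetical enrichment results: name -> (p-value, FDR q-value, member genes).
results = {
    "SET_A": (0.0002, 0.01, {"G1", "G2", "G3", "G4"}),
    "SET_B": (0.0005, 0.03, {"G3", "G4", "G5"}),
    "SET_C": (0.0400, 0.20, {"G1", "G2"}),
    "SET_D": (0.0001, 0.02, {"G6", "G7"}),
}

P_CUTOFF = 0.001      # p-value cut-off from the slide
Q_CUTOFF = 0.05       # FDR q-value cut-off from the slide
OVERLAP_CUTOFF = 0.5  # illustrative edge threshold (assumption)

# Nodes: gene-sets passing both significance cut-offs.
nodes = {
    name: genes
    for name, (p, q, genes) in results.items()
    if p <= P_CUTOFF and q <= Q_CUTOFF
}

# Edges: pairs of surviving gene-sets whose overlap reaches the threshold.
edges = []
for left, right in combinations(sorted(nodes), 2):
    score = overlap_coefficient(nodes[left], nodes[right])
    if score >= OVERLAP_CUTOFF:
        edges.append((left, right, score))
```

In this toy input SET_C fails the significance cut-offs and is dropped, SET_A and SET_B share two of SET_B's three genes and are connected, and SET_D remains as an isolated node.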

Enrichment Map: How-to Browse results – Enrichment Map is a Cytoscape plugin – We will fully learn how to use Cytoscape in the next lesson – In this lesson, we will just see essential functionalities

Nodes can be dragged and dropped, or deleted

Use this panel to move the view of the network around

Heat-map view: click on nodes to access it. Normalization setting: Row Normalize Data

These parameters can be tuned to include/exclude gene-sets from the map, depending on their enrichment scores

Rerun the layout from: Layout/Cytoscape Layouts/ Force Directed Layout/ Weighted

Summary of PART 5: Visualization of Gene-set Enrichment
– Gene-set enrichment is valuable to summarize the functional landscape of cellular activity (in our case, gene expression)
– Gene-sets are highly redundant; organizing them as a network greatly facilitates navigation and interpretation
Software: Enrichment Map

Further Readings
Enrichment Analysis (Methods):
– Nam D, Kim SY. Gene-set approach for expression pattern analysis. Brief Bioinform. 2008 May;9(3):189-97. PMID: 18202032
– Dinu I, Potter JD, Mueller T, Liu Q, Adewale AJ, Jhangri GS, Einecke G, Famulski KS, Halloran P, Yasui Y. Gene-set analysis and reduction. Brief Bioinform. 2009 Jan;10(1):24-34. PMID: 18836208
Enrichment Map:
– Isserlin R, Merico D, Alikhani-Koupaei R, Gramolini A, Bader GD, Emili A. Pathway analysis of dilated cardiomyopathy using global proteomic profiling and enrichment maps. Proteomics. 2010 Feb 1. PMID: 20127684

Assignment Rules
– Forum discussion:
  Of course, you are free to discuss general topics on the forums
  Please don't discuss assignment results until I've received them all
  You can discuss results of optional assignments on the forum at any time, if you wish
– Send me (daniele.merico@gmail.com) the following material:
  GSEA input files (zipped)
  GSEA output files (zipped)
  Cytoscape session
  Any ppt or doc elaborating on what you did and answering the questions (please, be concise!)

Assignment: Estrogen Treatment Data
– Run GSEA
  Phenotypes: 12 and 24 hrs X treated vs untreated
  Differential statistic: t-test
– Explore results using Enrichment Map
  Can you reproduce the view in the lesson slides?
  What can you infer about the effect of estrogen on the cellular gene expression program?
  Use the heat-maps to inspect the differences between 12 and 24 hours: what do you notice? What are the implications for the comparison design?

Assignment: Estrogen Treatment Data: Source
– The original microarray data are available on GEO: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE11352
– The raw .CEL data were processed using rma in R/Bioconductor
– The rma gene expression matrix and the gene-set (GMT) file are also available at: http://baderlab.org/Software/EnrichmentMap#Sample_Data_Download

Optional Assignments / 1
Do these assignments if you have time and wish to explore more
– Run GSEA with ratio-of-classes
  Are the results globally similar? What differences do you notice in the Enrichment Map?
– Make a gene-set (GMT) file with GO and KEGG using R/Bioconductor
  Are the enriched KEGG pathways insightful?
– Run Enrichment Map with different values of the overlap coefficient (e.g. 0.4, 0.6)
  In our experience, 0.5 is the optimal value for large maps (> 200 gene-sets)
  Which setting do you like best? Why?
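For the GMT exercise, it may help to see the file layout itself. The sketch below is in Python rather than R/Bioconductor and only illustrates the format: each GMT line holds a gene-set name, a description, and then the member genes, all tab-separated. The helper names, the gene identifiers, and the second gene-set are made up for illustration.

```python
def to_gmt_line(name, description, genes):
    """Format one gene-set as a GMT line: name, description, genes (tab-separated)."""
    return "\t".join([name, description, *sorted(genes)])


def parse_gmt_line(line):
    """Split one GMT line back into (name, description, set of genes)."""
    name, description, *genes = line.rstrip("\n").split("\t")
    return name, description, set(genes)


# Placeholder gene identifiers; real files would carry actual gene symbols or IDs.
gene_sets = {
    "GO:0006909": ("phagocytosis", {"GENE_A", "GENE_B", "GENE_C"}),
    "HYPOTHETICAL_KEGG_SET": ("made-up KEGG pathway", {"GENE_B", "GENE_D"}),
}

gmt_lines = [
    to_gmt_line(name, description, genes)
    for name, (description, genes) in gene_sets.items()
]
gmt_text = "\n".join(gmt_lines) + "\n"

# Round trip: parsing the text recovers the original gene-sets.
parsed = {}
for line in gmt_text.splitlines():
    name, description, genes = parse_gmt_line(line)
    parsed[name] = (description, genes)
```

Writing `gmt_text` to a file with a `.gmt` extension would give a file with the same line structure as the gene-set file used in the lesson.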

Optional Assignments / 2
Do these assignments if you have time and wish to explore more
1. Compute the t-test p-value in R, then select the top (a) 750, (b) 2000 up- and down-regulated genes
2. Run the enrichment analysis in ConceptGen
3. Visualize the enrichment as a network in ConceptGen
– Can you recognize functional clusters?
– Are there similarities with the Enrichment Map view?
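Step 1 can be sketched as follows. Note the substitutions: this is Python rather than R, and it ranks genes by the Welch t statistic instead of the t-test p-value, because the standard library has no t-distribution; the two orderings are usually close but need not be identical. Gene names, values, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(treated, untreated):
    """Welch t statistic: positive when the treated mean is higher."""
    se = sqrt(variance(treated) / len(treated) + variance(untreated) / len(untreated))
    if se == 0:
        return 0.0
    return (mean(treated) - mean(untreated)) / se


def top_up_down(expression, n):
    """Return the n most up-regulated and n most down-regulated genes by t statistic."""
    scores = {gene: welch_t(t, u) for gene, (t, u) in expression.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    up = [g for g in ranked[:n] if scores[g] > 0]
    down = [g for g in ranked[::-1][:n] if scores[g] < 0]
    return up, down


# Made-up log-expression values: gene -> (treated replicates, untreated replicates).
expression = {
    "GENE_UP1": ([10.0, 10.2, 9.9], [7.0, 7.1, 6.9]),
    "GENE_UP2": ([8.0, 8.3, 7.9], [7.5, 7.4, 7.6]),
    "GENE_FLAT": ([6.0, 6.1, 5.9], [6.0, 5.9, 6.1]),
    "GENE_DOWN1": ([4.0, 4.1, 3.9], [7.0, 7.2, 6.8]),
}

up, down = top_up_down(expression, 2)
```

With real data, `n` would be 750 or 2000 as in the assignment, and the resulting up and down lists would be the input for the enrichment step.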

At least for this lesson…
