NCRI Cancer Conference November 1, 2015
Module 2: Pathway and network analysis Irina Kalatskaya, PhD Ontario Institute for Cancer Research, Canada
NCRI Workshop 2015 bioinformatics.ca
Content
– Introduction to pathway and network analysis in cancer genomics.
– Sources of pathway and network information: GO biological process, network databases, pathway databases.
– Overview of enrichment analysis to find over-represented pathways.
– Pathway analysis of large-scale cancer genomics data sets.
Why Pathway Analysis?
– Dramatic data size reduction: thousands of genes => dozens of pathways.
– Increased statistical power, because fewer hypotheses are tested.
– Genes seldom operate on their own.
– Finds meaning in the “long tail” of rare cancer mutations.
– Generates biologically meaningful hypotheses and helps to identify the mechanism.
What do we need for pathway analysis?
– A list of altered genes, proteins, RNAs, etc.
– A source of pathways or networks (publicly or commercially available).
– A biological question/hypothesis!
1. Biological Question/Hypothesis
What do you want to accomplish with your list? (Hopefully this is part of the experiment design!)
– Summarize biological processes or other aspects of gene function;
– Perform differential analysis: which pathways differ between samples, or between naïve and treated cell lines?
– Find a controller for a process (TF, miRNA);
– Find new pathways or new pathway members;
– Discover new gene function;
– Correlate with a disease or with clinical data attributes.
2. Where Do Gene Lists Come From?
– From high-throughput studies: gene expression profiling, DNA/RNA sequencing, genome-wide association studies (GWAS), ChIP-Seq studies, etc.;
– From public data portals such as ICGC, TCGA, COSMIC, etc., based on the user’s search queries;
– From manual and/or automated (PubTator) literature review (gene lists describing a disease or condition).
Other examples?
Content
– Introduction to pathway and network analysis in cancer genomics.
– Sources of pathway and network information: a) GO biological process, b) pathway databases and c) network databases.
– Overview of enrichment analysis to find over-represented pathways.
– Pathway analysis of large-scale cancer genomics data sets.
What is the Gene Ontology (GO)?
– Dictionary: term definitions.
– A set of biological phrases (terms) that are applied to genes, such as protein kinase, apoptosis, membrane.
– GO is not static!
– All major eukaryotic model organism species are covered.
– The GO ontology files are freely available from the GO website.
What is an ‘ontology’? A data model that represents knowledge as a set of concepts within a domain and the relationships between these concepts.
What Does GO Cover?
GO terms are divided into three aspects:
– cellular component
– molecular function
– biological process
Examples: glucose-6-phosphate isomerase activity (molecular function); cell division (biological process).
GO Structure
– Terms are related within a hierarchy:
  – is-a
  – part-of
– The hierarchy describes multiple levels of detail of gene function.
– Terms can have more than one parent or child.
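Because a term can have several parents, the GO hierarchy is a graph rather than a simple tree. The following is a minimal sketch of walking such a graph upward to collect all ancestors of a term. The term names and relations are hypothetical toy data, not real GO identifiers, and `ancestors` is an illustrative helper, not part of any GO library.

```python
from collections import deque

# Toy ontology (hypothetical term names, not real GO identifiers).
# Each term maps to a list of (parent, relation) pairs.
PARENTS = {
    "term_root": [],
    "term_a": [("term_root", "is-a")],
    "term_b": [("term_root", "is-a")],
    "term_c": [("term_a", "is-a"), ("term_b", "part-of")],  # two parents
    "term_d": [("term_c", "is-a")],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following is-a/part-of links upward."""
    found = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent, _relation in parents.get(current, []):
            if parent not in found:
                found.add(parent)
                queue.append(parent)
    return found


print(sorted(ancestors("term_d")))
```

In this toy graph a gene annotated to `term_d` would also count toward `term_c`, `term_a`, `term_b` and `term_root`, which is how one annotation can contribute to several levels of detail.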
Pathway Databases
Advantages:
– Usually curated.
– Biochemical view of biological processes.
– Cause and effect captured.
– Human-interpretable visualizations.
Disadvantages:
– Sparse coverage of genome.
– Different databases disagree on boundaries of pathways.
PATHWAY DATABASE EXAMPLE: Reactome
– Hand-curated pathways in human.
– Rigorous curation standards: every reaction is traceable to primary literature.
– As of October 2015, there are 1,887 human pathways and 8,609 human proteins (version 54).
– Open access.
G1/S DNA damage checkpoint
Pathways vs. Networks
Pathways:
– Detailed, high-confidence consensus
– Biochemical reactions
– Small-scale, fewer genes
– Concentrated from decades of literature
Networks:
– Simplified cellular logic, noisy
– Abstractions: directed, undirected
– Large-scale, genome-wide
– Constructed from omics data integration
Network Databases
– Can be built automatically or via curation.
– More extensive coverage of biological systems.
– Relationships and underlying evidence are more tentative.
Popular sources of curated networks:
– BioGRID: curated interactions from literature; 529,000 genes, 167,000 interactions.
– IntAct: curated interactions from literature; 60,000 genes, 203,000 interactions.
– MINT: curated interactions from literature; 31,000 genes, 83,000 interactions.
– Reactome FI network: curated + machine learning; ~11,000 human genes, 180,000 interactions.
Reactome Functional Interaction (FI) Network (~5% of the network is shown)
Takeaway message: There are GO-, pathway- and network-based ways to analyze your gene list. DO ALL THREE!!!
Enrichment Test (ICGC portal)
Enrichment Test (introduction)
Inputs to the enrichment test:
– Gene list from microarray, RNA-seq, CNV, or WES experiments;
– Gene-set (pathway) databases (Reactome, KEGG);
– Background list (all genes tested).
Output: an enrichment table listing pathways (e.g. cell cycle, apoptosis) with their p-values.
Hypergeometric test
– Background list: N = 1000 genes, of which m = 100 belong to EGFR signaling.
– My gene list: n = 5 genes, of which k = 3 belong to EGFR signaling.
– Null hypothesis: the list is a random sample from the population.
– Alternative hypothesis: the list contains more “pathway” genes than expected.
– p-value = $P(X \ge k) = \sum_{i=k}^{\min(n,m)} \binom{m}{i}\binom{N-m}{n-i} / \binom{N}{n}$
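The p-value for this example (N = 1000, m = 100, n = 5, k = 3) can be checked by hand. The sketch below assumes the usual one-sided hypergeometric test, i.e. the probability of seeing k or more pathway genes in a random list of n genes. The function name `hypergeom_pvalue` is illustrative, and only the Python standard library is used.

```python
from math import comb


def hypergeom_pvalue(N, m, n, k):
    """One-sided p-value P(X >= k).

    N: size of the background list
    m: background genes that belong to the pathway
    n: size of my gene list
    k: genes in my list that belong to the pathway
    """
    total = comb(N, n)  # number of possible gene lists of size n
    upper = min(n, m)   # largest overlap that is possible at all
    favourable = sum(comb(m, i) * comb(N - m, n - i) for i in range(k, upper + 1))
    return favourable / total


# Example from the slide: 3 of my 5 genes are in EGFR signaling.
p = hypergeom_pvalue(N=1000, m=100, n=5, k=3)
print(f"p-value = {p:.4f}")
```

For these numbers the p-value is a little above 0.008. Note that the result depends on N, so the choice of background list matters.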
Hypergeometric test (online)
Online calculators are available (just google “hypergeometric test calculator”).
Multiple test corrections
– Background list: 1000 genes, of which 100 belong to EGFR signaling.
– Random draws from the background: expect a random draw with the observed enrichment once every 1 / p-value draws.
– With p-value = 9.1e-6, that is one such draw in about 109,890 random draws.
FDR vs Bonferroni correction
– FDR is the expected proportion of the observed enrichments that are due to random chance.
– Compare this to the Bonferroni correction, which is a bound on the probability that any one of the observed enrichments could be due to random chance.
– The Bonferroni correction is very stringent and can “wash away” real enrichments, leading to false negatives.
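To make the difference concrete, here is a minimal sketch that adjusts the same set of p-values both ways. It assumes the Benjamini-Hochberg procedure as the FDR method (the slides do not name a specific one), and the five p-values are made up for illustration.

```python
def bonferroni(pvalues):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for five pathways (illustrative only).
pvals = [0.001, 0.008, 0.02, 0.03, 0.6]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)

print("Bonferroni significant at 0.05:", sum(q < 0.05 for q in bonf))
print("BH (FDR) significant at 0.05:  ", sum(q < 0.05 for q in bh))
```

With these made-up values Bonferroni keeps two pathways while the FDR approach keeps four, which illustrates how the stricter correction can discard enrichments that FDR retains.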
Takeaway message 2:
– The hypergeometric test is a powerful statistical tool: use it (not only for pathway analysis).
– Don’t forget multiple test correction: the FDR or q-value should drive your decision (not the p-value).
– Keep in mind N, the number of genes/proteins in your total population: it may influence your final output.
Christina Yung
Pathway/network analysis workflow overview
Pathway-based route:
– Run enrichment analysis using g-Profiler, ICGC, Reactome, gtools, etc.
– Browse significant pathways in Reactome.
Network-based route (Reactome FI network Cytoscape plugin: ReactomeFIViz):
– Build protein interaction subnetwork.
– Run clustering algorithm.
– Run enrichment analysis of each module individually.
Then:
– Drill down to understand molecular mechanism.
– Validate your model (in wet lab).
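The first two network-based steps (build a subnetwork, then split it into modules) can be sketched in a few lines. This is only an illustration under loud assumptions: the interaction list and gene names are hypothetical, and connected components are used as a very simple stand-in for a real clustering algorithm. It does not reproduce what ReactomeFIViz does.

```python
from collections import defaultdict


def induced_subnetwork(edges, genes):
    """Keep only interactions where both partners are in the gene list."""
    genes = set(genes)
    return [(a, b) for a, b in edges if a in genes and b in genes]


def connected_components(edges):
    """Group genes into modules; a stand-in for a proper clustering algorithm."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, modules = set(), []
    for start in sorted(neighbours):
        if start in seen:
            continue
        stack, module = [start], set()
        while stack:
            node = stack.pop()
            if node in module:
                continue
            module.add(node)
            stack.extend(neighbours[node] - module)
        seen |= module
        modules.append(sorted(module))
    return modules


# Hypothetical interaction network and gene list (placeholder names).
network = [("G1", "G2"), ("G2", "G3"), ("G3", "G9"), ("G4", "G5"), ("G6", "G7")]
my_genes = ["G1", "G2", "G3", "G4", "G5", "G6"]

subnetwork = induced_subnetwork(network, my_genes)
modules = connected_components(subnetwork)
print(modules)
```

Each resulting module would then go through its own enrichment analysis. Genes from the list that have no interaction partner inside the list (here `G6`) drop out of the subnetwork.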
Pancreatic cancer specific subnetwork:
– Module 0: ERBB, FGFR, EGFR signaling, Axon guidance
– Module 1: Hedgehog, TGFβ signaling
– Module 2: p53 signaling
– Module 3: Wnt & Cadherin signaling
– Module 4: Translation
– Module 5: Axon guidance
– Module 6: Ca2+ signaling
– Module 7: ECM, focal adhesion, integrin signaling
– Module 8: MHC class II antigen presentation
– Module 9: Rho GTPase signaling
– Module 10: Spliceosome
Takeaway message 3:
– Try different tools: gtool, g-Profiler, GeneMANIA, Reactome FI network, etc.
– Watch out for non-relevant enriched pathways!
– If no significant pathways were detected (and all possible mistakes were excluded), please don’t be disappointed. Maybe your pathway hasn’t been curated yet.
All lectures on pathway- and network-based analysis are available here (free access): network-analysis-omic-data-2015