Pathway and network analysis
  Functional interpretation of gene lists
  Bing Zhang
  Department of Biomedical Informatics, Vanderbilt University
  bing.zhang@vanderbilt.edu
Omics studies generate lists of interesting genes
  Microarray, RNA-Seq, Proteomics
  Differential expression, Clustering
  Lists of genes with potential biological interest
  [Figure: plot with axes log2(ratio) and -log10(p value), next to an example list of probe IDs: 92546_r_at, 92545_f_at, 96055_at, 102105_f_at, 102700_at, 161361_s_at, 92202_g_at, 103548_at, 100947_at, 101869_s_at, 102727_at, 160708_at, ...]
Organizing genes based on pathways
Advantages of pathway analysis
  Better interpretation
    From interesting genes to interesting biological themes
  Improved robustness
    Robust against noise in the data
  Improved sensitivity
    Detecting minor but concordant changes in a pathway
Pathway databases
  Databases
    BioCarta (http://www.biocarta.com/genes/index.asp)
    KEGG (http://www.genome.jp/kegg/pathway.html)
    MetaCyc (http://metacyc.org)
    Pathway Commons (http://www.pathwaycommons.org)
    Reactome (http://www.reactome.org)
    STKE (http://stke.sciencemag.org/cm)
    Signaling Gateway (http://www.signaling-gateway.org)
    WikiPathways (http://www.wikipathways.org)
  Limitations
    Limited coverage
    Inconsistency among different databases
    Relationships between pathways are not defined
Gene Ontology
  Structured, precisely defined, controlled vocabulary for describing the roles of genes and gene products
  Three organizing principles: molecular function, biological process, and cellular component
  Example: dopamine receptor D2, the product of human gene DRD2
    molecular function: dopamine receptor activity
    biological process: synaptic transmission
    cellular component: plasma membrane
  Terms in GO are linked by several types of relationships
    is_a (e.g. plasma membrane is_a membrane)
    part_of (e.g. membrane is part_of cell)
    has_part
    regulates
    occurs_in
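Because GO terms are linked to their parents, a gene annotated to a specific term is implicitly associated with that term's ancestors as well. The sketch below illustrates this on a toy graph that contains only the two relationships quoted on the slide. The dictionary layout and the function name `ancestors` are assumptions made for illustration, not part of any GO tool, and following both is_a and part_of links is a simplification.

```python
from collections import deque

# Toy fragment of an ontology graph: child term -> list of (relationship, parent term).
# Only the two example relationships from the slide are included (hypothetical layout).
TOY_GO = {
    "plasma membrane": [("is_a", "membrane")],
    "membrane": [("part_of", "cell")],
    "cell": [],
}


def ancestors(term, graph):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for _relationship, parent in graph.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


if __name__ == "__main__":
    # A gene product annotated to "plasma membrane" is also associated with these terms.
    print(sorted(ancestors("plasma membrane", TOY_GO)))
```

In a real analysis the graph would be loaded from the ontology files listed on the "Access GO" slide rather than typed in by hand.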
Gene Ontology
Annotating genes using GO terms
  Two types of GO annotations
    Electronic annotation
    Manual annotation
  All annotations must:
    be attributed to a source
    indicate what evidence was found to support the GO term-gene/protein association
  Types of evidence codes
    Experimental codes: IDA, IMP, IGI, IPI, IEP
    Computational codes: ISS, IEA, RCA, IGC
    Author statement: TAS, NAS
    Other codes: IC, ND
  Evidence code definitions
    IDA: inferred from direct assay
    IMP: inferred from mutant phenotype
    IGI: inferred from genetic interaction
    IPI: inferred from physical interaction
    IEP: inferred from expression pattern
    ISS: inferred from sequence or structural similarity
    IEA: inferred from electronic annotation
    RCA: inferred from reviewed computational analysis
    IGC: inferred from genomic context
    TAS: traceable author statement
    NAS: non-traceable author statement
    IC: inferred by curator
    ND: no biological data available
  ND is used when the curator has determined that there is no existing literature to support an annotation.
    This is NOT the same as having no annotation at all
    No annotation means that no one has looked yet
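Evidence codes are often used to filter annotations, for example to keep only experimentally supported ones. The following is a minimal sketch of that idea. The code groupings are taken from the slide; the tuple layout, the function names, and the evidence codes attached to the two DRD2 example annotations are assumptions made for illustration and are not real annotation records.

```python
# Evidence code groups as listed on the slide.
EVIDENCE_CATEGORIES = {
    "experimental": {"IDA", "IMP", "IGI", "IPI", "IEP"},
    "computational": {"ISS", "IEA", "RCA", "IGC"},
    "author statement": {"TAS", "NAS"},
    "other": {"IC", "ND"},
}


def evidence_category(code):
    """Map an evidence code to its category, or 'unknown' if it is not listed."""
    for category, codes in EVIDENCE_CATEGORIES.items():
        if code in codes:
            return category
    return "unknown"


def keep_experimental(annotations):
    """Keep (gene, term, evidence_code) tuples backed by experimental evidence."""
    return [a for a in annotations if evidence_category(a[2]) == "experimental"]


# Hypothetical records: the evidence codes here are invented for the example.
EXAMPLE_ANNOTATIONS = [
    ("DRD2", "dopamine receptor activity", "IDA"),
    ("DRD2", "synaptic transmission", "IEA"),
]

if __name__ == "__main__":
    print(keep_experimental(EXAMPLE_ANNOTATIONS))
```

Whether to exclude electronic annotations is a design choice: it reduces coverage in exchange for more reliable term-gene associations.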
Annotating genes using GO terms
  Parent term: Cell-cell signaling
  Child term: Synaptic transmission (226 human genes)
  Example genes:
    ……
    DLGAP1  discs, large (Drosophila) homolog-associated protein 1
    DLGAP2  discs, large (Drosophila) homolog-associated protein 2
    DNM1    dynamin 1
    DOC2A   double C2-like domains, alpha
    DRD1    dopamine receptor D1
    DRD1IP  dopamine receptor D1 interacting protein
    DRD2    dopamine receptor D2
    DRD3    dopamine receptor D3
    DRD4    dopamine receptor D4
    DRD5    dopamine receptor D5
Access GO
  Downloads (http://www.geneontology.org)
    Ontologies: http://www.geneontology.org/page/download-ontology
    Annotations: http://www.geneontology.org/page/download-annotations
  Web-based access
    AmiGO: http://www.godatabase.org
    QuickGO: http://www.ebi.ac.uk/QuickGO
Coverage of GO annotations

           Homo sapiens        Mus musculus
           #term    #gene      #term    #gene
  GO/BP    6502     15228      6227     15709
  GO/MF    3144     16389      2961     17287
  GO/CC    947      16765      882      16801
Over-representation analysis: concept
  Differentially expressed genes: 581 genes
  Development: 1842 genes
  Observed overlap: 98 genes
  Expected overlap: 65 genes
  Is the observed overlap significantly larger than the expected value?
  [Figure: the two gene lists and their overlap. Genes shown include Hoxa5, Hoxa11, Ltbp3, Sox4, Foxc1, Edn1, Ror2, Gnag, Smad3, Wdr5, Trp63, Sox9, Pax1, Acd, Rai1, Pitx1, …… and Sash1, Cd24a, Agt, Psrc1, Ctla2b, Angptl4, Depdc7, Sorbs1, Macrod1, Enpp2, Tmem176a, ……]
Over-representation analysis: method

                            Significant genes   Non-significant genes   Total
  Genes in the group        k                   j-k                     j
  Other genes               n-k                 m-n-j+k                 m-j
  Total                     n                   m-n                     m

  Hypergeometric test: given a total of m genes where j genes are in the functional group, if we pick n genes randomly, what is the probability of having k or more genes from the group?
  Zhang et al. Nucleic Acids Res. 33:W741, 2005
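The hypergeometric question on this slide can be computed directly from binomial coefficients. The sketch below uses the slide's symbols (m, j, n, k) and only the Python standard library; the function names are chosen for illustration, and this is not claimed to be the implementation used by any particular tool.

```python
from math import comb


def expected_overlap(m, j, n):
    """Expected number of group genes among n genes drawn at random from m."""
    return n * j / m


def hypergeometric_p_value(m, j, n, k):
    """P(X >= k): probability of drawing k or more group genes.

    m: total number of genes
    j: genes in the functional group
    n: genes picked (e.g. the significant genes)
    k: observed overlap between the picked genes and the group
    """
    upper = min(n, j)  # the overlap cannot exceed either list size
    tail = sum(comb(j, i) * comb(m - j, n - i) for i in range(k, upper + 1))
    return tail / comb(m, n)


if __name__ == "__main__":
    # Small toy numbers, not taken from the slides.
    print(expected_overlap(10, 4, 3))
    print(hypergeometric_p_value(10, 4, 3, 2))
```

When many gene sets are tested, the resulting p values are typically adjusted for multiple testing before they are interpreted.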
Over-representation analysis: limitations
  Arbitrary thresholding
  Ignoring the order of genes in the significant gene list
Gene Set Enrichment Analysis: concept
  Do genes in a gene set tend to be located at the top or bottom of the ranked gene list?
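Gene Set Enrichment Analysis: method
  Walk down the ranked gene list and keep a running sum:
    +1/k for each gene in the gene set S
    -1/(n-k) for each gene not in S
  k: Number of genes in the gene set S
  n: Number of all genes in the ranked gene list
  Subramanian et al. PNAS 102:15545, 2005
  http://www.broad.mit.edu/gsea/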
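The running-sum idea with the +1/k and -1/(n-k) increments can be sketched in a few lines. This is the simple unweighted form shown on the slide and is only an illustration: the GSEA software also offers weighted statistics and permutation-based significance, which are not reproduced here. The function name and the toy gene names are assumptions.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score.

    Walks down the ranked list, adding 1/k for each gene in the set and
    subtracting 1/(n-k) for each gene outside it, and returns the running-sum
    value that deviates most from zero. Here k counts the set members that
    are actually present in the ranked list.
    """
    n = len(ranked_genes)
    hits = [gene in gene_set for gene in ranked_genes]
    k = sum(hits)
    if k == 0 or k == n:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        running += 1.0 / k if is_hit else -1.0 / (n - k)
        if abs(running) > abs(best):
            best = running
    return best


if __name__ == "__main__":
    ranked = ["g1", "g2", "g3", "g4"]  # hypothetical genes, most up-regulated first
    print(enrichment_score(ranked, {"g1", "g2"}))  # set concentrated at the top
    print(enrichment_score(ranked, {"g3", "g4"}))  # set concentrated at the bottom
```

A positive score indicates that the set is concentrated near the top of the list, a negative score that it is concentrated near the bottom.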
Pathway-based analysis
  Organizing genes by
    Pathways
    Gene Ontology
  Enrichment analysis methods
    Over-representation analysis
    Gene set enrichment analysis
  Major limitation
    Existing knowledge on pathways or gene functions is far from complete
Biological networks
  Physical interaction networks
    Protein-protein interaction network
      Nodes: proteins
      Edges: physical interaction, undirected
    Signaling network
      Edges: modification, directed
    Gene regulatory network
      Nodes: TFs/miRNAs, target genes
      Edges: physical interaction
    Metabolic network
      Nodes: metabolites
      Edges: metabolic reaction
  Functional association networks
    Co-expression network
      Nodes: genes/proteins
      Edges: co-expression, undirected
    Genetic network
      Nodes: genes
      Edges: genetic interaction
Properties of complex networks
  Human protein-protein interaction network: 9,198 proteins and 36,707 interactions
  Scale-free (hubs)
  Hierarchical modular
  Small world (six degrees of separation)
Network visualization
  Network visualization tools
    Cytoscape (http://www.cytoscape.org)
  Gehlenborg et al. Nature Methods, 7:S56, 2010
Network distance vs functional similarity
  Proteins that lie closer to one another in a protein interaction network are more likely to have similar functions and to be involved in similar biological processes.
  Applications
    Network-based gene function prediction
    Network-based disease gene prediction
  Sharan et al. Mol Syst Biol, 3:88, 2007
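One simple way to make "closer in the network" concrete is the shortest-path distance between two proteins. The sketch below computes it with a breadth-first search over an undirected edge list. The protein names P1 to P5 and the function name are hypothetical, and shortest-path length is only one of several distance measures used in practice.

```python
from collections import deque


def network_distance(edges, source, target):
    """Shortest-path length between two nodes of an undirected network.

    `edges` is an iterable of (node_a, node_b) pairs. Returns None when
    either node is missing from the network or no path connects them.
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    if source not in neighbors or target not in neighbors:
        return None

    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return distances[node]
        for nxt in neighbors[node]:
            if nxt not in distances:
                distances[nxt] = distances[node] + 1
                queue.append(nxt)
    return None


if __name__ == "__main__":
    # Hypothetical interactions between placeholder proteins.
    toy_edges = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
    print(network_distance(toy_edges, "P1", "P3"))
    print(network_distance(toy_edges, "P1", "P4"))
```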
Organizing genes based on network modules
  Protein-protein interaction modules
  Transcriptional regulatory modules
    Transcription factor targets
    miRNA targets
  Network module-based analysis
    Over-representation analysis
    GSEA
WebGestalt: http://www.webgestalt.org
  Input: gene list (e.g. 92546_r_at, 92545_f_at, 96055_at, 102105_f_at, 102700_at, ……)
  ~200 ID types
  ~60K gene sets
  Statistical analysis
  Usage, Jul. 1, 2013 – Jun. 30, 2014: 49,136 visits from 18,213 visitors
  Zhang et al. Nucleic Acids Res. 33:W741, 2005
  Wang et al. Nucleic Acids Res. 41:W77, 2013
WebGestalt output: enriched GO terms
  Example: Response to unfolded proteins, 12 genes, adjp=1.32e-08
WebGestalt output: an enriched pathway
  TGF Beta Signaling Pathway, with input genes highlighted
WebGestalt output: enriched network modules
GSEA: http://www.broadinstitute.org/gsea
GSEA: output
Summary
  Organizing genes by "gene sets"
    Pathways
    Gene Ontology
    Network modules
  Enrichment analysis methods
    Over-representation analysis
    Gene set enrichment analysis
  Tools
    WebGestalt (over-representation analysis)
    GSEA (gene set enrichment analysis)
  Manuals for WebGestalt and GSEA are in the reading folder