Divining Systems Biology Knowledge from High-throughput Experiments Using EGAN
Jesse Paquette
ISMB 2010
Biostatistics and Computational Biology Core, Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco (AKA BCBC HDFCCC UCSF)
High-throughput experiments
This talk applies to
–Expression microarrays
–aCGH
–SNP/CNV arrays
–MS/MS Proteomics
–DNA methylation
–ChIP-Seq
–RNA-Seq
–In-silico experiments
If parts of the output can be mapped to gene IDs
–You can use EGAN
What do you hope to accomplish? Collect data Process data Differential analysis Publish! Clusters and/or gene lists New testable hypotheses Produce insight about the underlying biology New grants! New papers! Drug targets!
Leverage organic intelligence Clusters and/or gene lists New testable hypotheses Produce insight about the underlying biology Summarize Visualize Contextualize
Producing insight from clusters and gene lists
Summarize: find enriched pathways (and other gene sets)
–Hypergeometric over-representation: DAVID
–Global trends: GSEA
Visualize: gene relationships in a graph
–Protein-protein interactions: Cytoscape
–Network module discovery: Ingenuity IPA
–Literature co-occurrence: PubGene
Contextualize: pertinent literature
–PubMed, Google, iHOP
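The hypergeometric over-representation test named above can be sketched in a few lines. This is a minimal illustration of the general statistic, not the code used by DAVID or EGAN; the function name `hypergeom_enrichment_p` and the toy counts are assumptions made up for this example.

```python
from math import comb  # requires Python 3.8+


def hypergeom_enrichment_p(universe_size, set_size, list_size, overlap):
    """Upper-tail hypergeometric p-value.

    Probability of seeing at least `overlap` gene-set members when
    `list_size` genes are drawn without replacement from a universe of
    `universe_size` genes, of which `set_size` belong to the gene set.
    """
    max_overlap = min(set_size, list_size)
    total = comb(universe_size, list_size)
    # math.comb returns 0 when k > n, so impossible terms vanish.
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, list_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 genes measured, a 150-gene pathway,
# a 300-gene list, and 12 genes shared between the two.
p_value = hypergeom_enrichment_p(20000, 150, 300, 12)
print(f"enrichment p-value: {p_value:.3g}")
```

A small p-value says the overlap is larger than expected by chance; when many gene sets are tested at once, a multiple-testing correction would normally follow.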
EGAN: Exploratory Gene Association Networks
Methods: state-of-the-art analysis of clusters and gene lists
–Hypergeometric enrichment of gene sets
–Global statistical trends of gene sets
–Hypergraph visualization (via Cytoscape libraries)
–Literature identification
–Network module discovery
User Interface: responds quickly to new queries from the biologist
–Sandbox-style functionality
–Dynamic adjustment of p-value cutoffs
–Point-and-click interface
–All data in-memory for immediate access
–Links to external websites
Modular: integrates as a flexible plug-and-play cog
–All data is customizable
–Proprietary data can be restricted to the client location
–Java runs on almost every OS (PC, Mac, Linux)
–Can be configured and launched from a different application (e.g. GenePattern)
–Analyses can be scripted for automation
Gene sets
A gene set is a set of semantically related genes
–e.g. Wnt signaling pathway
EGAN contains a database of gene sets
–More than 100k gene sets by default
  KEGG, Reactome, NCI-Nature, Gene Ontology, MeSH, Conserved Domain, Cytoband, miRNA targets
–You can easily add your own
  Simple file format
  Download from MSigDB (Broad Institute)
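To make the idea of a gene set file concrete, here is a minimal sketch that reads gene sets in the tab-separated GMT layout distributed by MSigDB (set name, description, then one gene per column). The slides do not describe EGAN's own file format, so this shows only the MSigDB layout; the function name `parse_gmt` and the example rows are hypothetical.

```python
def parse_gmt(lines):
    """Parse GMT-style lines into a {set_name: set_of_gene_symbols} dict."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed rows
        name, _description, *genes = fields
        gene_sets[name] = {gene for gene in genes if gene}
    return gene_sets


# Toy rows for illustration only, not real MSigDB entries.
example_lines = [
    "WNT_SIGNALING_TOY\thypothetical description\tWNT1\tCTNNB1\tAPC\n",
    "SINGLE_GENE_TOY\tna\tTP53\n",
    "\n",
]
toy_gene_sets = parse_gmt(example_lines)
print(sorted(toy_gene_sets["WNT_SIGNALING_TOY"]))
```

Storing each gene set as a Python `set` keeps the overlap with a gene list down to a single intersection, which is the count an enrichment test needs.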
Gene-gene relationships
EGAN also contains
–Protein-protein interactions (PPI)
–Literature co-occurrence
–Chromosomal adjacency
–Kinase-target relationships
Other possibilities
–Sequence homology
–Expression correlation
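One way to picture several relationship types sharing the same gene nodes is a graph whose edges carry a type label. The sketch below is an illustration only, not EGAN's internal data model; the class `GeneGraph`, the relation labels, and the placeholder gene names are assumptions. It treats every relationship as undirected, which simplifies kinase-target pairs.

```python
from collections import defaultdict


class GeneGraph:
    """Undirected gene-gene graph with a set of relation types per edge."""

    def __init__(self):
        # gene -> partner gene -> set of relation labels
        self._edges = defaultdict(lambda: defaultdict(set))

    def add(self, gene_a, gene_b, relation):
        """Record `relation` between two genes in both directions."""
        self._edges[gene_a][gene_b].add(relation)
        self._edges[gene_b][gene_a].add(relation)

    def neighbors(self, gene, relation=None):
        """Partners of `gene`, optionally restricted to one relation type."""
        partners = self._edges.get(gene, {})
        return {
            partner
            for partner, relations in partners.items()
            if relation is None or relation in relations
        }


# Placeholder gene names and relation labels, for illustration only.
graph = GeneGraph()
graph.add("GENE_A", "GENE_B", "ppi")
graph.add("GENE_A", "GENE_B", "literature")
graph.add("GENE_A", "GENE_C", "literature")
print(sorted(graph.neighbors("GENE_A", relation="ppi")))
```

Keeping the relation type on the edge lets a viewer switch one kind of evidence on or off without rebuilding the graph.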
Example with microarray and aCGH results
Mirzoeva et al. (2009) Cancer Research
–UCSF-LBL collaboration
–Analysis of breast cancer cell lines
  Basal vs. luminal
Discoveries in this presentation
–miRNA regulator of subtype (mir-200)
–Annexin (ANXA1) as potential regulator of ER, glucocorticoid and EGFR signaling
Gene list - higher expression in basal cell lines
Gene set/pathway enrichment
Importing gene lists from publications
Combining expression with aCGH
Finding network modules
Where to find EGAN Website – paper in Bioinformatics –
Acknowledgements
BCBC HDFCCC UCSF
–Taku Tokuyasu
–Adam Olshen
–Ritu Roy
–Ajay Jain
LBNL
–Debopriya Das
–Joe Gray
Funding
–UCSF Cancer Center Support Grant
UCSF
–Early adopters
  Ingrid Revet
  Antoine Snijders
  Stephan Gysin
  Sook Wah Yee
  Joachim Silber
–Cytoscape gurus
  David Quigley
  Scooter Morris
–OTM
  David Eramian
  Ha Nguyen
–Laura van ’t Veer
–Donna Albertson
–Graeme Hodgson