Exploratory Gene Association Networks October 2009 Jesse Paquette Helen Diller Family Comprehensive Cancer Center University of California, San Francisco.


What EGAN is
Software that runs on a biologist’s computer
– Java 6 and Java WebStart
– Uses Cytoscape libraries for graph rendering
A searchable library of genes and gene annotation
– Links out to web resources (Entrez/PubMed/KEGG/Google/etc.)
A visualization tool that shows how genes and annotation terms are related
– User constructs dynamic hypergraphs using experiment results and enrichment statistics

Why EGAN was made
To accelerate exploratory assay analysis by providing a pre-compiled knowledge network
As an alternative to presenting exploratory assay results as gene lists
To allow researchers to combine multiple analysis results from potentially different platforms

Exploratory assays
AKA high-throughput experiments
– Measure hundreds to millions of entities
Empirical assays
– Expression microarrays
– aCGH
– MS/MS proteomics
– Yeast two-hybrid interaction assays
– QTL/SNP associations
– DNA methylation
– ChIP chips
– Next-gen sequencing
In-silico algorithms
– Sequence
– Structure
– Literature

The exploratory assay workflow

Post-computational analysis questions
Given a set of entities (genes), S:
– How are the entities in S related to each other?
– What annotation terms/pathways are enriched in S?
– How are the entities in S and the annotation terms related?
– Are there any pertinent literature references?
– Are there any entities not in S that have relationships with multiple entities in S?
– How does S compare to the set published by Soandso et al.?
– What changes when entities are added to or removed from S? (e.g. when the p-value cutoff is changed)
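One of the questions above, whether entities outside S are related to several entities in S, can be sketched in a few lines. This is only an illustration: the function name, the edge-list representation, and the toy gene names are assumptions made here, not part of EGAN.

```python
from collections import defaultdict


def third_party_genes(edges, selected, min_links=2):
    """Return genes outside `selected` that are linked to at least
    `min_links` distinct genes inside it (hypothetical helper)."""
    selected = set(selected)
    linked = defaultdict(set)
    for a, b in edges:
        # Record each edge from the point of view of the gene outside S.
        if a in selected and b not in selected:
            linked[b].add(a)
        elif b in selected and a not in selected:
            linked[a].add(b)
    return {g: sorted(m) for g, m in linked.items() if len(m) >= min_links}


# Toy network; the gene names are placeholders, not EGAN data.
edges = [("A", "X"), ("B", "X"), ("C", "Y"), ("A", "B"), ("X", "Y")]
print(third_party_genes(edges, {"A", "B", "C"}))  # {'X': ['A', 'B']}
```

Here X is reported because it touches two members of S, while Y touches only one.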

EGAN lets the biologist investigate results quickly and independently
Point-and-click interface
– Buttons
– Context-specific pop-up menus
– Spreadsheet-like data tables
– Graph visualization
All network information is pre-collated
– No programming/scripting
– No data transfer/download steps
Automated gene-level integration of multiple experiment results

How are computational analysis results commonly presented to the biologist?

Gene lists
– Show gene annotation (but too much at once)
– Do not show gene-gene relationships
Enriched annotation lists
– Do not identify the genes annotated with each term
– Do not show which genes share annotation terms
Gene graphs
– Show gene-gene relationships
– Do not adequately show annotation

Gene lists

Reducing information by significance cutoff

Reducing information by taking away genes
Prevents the user from wasting time investigating actual negatives
But what about genes that just missed a stringent cutoff?
– These genes are likely to have some importance
– Biologists are often given the impression that genes that fail to pass the cutoff are negatives
Valuable information is lost by only focusing on a “significant” set
– See Gene Set Enrichment Analysis (GSEA), Subramanian (2005)

Enriched annotation lists

What is enrichment?
Annotation terms/pathways define sets of genes
Enrichment
– Overrepresentation
Set-based enrichment
– Given a significant set of genes, S (or a cluster)
– Use the hypergeometric distribution to score the overlap between each gene set, T, and S
Global empirical enrichment
– Use generated statistics for each gene in the assay
– Summarize the statistics for all genes in each set, T
– Test to see if the statistics show a non-random trend
– GSEA

Enriched annotation lists

Gene graphs

Canonical pathway maps
Start with fixed pathway graph
Color the gene nodes by empirical values (only significant genes?)
Enriched annotation terms not shown
Most useful when
– This pathway is expected to be affected in experiment
– Little interest in other pathways/unassigned genes
– Most genes in pathway graph have significant empirical data values
– These conditions are rare in exploratory experiments
GenMAPP, Dahlquist (2002)

Association enrichment graphs
Calculate enrichment of terms
Nodes are annotation terms
Edges are ontological relationships
Color represents enrichment score
What about other annotation types?
Which genes are implicated?
BiNGO, Maere (2005)

Custom gene set graphs
Start with significant set of genes or cluster
Show gene-gene relationships as edges
How is gene annotation shown?
– Hypergraphs
Ingenuity IPA, PubGene, Jensen (2001)

Hypergraphs
A graph is a collection of nodes and edges
A hypergraph is a graph with hyperedges
A hyperedge is a set of nodes
– Annotation terms and pathways are hyperedges
Choice of hypergraph visualization method (HVM) is critical as the number of nodes and hyperedges scales upwards

Hypergraph visualization methods

HVM: Venn diagram
Draw a curve around nodes in a set
Shows hyperedge overlap effectively
Limited to 3 hyperedges
No legend required

HVM: Clique
Use edges to fully connect all nodes in a set
Scales poorly
– For a hyperedge with n nodes, 0.5n^2 - 0.5n edges must be used
Layout algorithms use additional edges
Legend required

HVM: Node-coloring
Give all nodes in a set the same color or shape (Ingenuity uses shapes)
Scales poorly
– Nodes associated with multiple hyperedges must be divided
– Hyperedge count limited to number of distinguishable colors
Layout algorithms do not use hyperedges
Shows hyperedge overlap poorly
Legend required

HVM: Association node
Hyperedges as association nodes on the graph
– Connect each association node to its node members
– Incomplete, semi-bipartite graph
– Association nodes given different shapes/colors
Scales well
– For a hyperedge with n nodes, 1 node and n edges must be used
Extra association nodes/edges complicate dense graphs
– Exploratory assay gene graphs are sparse
Layout algorithms use hyperedges
No legend required
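The scaling difference between the clique and association node methods can be checked with a small sketch. The function names and the toy hyperedges are assumptions for illustration only; EGAN itself renders graphs through Cytoscape libraries in Java.

```python
def clique_edge_count(n):
    """Edges needed to fully connect n nodes: 0.5n^2 - 0.5n."""
    return n * (n - 1) // 2


def association_node_edges(hyperedges):
    """Turn each hyperedge (term -> member genes) into one association
    node connected to each member, giving n edges per hyperedge."""
    return [(term, gene)
            for term, genes in hyperedges.items()
            for gene in sorted(set(genes))]


# Toy hyperedges; the term and gene names are placeholders.
hyperedges = {
    "term_1": ["g1", "g2", "g3", "g4"],
    "term_2": ["g3", "g4", "g5"],
}

edges = association_node_edges(hyperedges)
clique_total = sum(clique_edge_count(len(set(g))) for g in hyperedges.values())
print(len(edges), clique_total)  # 7 9
```

The gap widens quickly: a hyperedge with 50 genes needs 50 edges as an association node but 1225 edges as a clique.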

HVM comparison

EGAN

EGAN features
Entire pre-collated hypergraph is available in memory
– Mostly defined by NCBI Entrez Gene
– Allows dynamic selection of genes and gene sets
Useful interface tools for finding genes and terms/pathways of interest
– Advanced queries using mouse clicks
– Spreadsheet-like tables
– Selective addition and removal of information
Association node HVM
– Thought-provoking display of genes and annotation
Node and Edge references
– Nodes link to NCBI/UCSC/AmiGO/KEGG/etc.
– Edges can link to PubMed

Mockup from 12/2007

EGAN as of 10/2009

Data in the default human gene association network as of 06/08/2009
Node Type | Source | # Nodes | # Edges | Node Links | Edge Links
Gene | NCBI Entrez Gene | 405560 | Entrez Gene, UCSC | N/A
MeSH | NCBI PubMed | MeSH | PubMed ID
Conserved Domain | NCBI Conserved Domain Database | CDD | None
Gene Ontology Process | NCBI Entrez Gene | AmiGO | PubMed ID
MIM | NCBI Entrez Gene | OMIM | None
Gene Ontology Function | NCBI Entrez Gene | AmiGO | PubMed ID
Cytoband | NCBI Entrez Gene | None
Gene Ontology Component | NCBI Entrez Gene | AmiGO | PubMed ID
KEGG | NCBI Entrez Gene | KEGG | None
NHGRI GWA Catalog | NCBI Entrez Gene | PubMed | None
Reactome | NCBI Entrez Gene | 493594 | Reactome | None
PubMed Co-occurrence | NCBI Entrez Gene | N/A | PubMed ID
Chromosomal Sequence | NCBI Entrez Gene | 042468 | N/A | PubMed ID
BioGRID | NCBI Entrez Gene | 024401 | N/A | PubMed ID
IntAct | EBI IntAct | 022229 | N/A | None
HPRD | NCBI Entrez Gene | 017380 | N/A | PubMed ID
MINT | N/A | None
BIND | NCBI Entrez Gene | 03879 | N/A | PubMed ID
Total

The data is fully customizable
The pre-collated network
– Stored as flat, tab-delimited text
– Users can specify alternative/supplemental data files
Updates are easily pushed to the end users
– Using Java WebStart
– Compressed in .jar files (.zip)
Additional gene sets are already available at MSigDB
– Broad Institute, non-redistributable
– EGAN loads gene sets in .gmt and .gmx file formats
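For readers preparing their own gene sets, the .gmt layout is simple enough to read with a few lines of code: each line is tab-delimited, with the set name first, a description second, and the member genes after that. The sketch below is a minimal reader written for illustration; the function name and the sample lines are assumptions, and this is not EGAN's own loader.

```python
def parse_gmt(lines):
    """Parse .gmt lines into {set name: set of gene identifiers}.

    Each line holds: name <TAB> description <TAB> gene1 <TAB> gene2 ...
    Lines with no genes are skipped."""
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3 or not fields[0]:
            continue
        name, _description, *genes = fields
        gene_sets[name] = {g for g in genes if g}
    return gene_sets


# Hypothetical sample content; set and gene names are placeholders.
sample = [
    "SET_ONE\tfirst example set\tGENE_A\tGENE_B\tGENE_C\n",
    "SET_TWO\tsecond example set\tGENE_B\tGENE_D\n",
    "\n",
]
gene_sets = parse_gmt(sample)
print(sorted(gene_sets["SET_ONE"]))  # ['GENE_A', 'GENE_B', 'GENE_C']
```

Because the function accepts any iterable of lines, an open file handle can be passed in place of the sample list.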

Using EGAN: The simple case

Three EGAN use cases
1) Characterize a gene using protein interaction neighbors
2) Characterize a pre-collated gene set
3) Characterize a gene set defined by experiment results

Characterize a gene using protein interaction neighbors
Find gene PPARG in the Entrez Gene Node Table
Show PPARG and all gene neighbors
Hide protein-protein interaction edges
Calculate enrichment for all gene sets
Use enrichment statistics to selectively show association nodes on the graph

PPARG and all protein interaction neighbors

Characterize a pre-collated gene set
Find the conserved domain DDHD in the Conserved Domain Node Table
Show DDHD and all gene neighbors
Hide DDHD association node
Calculate enrichment for all gene sets
Use enrichment statistics to selectively show association nodes on the graph

Genes with the DDHD domain

Characterize a gene set from empirical data
Genes reported by Beier et al. (2007)
Format custom gene sets
Format empirical data (after computational analysis)
Load custom gene set file and empirical file in EGAN
Find custom gene sets in Custom Node Node Table
Show custom sets and all gene neighbors
– Border color shows statistic
– Border width shows p-value
Hide custom set association nodes
Calculate enrichment for all gene sets
Use enrichment statistics to selectively show association nodes on the graph

Gene sets from Beier et al. (2007)

Additional functionality in EGAN
Comparison of multiple experiments/gene sets
– Different normalization methods
– Different analysis parameters
– Different platforms
– Published experiments/gene sets
Discovery of third-party genes not present in S
Characterization of sequence-derived gene sets
– Transcription regulation motifs
– Translation regulation motifs
– Clusters
Scripting for automatic network generation

Future plans
More diverse, more complete, higher quality data
– Species beyond H. sapiens
– Activation/inhibition/modification relationships
Examples with non-microarray empirical data
– SNP, aCGH, MS/MS
Quantitative analysis of the hypergraph
Mapping of samples into gene set space
Restriction of edges by quality parameters
Cytoscape 3.0 plug-in?
Improved graph layout algorithms

Where to get EGAN
– Downloads
– Documentation
– Discussion forum
The EGAN manuscript is currently under review at Bioinformatics

Acknowledgements
UCSF HDFCCC BCB
– Taku Tokuyasu
– Adam Olshen
– Ajay Jain
Use of Cytoscape libraries
– David Quigley
– Scooter Morris
– Alex Pico
– Alan Kuchinsky
Testing
– Donna Albertson
– Antoine Snijders
– Ingrid Revet
– Stephan Gysin
– Ritu Roydasgupta
– Sook Wah Yee
– Scot Federman
– Mike Baldwin
Interpretation of GBM stem cell experiments
– Joachim Silber
Figure editing
– Ben Kopman

Methods

Example custom gene set file format

Example empirical file format

Mapping empirical data to genes
Exploratory assays don’t directly measure genes
Entities may map to multiple genes
– EGAN adds the entity statistic/p-value to all genes
Multiple entities may map to a single gene
– EGAN generates summary statistics/p-values
  – Statistic median (default)
  – P-value median
  – Maximum/minimum |statistic|
  – Minimum/maximum p-value
  – Statistic/p-value mean
Entity-to-gene mapping is customizable
– Tab-based text format
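The two mapping rules above (one entity feeding several genes, and several entities summarized into one gene) can be sketched as follows. The function name, the dictionary layout, and the probe and gene names are assumptions for illustration; only the median default and the maximum |statistic| option come from the slide.

```python
from collections import defaultdict
from statistics import median


def summarize_by_gene(entity_stats, entity_to_genes, summary=median):
    """Copy each entity statistic to every gene it maps to, then reduce
    the values collected for each gene with `summary`."""
    per_gene = defaultdict(list)
    for entity, stat in entity_stats.items():
        for gene in entity_to_genes.get(entity, ()):
            per_gene[gene].append(stat)
    return {gene: summary(values) for gene, values in per_gene.items()}


# Hypothetical probes and genes, not taken from a real platform.
entity_stats = {"probe_1": 2.0, "probe_2": 4.0, "probe_3": -1.0}
entity_to_genes = {
    "probe_1": ["GENE_A"],
    "probe_2": ["GENE_A", "GENE_B"],  # one entity, two genes
    "probe_3": ["GENE_B"],
}

by_median = summarize_by_gene(entity_stats, entity_to_genes)
by_max_abs = summarize_by_gene(
    entity_stats, entity_to_genes, summary=lambda v: max(v, key=abs)
)
print(by_median)   # {'GENE_A': 3.0, 'GENE_B': 1.5}
print(by_max_abs)  # {'GENE_A': 4.0, 'GENE_B': 4.0}
```

Passing the reducer as an argument keeps the other summary options on the slide, such as the mean, a one-line change.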

Set-based enrichment Given a set of genes made visible on graph
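As a rough sketch of set-based enrichment with the hypergeometric distribution: given the number of genes in the background, the size of a gene set T, the size of the visible set S, and their overlap, the p-value is the chance of seeing at least that much overlap at random. The function and parameter names are assumptions, and the choice of background (all genes in the network, or only assayed genes) is left to the caller; this is not a description of EGAN's internal code.

```python
from math import comb


def hypergeom_enrichment_p(universe_size, term_size, selected_size, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap), where X is
    the number of term genes found in a random draw of `selected_size`
    genes from `universe_size` genes."""
    max_overlap = min(term_size, selected_size)
    total = comb(universe_size, selected_size)
    tail = sum(
        comb(term_size, k) * comb(universe_size - term_size, selected_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers: 10 background genes, a term with 3 genes, 3 genes in S,
# and all 3 term genes are in S.
p = hypergeom_enrichment_p(10, 3, 3, 3)
print(p)
```

With these toy numbers only one of the 120 possible draws contains all three term genes, so p is 1/120.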

Global empirical enrichment
Set Enrichment by Empirical Data (SEED)
ParaSEED
– Take statistic for each gene in a set S
– Calculate summary statistics (s-mean, standard deviation, n)
– Two-tailed t-test probability that S is drawn randomly from a normal distribution centered on 0
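The ParaSEED steps read like a one-sample, two-tailed t-test of the set mean against 0, and the sketch below follows that reading, which is an assumption rather than something the slide states outright. All function names are hypothetical. To stay within the standard library, the p-value is obtained by numerically integrating the Student's t density with Simpson's rule; a statistics package would normally do this step.

```python
from math import exp, lgamma, log, pi, sqrt
from statistics import mean, stdev


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_norm - (df + 1) / 2 * log(1 + x * x / df))


def two_tailed_p(t, df, steps=20000):
    """Two-tailed p-value: 1 minus twice the area under the density on
    [0, |t|], using Simpson's rule (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    area = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area *= h / 3
    return max(0.0, 1.0 - 2.0 * area)


def para_seed(set_statistics):
    """Return (t, p) for the gene statistics of one set, tested against 0."""
    n = len(set_statistics)
    if n < 2:
        raise ValueError("need at least two gene statistics")
    s_mean = mean(set_statistics)
    sd = stdev(set_statistics)
    if sd == 0:
        raise ValueError("standard deviation is zero; t is undefined")
    t = s_mean / (sd / sqrt(n))
    return t, two_tailed_p(t, n - 1)


# Hypothetical gene statistics for one set.
t, p = para_seed([1.2, 0.8, 2.1, 1.7, -0.3])
print(round(t, 3), round(p, 4))
```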

Global empirical enrichment
PermuSEED
– Take statistic for each gene in a set S
– Calculate summary statistics (s-mean, n)
– Randomly sample n genes from background p times
– Score is the fraction of sample means that were lower than s-mean
  – Score of 0.001 (p = 1000) means 1 of the 1000 random sample means was lower than s-mean
  – Score of 0.999 (p = 1000) means 999 of the 1000 random sample means were lower than s-mean
PermuSEED absolute
– Use |statistic| for each gene in S
– Pathway gene sets are likely to have activators and inhibitors
– PermuSEED absolute finds gene sets that are strongly affected
– Parametric version might use variance
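A minimal sketch of the PermuSEED score as described above: draw n background statistics p times and report the fraction of draws whose mean falls below the set mean. The function name, the fixed random seed, and the choice to take absolute values of the background as well as the set in the absolute variant are assumptions made for this illustration.

```python
import random
from statistics import mean


def permu_seed(set_stats, background_stats, permutations=1000,
               absolute=False, seed=0):
    """Fraction of random background samples (same size as the set)
    whose mean is strictly lower than the set mean."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    set_stats = list(set_stats)
    background_stats = list(background_stats)
    if absolute:
        # Assumed reading of "PermuSEED absolute": use |statistic| everywhere.
        set_stats = [abs(x) for x in set_stats]
        background_stats = [abs(x) for x in background_stats]
    n = len(set_stats)
    s_mean = mean(set_stats)
    lower = sum(
        mean(rng.sample(background_stats, n)) < s_mean
        for _ in range(permutations)
    )
    return lower / permutations


# Hypothetical background of 100 gene statistics.
background = [float(x) for x in range(-50, 50)]
print(permu_seed([40.0, 45.0, 48.0], background))
print(permu_seed([-45.0, 46.0], background, absolute=True))
```

Scores near 1 suggest the set sits above the background, and scores near 0 suggest it sits below. The second call shows why the absolute variant helps when a set mixes strongly positive and strongly negative genes.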

Multiple testing adjustment
Set-based enrichment
– Can’t use q-value due to non-uniform distribution of p-values
– Optional permutation-based minP method
  – Westfall & Young (1993)
  – When specifically requested by user
Global empirical enrichment (SEED)
– q-value
  – Automatically generated
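To give a feel for what a q-value adjustment does to a list of SEED p-values, the sketch below uses a simplified stand-in: the Benjamini-Hochberg adjustment, which matches a q-value calculation when the estimated proportion of true nulls is fixed at 1. That simplification, the function name, and the example p-values are assumptions; the slide does not say how EGAN estimates q-values.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in input order.

    Simplified stand-in for q-values (proportion of true nulls taken as 1)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is
    # never larger than the one ranked above it.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values for four gene sets.
print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```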