Sage Infrastructure Tools Project

Presentation transcript:

Project C: Sage Infrastructure Tools Project
Carole Goble, University of Manchester, UK; Ted Liefeld, Broad Institute; Alex Pico, Gladstone Institutes; Marc Hadfield, Alitora

Tools Afternoon Session
Review of developments to date: creating a semantic model for Sage networks; storing Sage networks with Alitora for search and visualization; performing Key Driver Analysis with GenePattern; a Taverna workflow for annotating and analyzing the network model; working with Sage networks in Cytoscape.
Other network model tools: additional tool providers discuss integrating with Sage.
Looking forward: open questions and gaps; breakout sessions.

Project Workstream C: Tools
Diagram labels: Raw Datasets; Annotated & Standardized; Network Inference; Infrastructure Tools; Access & Analysis.
From the first two workstreams, you’ve seen how a variety of primary datasets will be handled and processed, including how they will be annotated in a standardized fashion. This work will result in a large set of inferred networks, including coexpression, Bayesian and causality networks. The focus of Project C is to organize the network content, making it easy to access and easy to work with. To this end, we’ve specified some file formats for networked data, including a semantic representation that integrates information across networks, and we’ve built a couple of example pipelines demonstrating the analysis and visualization of Sage Commons networks.

Core principles
Maximize access. Maximize use. Maximize reuse. Distribute multiple file formats. Make use of existing standards and tools. Design for flexible, extensible solutions. Support collaboration and community annotation.

Overview: Sage Commons Work Group C - Tools
Formats and identifiers.
Alitora: query networks; store annotations; web/API access.
GenePattern: Key Driver Analysis; integration with Cytoscape.
Taverna: service/tool integration; workflow re-use; large-scale and systematic data analyses.
Diagram labels: Formats; Identifiers; Services.

The SAGE Pipeline
[Figure: pipeline diagram. Labels: Network; Data; FORMAT; Taverna; R-Script; Cytoscape; Visualisation; Re-integrate.]

Session for Project C: Tools
Sage Semantic Ontology (Data Model).
Direct Download: just give me the data.
Search and Browse: web interface.
Interactive Analysis: extensible workflows (GenePattern workflow, Taverna workflow, Cytoscape workflow).
Related Tools: related communities (SCF/SWAN, Tim Clark; Bio2RDF, Michel Dumontier).

RDF (Semantic) Standard triple: base unit of “meaning”…
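
To make the idea of a triple concrete, here is a minimal sketch in plain Python. It is not the Sage ontology or the Alitora API: the `Triple` type, the `match` helper and all `ex:` identifiers are hypothetical names invented for illustration. Each triple is a (subject, predicate, object) statement, and a query is a pattern in which `None` acts as a wildcard.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    """One statement: subject, predicate, object."""
    subject: str
    predicate: str
    obj: str


# Hypothetical example statements; the identifiers are placeholders.
TRIPLES = [
    Triple("ex:geneA", "ex:interactsWith", "ex:geneB"),
    Triple("ex:geneA", "ex:associatedWith", "ex:diseaseX"),
    Triple("ex:geneB", "ex:memberOf", "ex:network1"),
]


def match(triples: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return the triples that fit the pattern; None matches anything."""
    return [
        t for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Everything stated about ex:geneA.
about_gene_a = match(TRIPLES, subject="ex:geneA")
```

Because every fact has the same three-part shape, statements from different networks can be pooled into one list and queried with the same pattern.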

Semantic LinkedData

Sage Ontology (OWL)

Tools and Semantics

Tools and LinkedData

Direct Download
Go to http://sagebase.org/commons to access standardized datasets and networks contributed to Sage Commons. Download networks as: formatted text files (.tab), simple interaction files (.sif), Cytoscape session files (.cys) or semantic OWL files (.owl).
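
As an illustration of how little is needed to consume one of these downloads, here is a minimal sketch of a reader for the simple interaction format (.sif). It assumes the usual SIF layout of `source interaction target [target ...]` per line, with a lone name denoting an unconnected node. The function name `parse_sif` and the sample gene names are hypothetical, and the per-line choice between tab and whitespace delimiters is a simplification.

```python
def parse_sif(text):
    """Parse SIF text into (nodes, edges).

    nodes is a set of node names; edges is a list of
    (source, interaction, target) tuples in file order.
    """
    nodes = set()
    edges = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        # Simplification: use tabs when the line has any, else whitespace.
        fields = line.split("\t") if "\t" in line else line.split()
        source = fields[0]
        nodes.add(source)
        if len(fields) < 3:
            continue  # a lone name is an unconnected node
        interaction = fields[1]
        for target in fields[2:]:
            nodes.add(target)
            edges.append((source, interaction, target))
    return nodes, edges


# Hypothetical sample content, not a real Sage Commons network.
SAMPLE = "geneA\tpp\tgeneB\tgeneC\ngeneD\n"
sample_nodes, sample_edges = parse_sif(SAMPLE)
```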

Repository of Sage Networks: Web App; Plug Ins; Alitora’s Semantic Repository

Repository of Semantic Data Copyright Alitora Systems, Inc. 2009

Semantic Repository
Graph database: designed for network storage and query; scalable to billions of data objects; federated; cloud-deployable.
Web-scale indexing: 1 billion RDF triples/hour; 1000 QPS/CPU (“semantic select”).
Clustering algorithms in graph elements: queries can focus on relevant cluster(s); a typical query is 1-to-1 to the relevant cluster; worst-case query performance is that of an inverted index.
As per semantic queries, there are no “joins”.
Full pathway queries.

Knowledge Relevancy
Algorithms help determine which knowledge is important across billions of facts. Sage “KDA” is an example of an algorithm to find important “nodes” in the networks. Relevancy can be based on graph topology.
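
A very simple example of topology-based relevancy is ranking nodes by how many edges touch them. The sketch below is only that: it is not Sage’s KDA algorithm or anything Alitora is stated to use, and the function name `rank_by_degree` and the toy edge list are made up for illustration.

```python
from collections import Counter


def rank_by_degree(edges):
    """Rank nodes by degree (number of incident edges), highest first.

    edges is an iterable of (source, target) pairs, treated as undirected.
    Ties are broken alphabetically so the result is deterministic.
    """
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    return sorted(degree.items(), key=lambda item: (-item[1], item[0]))


# Toy network: node "a" touches three edges, so it ranks first.
TOY_EDGES = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
ranking = rank_by_degree(TOY_EDGES)
```

Degree is the crudest topological score; it is used here only because it fits in a few lines and makes the notion of an “important node” tangible.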

Collaborative Interface

SageCommons Web Demo

Search and Browse
Go to http://saas.alitora.com/sagedemo/ to access the web interface to the semantic database. Anonymous access is available; log in to store and share findings. Identify networks for download, visualization and workflows.

The first page you see when you go to the Sage Commons Demo website is a list of Sage Commons networks. These networks have been normalized into a semantic data format and integrated into an instance of Alitora’s database system dedicated to storing, querying and accessing Sage Commons networks. From here you can browse information about the networks, including literature references, or download the network as an .rdf file. Click on DETAILS to learn more about the network…

Here you’ll find a dedicated page for the network, providing references, annotations and even network attributes from the inference methods used to generate the network. Click on RELATIONS to view the interactions that make up the network.

Here you’ll find a graphic representation of the interaction and its participants. You can click on either participant to learn more about the individual genes. In the next scenario, we will perform a keyword search and work our way from a gene reference to a network.

In the Meme Search field you can search the Commons using keywords for diseases, tissues or gene and protein names. In this scenario, we are going to search by keyword. Type “breast cancer” into the Meme Search tab and click “Search” (or hit return). The search returns a number of genes and OMIM terms. Each gene result includes links for more details, plus a list of associated Sage Networks and interactions, from which you can drill down into more details. Click on DETAILS to learn more about the gene…

… description, synonyms and xrefs, and other annotations are found on the detailed gene page.

And clicking on details under Associated Networks will take you to the network details page…

…where you’ll find a description, references, a link to all interactions in the network, and a download link for an RDF version of the network.

When you log into your user account, you have the added capability to store a network, interaction or gene in your own memory, where you can follow up later or choose to share the information with selected colleagues. Furthermore, your Alitora “memory” is persistent in the context of the Cytoscape plugin as well.

Sage Commons Demo
Diagram labels: Open API; Web interface; Cytoscape plugin.
Alex will demo the Alitora Cytoscape plugin later. Note that both the web interface and the Cytoscape plugin take advantage of the same Alitora semantic database and open API.

Interactive Analysis
Extensible workflows direct Sage Commons networks through customizable pipelines for analysis and visualization: access the semantic database of networked data; perform Key Driver Analysis (KDA); write results back to the database; visualize the network and results in Cytoscape.

GenePattern Workflow

An integrative genomics analysis platform with: a comprehensive repository of tools; construction of flexible, reproducible analysis workflows; the ability to add new tools easily; an interface accessible to many levels of user; configurability to available compute resources. www.genepattern.org

GenePattern: A platform for integrative genomics
[Figure: GenePattern architecture. Labels: Module Repository; Pipeline Environment; Client User Interfaces; Web; Programming; Module Integrator. Modules shown include Preprocess, KNN, PCA, SVM, NMF, SOM Clustering, GISTIC, GSEA, FLAME, CBS, Class Neighbors, Weighted Voting Cross-Val, Weighted Voting Train/Test, SOM Cluster Viewer, Marker Selection Viewer and Prediction Results Viewer; datasets all_aml_train and all_aml_test (Golub and Slonim et al. 1999).]
GenePattern is a platform that consists of several different, interrelated pieces. To see what GenePattern is, let’s see more about these pieces in the context of our integrative genomics example.

~120 GenePattern Modules
Areas: gene expression; SNP arrays and aCGH; proteomics; pathway analysis; flow cytometry; statistical methods; data retrieval and formatting. Developed both by the Broad and external contributors.
Number one: it is an analysis tool, with an extensive and ever-growing collection of analysis modules. What is a module? Any piece of code that performs an analysis: a Perl script, a Java application, a call to a Web service, a database query, anything that can be expressed in a script or executable that has a command line. Some of these are standard analyses like clustering, some are our own novel research, and many are external contributions from other research groups. Because there are so many of them it’s not possible to go into detail about any one of them, but to give you a sampling

GenePattern Software Release Information
GenePattern is a winner of the 2005 BioIT World Best Practices Award.
Release: originally released 2004; current version 3.2.1, released November 2009; currently 12,000+ users, 500+ organizations, ~90 countries.
Availability: freely available; runs on Windows, Mac OS and Linux platforms.
Resources: http://www.genepattern.org; user workshops, documentation, email help desk, online user forum. Reich et al. (2006) Nature Genetics.
Collaborations with 2 NIH Biomedical Computing Roadmap Centers and NCI’s cancer biomedical informatics grid (caBIG).
And now I want to switch gears and discuss a more recent project that addresses a new, growing need in integrative genomics:

GenomeSpace: a Web 2.0 community to share diverse computational tools. www.genomespace.org
Slide labels: 6 seed tools (Cytoscape, Galaxy, GenePattern, Genomica, IGV, UCSC Browser); 3 driving biological projects (cancer, lincRNAs, stem cell circuits); outreach to new tools; outreach to new DBPs; partner institutions.
A newly started project to build a Web 2.0 community with social networking to share, find, and interoperate diverse computational tools. Tools retain their identity and use as stand-alone software, and GenomeSpace maintains their native look and feel. Seeded with 6 popular genomics tools representing diverse architectures (Cytoscape, Galaxy, GenePattern, Genomica, IGV, UCSC Browser). Supports interoperability through frictionless data transfer, with reproducibility, analytic workflows and comprehensive documentation. Development is driven by 3 driving biological projects (cancer, lincRNAs and stem cell circuits). Lives in the cloud. Next phase: engage new tools; engage new biomedical projects. Current participating institutions.

GenomeSpace
Following on from GenePattern, we have just begun developing GenomeSpace, a Web 2.0 community for integrative genomics analysis: share methods; find and work with others’ methods; access support features. Bring a dynamic universe of genomics analysis tools to the fingertips of biologists. www.genomespace.org

GenomeSpace
Provide a core Common Layer for Interoperability: protocols for inter-tool communication; cloud-based project folders and analyses; detailed provenance recording.
Project 2: Integration of Driving Software Projects: GenePattern (Broad); UCSC Genome Browser (UCSC); Genomica (Weizmann Inst.); IGV (Broad); Cytoscape (UCSD); Galaxy (Penn. State).
Project 3: Driving Biological Problems: dissect regulatory networks in cancer stem cells; functionally characterize lincRNAs in mammalian genomes; decipher the transcriptional network of hematopoiesis.

GenomeSpace
GenomeSpace would provide another strong environment for the analysis of Sage data. GenomeSpace is actively seeking additional use cases and partners. GenePattern and Cytoscape participation in GenomeSpace provide an avenue for Sage integration. GenomeSpace is now 6 months into a 4 year plan. www.genomespace.org

Performing Key Driver Analysis in GenePattern
Sage provided R scripts that perform the KDA analysis. These were wrapped as a GenePattern (GP) module. GP generated a web user interface and web service for KDA. This web service was used to integrate KDA into Taverna.
A demonstration GenePattern pipeline (workflow): calculates differentially expressed genes in a TCGA dataset; performs KDA using a Sage breast cancer network model and the gene list from the differentially expressed genes; reformats the KDA output for Cytoscape; launches Cytoscape to visualize the results.

Key Driver Analysis Demo

Taverna Workflow

Taverna
A suite of tools for bioinformatics. Fully featured, extensible and scalable scientific workflow management system: workbench, server, portal. Standards-compliant provenance collection. Immediate ingest of web services, Grid services, Beanshell scripts, R scripts, BioMOBY services… Web 2.0 social collaboration environments (“E-Labs”) for sharing methods and workflows; systems biology data, models and SOPs; statistical methods. Curated catalogue of Web Services. Taverna REST and Cloud coming soon.

Taverna Open Suite of Tools
[Figure: Taverna architecture. Labels: Workflow Repository; Client User Interfaces; Workflow GUI Workbench; Third Party Tools; Service Catalogue; Provenance Store; Workflow Server; Working on Full OSGi and Web Portal; Activity and Service Plug-in Manager; Open Provenance Model; Programming and APIs; Secure Service Access.]

1000s of Services developed by the community
Any SOAP-based service (REST services soon), Grid services, R scripts, Beanshell scripts, Java programs, BioMart queries.
Application areas: gene expression; SNP arrays and aCGH; proteomics; pathway analysis; systems biology model building; sequence analysis; protein structure prediction; gene/protein annotation; microarray data analysis; QSAR studies; medical image analysis; epidemiology; model simulation; high-throughput screening; phenotypical studies; phylogeny; statistical analysis; text mining; data retrieval and formatting; QTL studies.
Also labelled on the slide: CDK.

Taverna Software Release Information
Release: Taverna first released 2004; current versions 1.7.2 and Taverna 2.1.2; currently 1500+ users per month, 350+ organizations, ~40 countries, 80000+ downloads across versions.
Availability: freely available, open source (LGPL), on Windows, Mac OS and Linux platforms.
Resources: http://www.taverna.org.uk, http://www.mygrid.org.uk; user and developer workshops, documentation, email help desk.
Collaborations with numerous groups including NCI’s cancer biomedical informatics grid (caBIG), EMBL-EBI, NCBI, Concept Web Alliance, Bio2RDF.

myExperiment
A Web 2.0 community for sharing, discovering and reusing workflows and other scientific methods. A platform for launching workflows.
Launched late 2007. Currently: 3272 members, 223 groups, 1024 workflows, 306 files and 97 packs, 56 different countries. 10+ workflow systems: Taverna, Pipeline Pilot, BioExtract, Kepler. ~3000 unique hits per month.
A scientific gateway: workflow launch. REST APIs: Facebook, Google Gadgets, Android, PubMed, Twitter. Search: OpenSearch API. Social: OpenSocial, FOAF, SIOC. Identity: OpenID, OAuth. Open repository: Dublin Core, OAI-ORE. Semantic Web: RDF mirror, SPARQL, ontology, Linked Open Data.
Software: open source, BSD.

Systems Biology and myGrid
SysMO-SEEK: e-Laboratory for interlinking and sharing data, models, SOPs and workflows for Systems Biology in Europe; ISA-TAB and SBML/MIRIAM compliant. http://www.sysmo-db.org/
ONDEX: network-based analysis environment for Systems Biology; uses Taverna workflows and text mining. http://www.ondex.org/

myGrid
Taverna provides another strong environment for the analysis of Sage data and for linking data with external analytics and data resources, including text mining with publications. SysMO-SEEK, myExperiment and BioCatalogue are community collaboration resources for sharing Sage methods, models and data. ONDEX is a potentially powerful network analysis tool for Sage Bionetworks.

Performing Taverna KDA and Pathways pipeline
A demonstration Taverna pipeline (workflow): calculates differentially expressed genes in a TCGA dataset; performs KDA using a Sage breast cancer network model and the gene list from the differentially expressed genes; reformats the KDA output for Cytoscape; launches Cytoscape to visualize the results; extracts gene names from the TCGA dataset; finds pathways for these genes in KEGG using a workflow deposited in myExperiment.

Taverna pathway pipeline demo Workflow http://www.myexperiment.org/workflows/1191

Taverna Workflow
[Figure: workflow diagram. Labels, in slide order: Download files; Key Driver Analysis; Reformat; Cytoscape; Sage; Extract Entrez ids; Map to KEGG gene ids; Find KEGG Pathways; Re-format results; View Results.]
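
The pathway branch of this workflow is a chain of small steps, each consuming the previous step’s output. The sketch below shows that shape in plain Python; it is not Taverna code and calls no KEGG service. Every function name, the toy gene ids, the `TOY_PATHWAYS` lookup table and the `hsa:` prefix for KEGG gene ids are assumptions made for illustration.

```python
# Hypothetical stand-in for a KEGG lookup; a real workflow would call a service.
TOY_PATHWAYS = {
    "hsa:1001": ["pathway_X"],
    "hsa:1002": ["pathway_X", "pathway_Y"],
}


def extract_entrez_ids(rows):
    """Pull the Entrez id out of each input row."""
    return [row["entrez_id"] for row in rows]


def map_to_kegg_ids(entrez_ids):
    """Assumption: a KEGG gene id is an organism prefix plus the Entrez id."""
    return ["hsa:" + gene_id for gene_id in entrez_ids]


def find_pathways(kegg_ids):
    """Look each gene up in the toy table; unknown genes get no pathways."""
    return {gene: TOY_PATHWAYS.get(gene, []) for gene in kegg_ids}


def reformat_results(pathways_by_gene):
    """Flatten to one tab-separated 'gene<TAB>pathway' line per hit."""
    return [
        f"{gene}\t{pathway}"
        for gene, pathways in pathways_by_gene.items()
        for pathway in pathways
    ]


def run_pipeline(steps, data):
    """Run the steps in order and record which ones ran."""
    provenance = []
    for name, step in steps:
        data = step(data)
        provenance.append(name)
    return data, provenance


STEPS = [
    ("Extract Entrez ids", extract_entrez_ids),
    ("Map to KEGG gene ids", map_to_kegg_ids),
    ("Find KEGG Pathways", find_pathways),
    ("Re-format results", reformat_results),
]
ROWS = [{"entrez_id": "1001"}, {"entrez_id": "1002"}, {"entrez_id": "1003"}]
results, provenance = run_pipeline(STEPS, ROWS)
```

Keeping each step a separate function is what makes a workflow reusable: any step can be swapped for a web service call without touching the others.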

Cytoscape Workflow

Cytoscape is an open source software platform for integrating, visualizing, and analyzing measurement data in the context of networks.
Cytoscape is a collaboration between: University of California, San Diego; Institute for Systems Biology; Memorial Sloan-Kettering Cancer Center; Institut Pasteur; Agilent Technologies; University of Toronto; Gladstone Institute for Cardiovascular Disease; University of California, San Francisco; Unilever; National Center for Integrative Biomedical Informatics.
Cytoscape is used for integrating, visualizing and analyzing large datasets in the context of networks. Cytoscape development is carried out by a broad, international collaboration (this is one of the advantages of open source projects). Another advantage is that it’s free.
Free from http://www.cytoscape.org. 60,000+ downloads for the 2.x release; 27,000 downloads in the last year; 2,300/month. 340+ published articles citing Cytoscape; 135 articles in the last year. 50+ registered plugins, developed by leading research groups.

Applications of Networks in Disease
Identification of disease subnetworks: identification of subnetworks that are transcriptionally active in disease.
Subnetwork-based diagnosis: source of biomarkers for disease classification; identify interconnected genes whose aggregate expression levels are predictive of disease state.
Network-based gene association: map common pathway mechanisms affected by a collection of genotypes (SNP, CNV).
Tools shown: Agilent Literature Search; PinnacleZ, UCSD; Mondrian, MSKCC.
A major theme of Sage Commons is the use of networks in the study of disease, and Cytoscape is already there, ready for more network content. Here are three different examples of disease-related network analyses using Cytoscape: (1) identifying transcriptionally active subnetworks from the literature on atherosclerosis, (2) using subnetworks to classify tumors and (3) using networks to characterize expression data from glioblastomas.

Core Concepts of Cytoscape
Networks: nodes and edges, representing interactions between genes, proteins and metabolites.
Attributes (data and annotations): experimental data, measurements, and annotations relating to biological entities.
Visual Mapping: mapping attributes to visual properties of the network for data visualization.
Plugins: extensions to core functionality for custom integration, visualization and analysis.
Cytoscape works with networks, which are collections of nodes and edges, and with attributes. Attributes are experimental data or annotations associated with the nodes or edges. With these two things on board, you can then use Cytoscape to map the attribute values to the visual properties of the network (like node color). Or you can perform analyses, computing on the attributes mapped onto the network model. There is a large collection of plugins to customize both data visualization and analysis in Cytoscape for all kinds of biological applications…
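
The visual mapping idea can be shown in a few lines: take a numeric node attribute and turn it into a color by linear interpolation between two endpoint colors. This is a sketch of the concept only, not Cytoscape’s API; the function name `continuous_color_mapping`, the blue-to-red default and the toy expression values are assumptions for illustration.

```python
def continuous_color_mapping(value, low, high,
                             low_rgb=(0, 0, 255), high_rgb=(255, 0, 0)):
    """Map value in [low, high] to a hex color between low_rgb and high_rgb.

    Values outside the range are clamped to the nearest endpoint.
    """
    if high == low:
        raise ValueError("low and high must differ")
    fraction = (value - low) / (high - low)
    fraction = max(0.0, min(1.0, fraction))  # clamp to [0, 1]
    rgb = tuple(
        round(start + fraction * (end - start))
        for start, end in zip(low_rgb, high_rgb)
    )
    return "#{:02x}{:02x}{:02x}".format(*rgb)


# Toy node attribute: expression value per gene (made-up numbers).
expression = {"geneA": 0.0, "geneB": 1.0, "geneC": 2.5}
node_colors = {
    gene: continuous_color_mapping(value, low=0.0, high=1.0)
    for gene, value in expression.items()
}
```

The network itself is untouched; only the rendering of each node changes, which is why the same network can be re-colored for different datasets.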

Cytoscape Workflow
Data & Annotations: from experiment to spreadsheet or database, and then into Cytoscape.
Networks & Pathways: from interaction databases, literature findings, or pathway resources.
Visualization: mapping attributes to visual properties of the network for data visualization.
Analysis: computing on networked data using an extensible system of plugins.
Publication: prepare and export quality images in a variety of formats, including vector graphics.
A generic workflow using Cytoscape might start with experimental data in a spreadsheet, for example, which you can import into Cytoscape. Then you would load a network from a resource like Sage Commons and visually map your data onto the network. You could then use any number of plugins to perform analysis or customize views on the data. Finally, you would export tables and figures ready for publication (though a publication doesn’t happen every time you use Cytoscape). Here, I’m just going to demo Cytoscape plugins that query and import Sage Commons networks and a KDA analysis plugin which can be applied directly within Cytoscape.

Cytoscape Plugin
Diagram labels: Open API; Web interface; Cytoscape plugin.
The Cytoscape plugin for network import connects to the same Alitora database used by the web interface that Marc demoed earlier. That allows you to do cool things like this…

Connecting to Your Memory If you are logged in to the Alitora System, you can visualize Sage Commons networks and any other information from your persistent Alitora memory. Click on these “memories” to load them into Cytoscape as networks of nodes, edges and associated attributes.

The plugin gives you access to all Sage Commons networks, connected together by a common semantic schema and connected to important public resources such as PubMed, OMIM, Entrez Gene and Gene Ontology. The plugin will also import network, node and edge attributes, which can then be mapped to visual properties in Cytoscape. Once in Cytoscape, Sage networks and associated attributes can be used by other Cytoscape plugins for visualization and analysis. For example, the Key Driver Analysis plugin developed by Sage…

KDA Plugin

SCF/SWAN Tim Clark Instructor in Neurology, Harvard Medical School Director of Informatics, MassGeneral Institute for Neurodegenerative Disease Core Member, Harvard Initiative in Innovative Computing

Bio2RDF
Michel Dumontier, Associate Professor, Department of Biology, School of Computer Science, Institute of Biochemistry, Carleton University, Canada

Implications for Sage infrastructure
Lessons learned: 1. Standard network and gene list file formats are critical to the success of infrastructure tools. 2. Current dataset and network repositories fall short of providing a community resource with adequate standards and extensible tools.
Challenges ahead: 1. Preparing for increasing scale and scope of data. 2. Preparing for future data types and analyses.
Slide labels: Formats; Services; Identifiers; Map to standards; Appropriate interfaces.

[Diagram labels] Domain Semantics; Semantics; Syntax; Information models; Ontologies; Custom Data Objects; Configuration; Invocation model; Interface; Data format; Data identity. Need they be just data objects?

Keep It Simple. Open Source.

Web 2.0 Development Patterns
The Long Tail: Leverage scientist self-service to reach out to the long tail.
Users Add Value: Involve colleagues and other scientists, both implicitly and explicitly, in adding value to your application.
Network Effects by Default: Set inclusive defaults for aggregating user data as a side-effect of their use of the application.
Perpetual Beta: Don't package up new features into monolithic releases. Add them on a regular basis as part of the normal user experience.
Cooperate, Don't Control: Design for mashups. Offer web services interfaces and content syndication, and re-use the services of others.
Some Rights Reserved: Benefits come from collective adoption. Make sure that barriers to adoption are low. Follow existing standards. Use licenses with as few restrictions as possible. Design for "hackability" and "remixability."
Data is the Next Intel Inside: Applications are increasingly data-driven. For competitive advantage, seek to own a unique, hard-to-recreate source of data – workflows are data and data sources.
Software Above the Level of a Single Device: Design your application from the get-go to integrate and launch services across any interface.
Notes: In his book, A Pattern Language, Christopher Alexander prescribes a format for the concise description of the solution to architectural problems. He writes: "Each pattern describes a problem that occurs over and over again in our environment, and then describes the core of the solution to that problem, in such a way that you can use this solution a million times over, without ever doing it the same way twice."
The Long Tail: Small sites make up the bulk of the internet's content; narrow niches make up the bulk of the internet's possible applications. Therefore: Leverage customer self-service and algorithmic data management to reach out to the entire web, to the edges and not just the center, to the long tail and not just the head.
Data is the Next Intel Inside: Applications are increasingly data-driven. Therefore: For competitive advantage, seek to own a unique, hard-to-recreate source of data.
Users Add Value: The key to competitive advantage in internet applications is the extent to which users add their own data to that which you provide. Therefore: Don't restrict your "architecture of participation" to software development. Involve your users both implicitly and explicitly in adding value to your application.
Network Effects by Default: Only a small percentage of users will go to the trouble of adding value to your application. Therefore: Set inclusive defaults for aggregating user data as a side-effect of their use of the application.
Some Rights Reserved: Intellectual property protection limits re-use and prevents experimentation. Therefore: When benefits come from collective adoption, not private restriction, make sure that barriers to adoption are low. Follow existing standards, and use licenses with as few restrictions as possible. Design for "hackability" and "remixability."
The Perpetual Beta: When devices and programs are connected to the internet, applications are no longer software artifacts; they are ongoing services. Therefore: Don't package up new features into monolithic releases, but instead add them on a regular basis as part of the normal user experience. Engage your users as real-time testers, and instrument the service so that you know how people use the new features.
Cooperate, Don't Control: Web 2.0 applications are built of a network of cooperating data services. Therefore: Offer web services interfaces and content syndication, and re-use the data services of others. Support lightweight programming models that allow for loosely-coupled systems.
Software Above the Level of a Single Device: The PC is no longer the only access device for internet applications, and applications that are limited to a single device are less valuable than those that are connected. Therefore: Design your application from the get-go to integrate services across handheld devices, PCs, and internet servers.
Adapted from Tim O’Reilly’s Web 2.0, 2005

This afternoon
Drill down into demos and experiences
Guests
Tim Clark – SWAN, Web 3.0, neurodegeneration
Michel Dumontier – Bio2RDF
Audience participation!
Opportunities, Barriers and Incentives
Platforms, datasets, services and tools
Technologies and Standards
Directions for Sage Bionetworks

Questions for Afternoon
Are there specific gene list and network model databases, tools and platforms that we want to integrate with the Sage data? E.g. MSigDB gene lists.
What form of integrated analysis would be most useful for finding new biological insights using the Sage models and KDA? E.g. would we like to be able to create lists of mutations from TCGA to use as inputs to KDA and the Sage models?
What model annotations are necessary to make this useful – context?

Questions for Afternoon
Provenance: what is needed at Sage to ensure provenance of network models is preserved for future reference? E.g. do models need unique, persistent, referenceable identifiers? Will they be versioned? If models change due to new data or updated algorithms, how can we easily rerun analyses?
What privacy software do we need, and what existing software could we leverage? Will SageCommons need to be ‘replicable’ at other sites to support privacy, e.g. for Pharma and Biotech who do not want their use of the models to be potentially snooped on the ‘net?
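
One possible answer to the identifier and versioning questions, sketched under assumptions: derive the identifier from a canonical serialization of the model's edges, so that the same network always receives the same fingerprint and any change produces a new one. The `sage-model` prefix, the record fields, and the function names are hypothetical; they do not describe an existing Sage scheme.

```python
import hashlib
import json


def model_fingerprint(edges):
    """Hash a network independently of the order in which edges are listed."""
    # Sorting the edges gives one canonical serialization per network.
    canonical = json.dumps(sorted(list(edge) for edge in edges),
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def provenance_record(name, version, edges, derived_from=None):
    """Build a small record linking a model version to its predecessor."""
    digest = model_fingerprint(edges)
    return {
        # Hypothetical identifier layout: prefix, name, version, short hash.
        "id": f"sage-model:{name}:v{version}:{digest[:12]}",
        "sha256": digest,
        "derived_from": derived_from,
    }


# Hypothetical example: a model is updated with one new edge.
v1_edges = [("TP53", "pp", "MDM2")]
v2_edges = v1_edges + [("MDM2", "pp", "MDM4")]

v1 = provenance_record("demo", 1, v1_edges)
v2 = provenance_record("demo", 2, v2_edges, derived_from=v1["id"])

print(v1["id"])
print(v2["id"], "derived from", v2["derived_from"])
```

A content-derived fingerprint lets an analysis record exactly which model it ran against, which is a precondition for rerunning it after the model changes.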

Audit of Tools

Systems Biology and myGrid
SysMO-SEEK: e-Laboratory for interlinking and sharing data, models, SOPs and workflows for Systems Biology in Europe. ISA-TAB & SBML/MIRIAM compliant. http://www.sysmo-db.org/
ONDEX: Network-based analysis environment for Systems Biology. Uses Taverna workflows and text mining. http://www.ondex.org/