Wrapping third-party analytical services for caBIG
Taverna-caBIG project
Stian Soiland-Reyes, Alexandra Nenadic
University of Manchester, UK

Presentation transcript:

Wrapping third-party analytical services for caBIG
Taverna-caBIG project
Stian Soiland-Reyes, Alexandra Nenadic
University of Manchester, UK
September 2009

Agenda
- Project overview
- Primary goals
- Service selection
- Why these services?
- Why wrapping?
- Wrapping benefits?
- How we did it
- How does it work
- Architecture
- UML models
- Example client and outputs
- Project info

Project overview
Taverna-caBIG cooperation on several levels:
1. caGrid-enabling third-party analytical services
2. Taverna Workbench enhancements for:
   - Semantic search of caBIG services
   - Invocation of caBIG services from Taverna workflows
   - Support for secure caBIG services (interacting with the GAARDS infrastructure prior to service invocation)
This presentation addresses the caGrid-enablement of third-party analytical services (wrapping and achieving Silver Level compatibility).

Primary goals
- Identify two publicly available analytical services currently accessible through Taverna
- Wrap, i.e. caGrid-enable, the services:
  - Design the wrapper services in UML and semantically describe/annotate them using caBIG's tooling (EA + SIW)
  - Wrap/implement and deploy them as standard caBIG services on caGrid (Introduce)

Analytical service selection
Services were selected in collaboration with the caBIG Workflow Working Group, led by Juli Klemm.
Winners:
- NCBI BLAST service hosted by EBI (European Bioinformatics Institute): protein and nucleotide sequence similarity search service
- InterProScan service hosted by EBI: scans a protein sequence against the range of protein signatures in the InterPro warehouse

Why these services?
- Freely available
- Highly reliable, hosted by EBI
- Widely used by the scientific community
- Can be combined with existing caBIG tools (caBIO, GridPIR, etc.) in biologically meaningful workflows

NCBI BLAST service
- A popular sequence similarity search tool using local sequence alignment
- Supports protein, DNA and RNA sequences
- Searches sequences in a whole range of databases: UniProt, NCBI, EMBL, etc.
- SOAP web service hosted by EMBL-EBI

InterProScan service
- The InterPro warehouse integrates various databases of protein domains and functional sites
- Searches the InterPro warehouse using protein signature recognition methods, e.g. blastprodom, gene3d, hmmpfam, hmmsmart, scanregexp, profilescan, ...
- SOAP web service hosted by EMBL-EBI

Why wrap the services?
- The original services use various data formats for inputs/outputs (although all XML)
  - They do not conform to the caBIG compatibility rules
  - The output format was not even compatible with the input format
- The requirements for the wrapped service:
  - Translate the input data from caBIG-compatible XML to the XML format understood by the analytical services
  - Convert the received results back to a format understood by caBIG clients

NCBI BLAST Output (Untranslated)
<EBIApplicationResult xmlns=" xmlns:xsi=" xsi:noNamespaceSchemaLocation="
<hit number="1" database="uniprot" id="WAP_RAT" ac="P01174" length="137" description="Whey acidic protein OS=Rattus norvegicus GN=Wap PE=1 SV=2">
e
MRCSISLVLGLLALEVALARNLQ EHVFNSVQSMCSDDSFSEDTECINCQTNEECAQNDMCCPSSCGRS CKTPVNIEVQKAGRCPWNPIQMIAAGPCPKDNPCSIDSDCSGTMKCCKNGCIMSCMDPEPKSPTVISFQ
MRCSISLVLGLLALEVALARNLQEHVFNSVQSMCSDDS FSEDTECINCQTNEECAQNDMCCPSSC GRSCKTPVNIEVQKAGRCPWNPIQMIAAGPCPKDNPCSIDSDCSGTMKCCKNGCIMSCMDPEPKSPTVISFQ
MRCSISLVLGLLALEVALARNLQ EHVFNSVQSMCSDDSFSEDTECINCQTNEECAQNDMCCPSSCGRSCKT PVNIEVQKAGRCPWNPIQMIAAGPCPKDNPCSIDSDCSGTMKCCKNGCIMSCMDPEPKSPTVISFQ
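As an illustration of the first translation step, the attributes of a `hit` element like the one above could be read with the JDK's DOM parser before being mapped onto caBIG-compatible objects. This is only a sketch: `HitSketch` and `Hit` are hypothetical names and are not the wrapper's generated classes, and it assumes every `hit` carries a numeric `length` attribute.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

class HitSketch {
    // Hypothetical holder for the attributes of one <hit> element.
    static final class Hit {
        final String database;
        final String id;
        final String accession;
        final int length;
        final String description;

        Hit(String database, String id, String accession, int length, String description) {
            this.database = database;
            this.id = id;
            this.accession = accession;
            this.length = length;
            this.description = description;
        }
    }

    // Reads every <hit> element in the document and copies its attributes.
    static List<Hit> parseHits(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("hit");
            List<Hit> hits = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                hits.add(new Hit(
                        e.getAttribute("database"),
                        e.getAttribute("id"),
                        e.getAttribute("ac"),
                        Integer.parseInt(e.getAttribute("length")), // assumed present and numeric
                        e.getAttribute("description")));
            }
            return hits;
        } catch (Exception ex) {
            throw new IllegalArgumentException("Could not parse BLAST result XML", ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<result><hit number=\"1\" database=\"uniprot\" id=\"WAP_RAT\" "
                + "ac=\"P01174\" length=\"137\" description=\"Whey acidic protein\"/></result>";
        for (Hit hit : parseHits(xml)) {
            System.out.println(hit.database + ":" + hit.id + " length=" + hit.length);
        }
    }
}
```

The real wrapper would map these values onto the semantically annotated classes from the UML model rather than an ad hoc holder class.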

InterProScan Output (Untranslated)
instancehttp://
<interpro id="IPR008197" name="Whey acidic protein, 4-disulphide core" type="Domain" parent_id="IPR015874">
Molecular Function protease inhibitor activity
<match id="G3DSA: " name="Whey_acidic_protein_4-diS_core" dbname="GENE3D">
<location start="77" end="128" score=" E-5" status="T" evidence="Gene3D" />
<location start="30" end="72" score=" E-5" status="T" evidence="HMMPfam" />
<location start="79" end="126" score=" E-14" status="T" evidence="HMMPfam" />
<interpro id="IPR008198" name="Proteinase inhibitor I17" type="Domain" parent_id="IPR008197">...

Motivational workflow
This Taverna workflow uses both BLAST and InterProScan, which can be replaced with the wrapped versions of the services.
Workflow diagram labels:
- Nested workflow that internally invokes InterProScan and checks job status before fetching results
- Nested workflow that internally invokes NCBI BLAST and checks job status before fetching results
- Web service that looks up protein sequences in a database; will be replaced with the caBIG service caBIO
- Shim that splits a string into a list of FASTA strings
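The shim mentioned on this slide is not shown in the deck. A minimal sketch of what such a splitter might look like is given below, assuming that every record starts with a `>` header line. `FastaSplitSketch` and `splitFasta` are hypothetical names, not the shim actually used in the workflow.

```java
import java.util.ArrayList;
import java.util.List;

class FastaSplitSketch {
    // Splits a multi-record FASTA string into one string per record.
    // A new record starts at each line beginning with '>'; blank lines are dropped.
    static List<String> splitFasta(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : text.split("\\r?\\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            if (trimmed.startsWith(">") && current.length() > 0) {
                // A header closes the previous record.
                records.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append('\n');
            }
            current.append(trimmed);
        }
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String input = ">seq1\nMRCSISLV\nLGLLALEV\n\n>seq2\nMRCLISLV\n";
        for (String record : splitFasta(input)) {
            System.out.println(record);
            System.out.println("--");
        }
    }
}
```

Each element of the returned list can then be submitted as a separate query, which is what lets the workflow iterate over the sequences.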

Benefits of wrapped services
- Makes analytical services from other service providers available to caBIG users
- Wrapped services are caBIG Silver Level compatible:
  - Ensures shared meaning and interoperability between these and other caBIG services
  - Data can be exchanged and understood between services

How we wrapped the services (1)
Making the services 'silver' involved:
1. Modelling the data in UML using Enterprise Architect (EA)
2. Exporting the model from EA to XMI
3. Semantically annotating the XMI file with caBIG's vocabularies/ontologies using the SIW tool
4. Generating Common Data Elements (CDEs) for the service inputs/outputs, which were reviewed by the curation team and loaded into the caDSR production database
5. Loading the annotated XMI back into EA to update the UML model

How we wrapped the services (2)
6. From EA, the UML model was exported to a set of XSD files
7. The XSD files were imported into the Introduce tool, which was used to generate the skeleton APIs of the wrapped services
8. Axis 2 was used to invoke the original InterProScan and NCBI BLAST services from the wrapper services
9. The wrapped services are asynchronous; job status and results are available as WSRF resource properties and can be subscribed to using WS-Notification. There is also a synchronous version where polling is done from the client side.
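The synchronous variant in step 9 amounts to a client-side polling loop with a timeout. The sketch below shows the general shape of such a loop under stated assumptions: `PollingSketch`, `JobStatus` and `pollUntilFinished` are hypothetical names, the status values are illustrative, and the status check is passed in as a function rather than being a real call to the wrapped service.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Supplier;

class PollingSketch {
    // Illustrative status values; the real service may report different ones.
    enum JobStatus { PENDING, RUNNING, DONE, ERROR }

    // Repeatedly asks for the job status until it is DONE or ERROR,
    // sleeping between checks and giving up after timeoutMillis.
    static JobStatus pollUntilFinished(Supplier<JobStatus> checkStatus,
                                       long timeoutMillis, long intervalMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        while (true) {
            JobStatus status = checkStatus.get();
            if (status == JobStatus.DONE || status == JobStatus.ERROR) {
                return status;
            }
            if (System.currentTimeMillis() >= deadline) {
                throw new IllegalStateException(
                        "Job did not finish within " + timeoutMillis + " ms");
            }
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for job", e);
            }
        }
    }

    public static void main(String[] args) {
        // Simulated sequence of status replies standing in for the remote service.
        Iterator<JobStatus> simulated = Arrays.asList(
                JobStatus.PENDING, JobStatus.RUNNING, JobStatus.DONE).iterator();
        System.out.println(pollUntilFinished(simulated::next, 5000, 10)); // prints DONE
    }
}
```

Polling keeps the client simple and avoids the need for a local notification listener, at the cost of extra round trips; the asynchronous variant avoids those by subscribing to the job resource instead.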

How it works
1. Client: using the client library, calls the wrapped WSRF web service
2. Service: converts the input to the original format, submits the converted input to the original service, and returns a Job Resource that references the jobID
3. Client: subscribes to notifications from the Job Resource
4. Job Monitor (server): for all jobs, checks the status using the jobID and notifies the client on completion
5. Client library: requests the output data
6. Job Resource: converts the data from the original format and returns the converted data to the client
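The Job Monitor role described above can be sketched as a registry of subscribed jobs that is checked periodically. This is an in-memory, single-threaded illustration only: `JobMonitorSketch` and its methods are hypothetical, the completion check is a plain function standing in for the call to the original service, and the callback stands in for a WS-Notification message.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;
import java.util.function.Predicate;

class JobMonitorSketch {
    // Stand-in for asking the original service whether a jobID has finished.
    private final Predicate<String> isFinished;
    // jobID -> callback to invoke once the job has completed.
    private final Map<String, Consumer<String>> subscribers = new LinkedHashMap<>();

    JobMonitorSketch(Predicate<String> isFinished) {
        this.isFinished = isFinished;
    }

    void subscribe(String jobId, Consumer<String> onCompletion) {
        subscribers.put(jobId, onCompletion);
    }

    // Checks every subscribed job once; notifies and forgets those that finished.
    // Returns the number of notifications sent in this pass.
    int checkAll() {
        int notified = 0;
        Iterator<Map.Entry<String, Consumer<String>>> it = subscribers.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Consumer<String>> entry = it.next();
            if (isFinished.test(entry.getKey())) {
                it.remove(); // each job is notified at most once
                entry.getValue().accept(entry.getKey());
                notified++;
            }
        }
        return notified;
    }

    public static void main(String[] args) {
        Set<String> finishedJobs = new HashSet<>();
        JobMonitorSketch monitor = new JobMonitorSketch(finishedJobs::contains);
        monitor.subscribe("job-1", id -> System.out.println("Completed: " + id));
        monitor.checkAll();          // nothing has finished yet, so no notification
        finishedJobs.add("job-1");   // pretend the original service finished the job
        monitor.checkAll();          // prints "Completed: job-1"
    }
}
```

In the deployed services, `checkAll` would correspond to a scheduled server-side task, and the notification would update the WSRF resource properties that the client has subscribed to.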

Architecture of wrapped services

UML model of wrapped NCBI BLAST

UML model of wrapped InterProScan

Reused several data elements
- Green classes in the diagram are reused from IRWG: Sequence, NucleicAcidSequence, DatabaseCrossReference, GeneGenomicIdentifier, et al.
- Red UML classes in the diagram are reused from PIR: ProteinSequence
- Partial reuse of attributes in ProteinDomainLocation

Example client NCBI Blast

    // Connect to the wrapped NCBI BLAST service.
    NCBIBlastClient client = new NCBIBlastClient(url);
    NCBIBlastInput input = new NCBIBlastInput();

    // Identify the query protein by database cross-reference.
    ProteinSequenceRepresentation sequenceRepresentation = new ProteinSequenceRepresentation();
    ProteinGenomicIdentifier proteinId = new ProteinGenomicIdentifier();
    proteinId.setDataSourceName("uniprot");
    proteinId.setCrossReferenceId("wap_rat");
    sequenceRepresentation.setProteinId(proteinId);
    input.setSequenceRepresentation(sequenceRepresentation);

    // Search UniProt with BLASTP.
    NCBIBlastInputParameters params = new NCBIBlastInputParameters();
    params.setQueryDatabase(new MolecularSequenceDatabase("", "uniprot"));
    params.setBlastProgram(BLASTProgram.BLASTP);
    input.setNcbiBLASTInputParameters(params);

    // Call synchronously; the client utilities poll until the timeout.
    NCBIBlastClientUtils clientUtils = new NCBIBlastClientUtils(client);
    NCBIBlastOutput ncbiBlastOut = clientUtils.ncbiBlastSync(input, TIMEOUT_SECONDS * 1000);

    // Print the query fragment of every alignment.
    SequenceSimilarity[] similarities = ncbiBlastOut.getSequenceSimilarities();
    for (SequenceSimilarity similarity : similarities) {
        for (Alignment align : similarity.getAlignments()) {
            SequenceFragment querySequenceFragment = align.getQuerySequenceFragment();
            System.out.print("Q: " + querySequenceFragment.getSequence().getValue());
            (..)
        }
    }

Slide labels: data, id

Example SOAP input NCBI Blast
BLASTP uniprot wap_rat uniprot
Slide labels: data, reused, id

Example client output NCBI Blast
Running NCBI Blast client
uk.org.mygrid.cagrid.servicewrapper.service.ncbiblast.example.ExampleNCBIBlastClient -url
-- Using default service at
Calling NCBI Blast synchronously
(Set -DGLOBUS_LOCATION=/Users/bob/cagrid/ws-core to do asynchronous client calls)
Found 50 similarities
Similarity in uniprot:WAP_RAT (sequence length:137)
1 alignments
Alignment score=763.0 bits=298.0 eValue=1.0E-79
Q: MRCSISLVLGLLALEVALARNLQEHVFNSVQSMCSDDSFSEDTECINCQTNEECAQNDMCCPSSCGRSCKTPVNIEVQKAGRCPWNPIQMIA AGPCPKDNPCSIDSDCSGTMKCCKNGCIMSCMDPEPKSPTVISFQ
P: MRCSISLVLGLLALEVALARNLQEHVFNSVQSMCSDDSFSEDTECINCQTNEECAQNDMCCPSSCGRSCKTPVNIEVQKAGRCPWNPIQMIA AGPCPKDNPCSIDSDCSGTMKCCKNGCIMSCMDPEPKSPTVISFQ
M: MRCSISLVLGLLALEVALARNLQEHVFNSVQSMCSDDSFSEDTECINCQTNEECAQNDMCCPSSCGRSCKTPVNIEVQKAGRCPWNPIQMIA AGPCPKDNPCSIDSDCSGTMKCCKNGCIMSCMDPEPKSPTVISFQ
Similarity in uniprot:Q3UQ94_MOUSE (sequence length:140)
1 alignments
Alignment score=465.0 bits=183.0 eValue=4.0E-45
Q: MRCSISLVLGLLALEVALARNLQEHVFNSVQSMCSDDSFSEDTECINCQTNEECAQNDMCCPSSCGRSCKTPVNIEVQKAGRCPWNPIQMIA A-GPCPKDNPCSIDSDCSGTMKCCKNGCIMSCMDPEPKSPTVI
P: MRC ISLVLGLLALEVALA+NL+E VFNSVQSM S E TECI CQTNEECAQN MCCP SCGR+ KTPVNI V KAG CPWN +QMI+ + GPCP CS D +CSG MKCC C+M+C P P+ ++I
M: MRCLISLVLGLLALEVALAQNLEEQVFNSVQSMFPKASPIEGTECIICQTNEECAQNAMCCPGSCGRTRKTPVNIGVPKAGFCPWNLLQMIS STGPCPMKIECSSDRECSGNMKCCNVDCVMTCTPPVPEVWSII
Slide labels: data, id

Example SOAP output NCBI Blast
P01174 Whey acidic protein OS=Rattus norvegicus GN=Wap PE=1 SV=2 137 WAP_RAT uniprot E MRCSISLVLGLLALEVAL..ISFQ MRCSISLVLGLLALEVAL..ISFQ MRCSISLVLGLLALEVAL..ISFQ
Slide labels: data, reused, id

Project info
- On gForge:
- On myGrid wiki:
- Source and documentation available via Subversion: