

GGF Summer School, 24th July 2004, Italy
Part 3: Integrating Services
- Life Science Identifiers & Information model
- Data and Metadata management – the MIR
- Domain Services – Native, Soaplab and Gowlab
- Taverna/Freefluo Workbench and Workflow Enactor
Professor Carole Goble, University of Manchester, UK


20,000 feet
[Architecture diagram. Components shown: Taverna Workbench; Freefluo Workflow Engine; LSID Authority; UDDI; mIR data; mIR metadata store; Provenance and Data browser (Haystack or Portal); Portal; View Service; Service Semantic Discovery & Registration; Event Notification Service; Web services, local tools, user interaction etc.]

Information Model v2
- myGrid components form a loosely coupled system
- An Information Model for e-Science experiments
- Based on the CCLRC scientific metadata model
- XML messages between services conform to the IMv2
- Diagram labels: Domain specific, Domain neutral
Reference: Nick Sharman, Nedim Alpdemir, Justin Ferris, Mark Greenwood, Peter Li, Chris Wroe, The myGrid Information Model, Proc. UK e-Science 2nd All Hands Meeting, Nottingham, UK, 1-3 Sept 2004.

Information Model v2 (continued)
Diagram labels: Domain specific, Domain neutral
Areas covered by the model:
- Resources and IDs
- People, teams and organizations
- Scientific data and the Life Science Identifier
- Types, Identifier Types, Values and Documents
- Provenance information
- Annotation and Argumentation
- Molecular Biology
- Bioinformatics
- e-Science process, experimental methods

Layered Semantics
- Domain semantics layered on top of a domain-neutral but scientific data model
- Reducing the activation energy, lowering barriers of entry
[Layer diagram. Labels: Workflow metadata; Provenance metadata; Service metadata; Data metadata; Syntax; Domain Semantics; Ontologies; Workflow; OGSA-DQP; Format; XSD types; MIME types; Experiment Semantics; IMv2]

Experimental entities

View over the MIR

Life Science IDs
- Each database on the web has:
  - Different policies for assigning and maintaining identifiers, dealing with versioning etc.
  - Different mechanisms for retrieving an item given an ID
- Life Science IDs are designed to harmonise the retrieval of data
- Emerging standard for bioinformatics
  - I3C, OMG Life Sciences Group, W3C
- Defines:
  - URN for life science resources
  - SOAP (and other) interfaces for LSID assignment, LSID resolution & resolution discovery services
Reference: T. Clark, S. Martin & T. Liefeld, Globally distributed object identification for biological knowledge bases, Briefings in Bioinformatics, Vol 5, No 1, pp 59-70, March 2004.

What is an LSID?
urn:lsid:AuthorityID:NamespaceID:ObjectID[:RevisionID]
Examples:
  urn:lsid:ncbi.nlm.nih.gov:GenBank:T48601:2
  urn:lsid:ebi.ac.uk:SWISS-PROT.accession:P34355:3
  urn:lsid:rcsb.org:PDB:1D4X:22
- LSID Designator: a mandatory preface that notes that the item being identified is a life science-specific resource
- Authority Identifier: an Internet domain owned by the organization that assigns an LSID to a resource
- Namespace Identifier: the name of the resource (e.g., a database) chosen by the assigning organization
- Object Identifier: the unique name of an item (e.g., a gene name or a publication tracking number) as defined within the context of a given database
- Revision Identifier: an optional parameter to keep track of different versions of the same item
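The field layout above can be illustrated with a minimal parsing sketch. The `LSID` tuple and `parse_lsid` function are hypothetical names chosen for this example; they are not part of any LSID toolkit, and the check is deliberately looser than the full specification.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    """The four variable fields of an LSID (names are illustrative)."""
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(urn: str) -> LSID:
    """Split an LSID URN into its colon-separated fields."""
    parts = urn.split(":")
    # Expected fields: urn, lsid, authority, namespace, object, [revision].
    if len(parts) not in (5, 6):
        raise ValueError(f"expected 5 or 6 colon-separated fields: {urn!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID URN: {urn!r}")
    authority, namespace, object_id = parts[2:5]
    if not (authority and namespace and object_id):
        raise ValueError(f"empty mandatory field in: {urn!r}")
    # The revision is optional; treat a missing or empty field as absent.
    revision = parts[5] if len(parts) == 6 and parts[5] else None
    return LSID(authority, namespace, object_id, revision)
```

For instance, `parse_lsid("urn:lsid:rcsb.org:PDB:1D4X:22")` yields the authority `rcsb.org`, namespace `PDB`, object `1D4X` and revision `22`.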

LSID Properties
- Unique authority for each identifier
- Multiple resolution services, supporting:
  - Data retrieval: data is immutable, so the data returned for a given LSID must always be the same (which allows caches)
  - Metadata retrieval: mutable and resolver-specific (which allows annotation services). More on this in Part 4
- Resolution discovery service
  - Implemented over DNS/DDNS (optional)
- Authority commitment: must always maintain an authority at e.g. pdb.org that can point to data and metadata resolvers

How is data retrieved?
[Sequence diagram. Actors: Application; LSID client; PDB authority at pdb.org; PDB data resolver; PDB metadata resolver; PDB database]
1. Application to LSID client: get me info for urn:lsid:pdb.org:1AFT
2. LSID client to the pdb.org authority: where can I get data and metadata for urn:lsid:pdb.org:1AFT?
3. LSID client to the PDB data and metadata resolvers: get me the data and metadata for urn:lsid:pdb.org:1AFT
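The retrieval steps can be sketched as a toy client. This is an illustration only: `LsidClient` and the in-memory authority table are hypothetical stand-ins for a real LSID client and for resolution discovery, and resolvers are plain functions rather than SOAP services. It also shows the caching consequence of immutable data and mutable metadata.

```python
from typing import Callable, Dict, Tuple

Resolver = Callable[[str], str]


class LsidClient:
    """Toy LSID client: authority lookup, then data or metadata resolution."""

    def __init__(self, authorities: Dict[str, Tuple[Resolver, Resolver]]) -> None:
        # Hypothetical stand-in for resolution discovery:
        # authority id -> (data resolver, metadata resolver).
        self._authorities = authorities
        self._data_cache: Dict[str, str] = {}

    def _locate(self, lsid: str) -> Tuple[Resolver, Resolver]:
        # Ask the authority named in the LSID where its resolvers are.
        authority = lsid.split(":")[2]
        return self._authorities[authority]

    def get_data(self, lsid: str) -> str:
        # Data for an LSID never changes, so a cached copy stays valid.
        if lsid not in self._data_cache:
            data_resolver, _ = self._locate(lsid)
            self._data_cache[lsid] = data_resolver(lsid)
        return self._data_cache[lsid]

    def get_metadata(self, lsid: str) -> str:
        # Metadata is mutable, so it is fetched again on every call.
        _, metadata_resolver = self._locate(lsid)
        return metadata_resolver(lsid)
```

Caching only the data path is the design point: a second `get_data` call for the same LSID never reaches the resolver, while `get_metadata` always does.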

LSID Components
- IBM built client and server implementations in Perl, Java, C++
- Straightforward to wrap an existing database as a source of data or metadata
- Client simple to use
- LSID Launchpad adds LSID resolution to Internet Explorer
- LSID-aware client applications, e.g. Haystack (see Part 4)

Use within myGrid
- Needed an identifier for our own experimental resources
  - Workflows, experiments, new data results etc.
- All and everything identified with LSIDs
- LSID saves us having to invent our own conventions and code
- Can pass references to data around and be reassured the other party will know how to resolve that reference
- Resolution services:
  - Data: myGrid Information Repository (MIR)
  - Metadata: myGrid Metadata Store (RDF-based)
- As a client:
  - Uniform access to myGrid and external resources
    - Retrieval
    - Annotation (see Part 4)

Information Access
[Diagram. Labels: Taverna/Freefluo; Publish interface (data, metadata); MIR Metadata Store (RDF); MIR Data store (XML); LSID interface; LSID-aware client; RDF-aware client (Query); XML-aware client (Query)]

LSID Assignment
[Diagram. Components: Enactor; Services; LSID Assigning Service; Store plug-in; Metadata plug-in; Metadata Store; mIR; Workflow design; User context; LSID Metadata Resolver; LSID Data Resolver; LSID Authority; Client application. Flows: Metadata; Data; LSIDs; Requests]
1. Data sent to / received from services
2. New LSIDs assigned to data
3. Data / metadata stored
4. Data and metadata retrieved


Information Storage
- The MIR data store
  - Stores experimental components
  - Workflow specs as XML Scufl docs
  - Data, XML notes
  - Types: XML docs, Relational
- The (MIR) metadata store
  - RDF using Jena 2.0
- Every entry has Dublin Core provenance attributes
- Every entry can have (multiple) ontology expressions
- Multiple mIRs

Metamodel for Types
- Necessary to identify the type and format of each datum of interest so that it can (only) be input to type-compatible viewers, services and workflows
- Can't fix this: working in an open world
- There are many established, de facto and locally preferred types & formats
- Defining common bio-types is a fool's errand

Intermediate Results

Results Management
- The Taverna/Freefluo WfEE is agnostic about the data flowing through it
- As objects progress through a workflow they are tagged with terms from ontologies, free text descriptions and MIME types, and they may contain arbitrary collection structures
- Using the metadata hints we can locate and launch pluggable view components
- One WBS workflow can produce ~130 files
- (Intermediate) results management and presentation is a major headache
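Locating a view component from metadata hints can be pictured with a small registry keyed by MIME type. This is a sketch under assumptions: `ViewerRegistry` is a hypothetical name, viewers are plain functions, and the real workbench plug-in mechanism is not described in these slides.

```python
from typing import Callable, Dict, Iterable

Viewer = Callable[[str], str]


class ViewerRegistry:
    """Maps MIME types to view components, with a generic fallback."""

    def __init__(self, fallback: Viewer) -> None:
        self._viewers: Dict[str, Viewer] = {}
        self._fallback = fallback

    def register(self, mime_type: str, viewer: Viewer) -> None:
        self._viewers[mime_type] = viewer

    def render(self, value: str, mime_types: Iterable[str]) -> str:
        # Use the first metadata hint that has a registered viewer.
        for mime_type in mime_types:
            viewer = self._viewers.get(mime_type)
            if viewer is not None:
                return viewer(value)
        # No hint matched: fall back to a generic view of the data.
        return self._fallback(value)
```

Because the engine stays agnostic about the data, all type knowledge lives in the registry, and an unknown type degrades to the fallback view instead of failing.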

Results Amplification
- One input, many outputs
- Automated annotation workflows produce lots of heterogeneous data
- The workflows changed how the scientist works
  - Before: analyse results as you go along
  - After: all results, all the analysis, in one go
- Intermediate results management and associated provenance management are essential
- Domain-specific visualisation


Domain Services
- Native WSDL Web services
  - DDBJ, NCBI BLAST, PathPort
- BioMOBY Web services
  - Single function stereotype
- Wrapped legacy services
  - Stateful interaction stereotype
  - One button wrapping
  - SoapLab for command-line tools
  - GowLab for screen scraped web pages
  - Leveraged the EMBOSS Suite and others
  - Circa 300 services
- For each application: CreateJob, Run, WaitFor, GetResults, Destroy
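The five-step job lifecycle listed for wrapped applications can be hidden behind a single call, which is one way to read the "stateful interaction stereotype". The sketch below uses a hypothetical in-memory `FakeAnalysisService`; its method names mirror the steps on the slide but it is not the Soaplab API, and its "analysis" is a placeholder.

```python
from typing import Dict


class FakeAnalysisService:
    """In-memory stand-in for a stateful, job-based analysis service."""

    def __init__(self) -> None:
        self.jobs: Dict[str, dict] = {}
        self._next_id = 0

    def create_job(self, inputs: dict) -> str:
        self._next_id += 1
        job_id = f"job-{self._next_id}"
        self.jobs[job_id] = {"inputs": inputs, "state": "created", "results": None}
        return job_id

    def run(self, job_id: str) -> None:
        job = self.jobs[job_id]
        # Placeholder analysis: upper-case the input sequence.
        job["results"] = {"report": job["inputs"]["sequence"].upper()}
        job["state"] = "completed"

    def wait_for(self, job_id: str) -> str:
        # The fake job finishes inside run(), so just report its state.
        return self.jobs[job_id]["state"]

    def get_results(self, job_id: str) -> dict:
        return self.jobs[job_id]["results"]

    def destroy(self, job_id: str) -> None:
        del self.jobs[job_id]


def run_analysis(service: FakeAnalysisService, inputs: dict) -> dict:
    """CreateJob, Run, WaitFor, GetResults, Destroy as one operation."""
    job_id = service.create_job(inputs)
    try:
        service.run(job_id)
        if service.wait_for(job_id) != "completed":
            raise RuntimeError(f"{job_id} did not complete")
        return service.get_results(job_id)
    finally:
        # Always release the server-side job, even on failure.
        service.destroy(job_id)
```

A workflow author then sees one conceptual operation, `run_analysis`, rather than five service calls and a job handle.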

Domain Services
- Lots of them: ~300
- Open world: we don't own them
- Many produce text, not numbers
- Many are unique, single site
- Need lots of genuine redundant replica services
- Unreliable and unstable
  - Research-level software
  - Reliant on other people's servers
- Services in the wild are rare: it takes significant time to wrap applications as web services (licensing, installation, maintenance)
- WSDL in the wild is poor
- Firewalls
- Licensing
  - Can't be used outside of licensing body
  - No license = access third-party web services
- Copyright
Domain Services in WBS: Repeatmasker, NCBI_BLAST, Modified BLAST, GenScan, PSORTII, iPSORT, TargetP, various EMBOSS services, InterProScan, BLAST2, NIX, TESS, TWINSCAN, Alibaba2, SignalScan, Promotorscan, SumoPlot, SignalP

Can you guess what it is yet?

SHIM Services
[Diagram. Labels: Main Bioinformatics Applications; Main Bioinformatics Services; SHIM Services]
- Explicitly capturing the process
- Unrecorded "steps" which aren't realised until attempting to build something
- Services that enable domain services to fit together
- "Experimentally neutral"
- Libraries of SHIMs
- Possible candidates for automatic selection, composition and substitution
- Reusable
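A shim in this sense is a small adapter that makes one service's output acceptable as another's input. The example below is an assumption chosen for illustration and is not taken from the slides: it strips FASTA header lines so that a sequence can be passed to a service expecting a bare residue string.

```python
def fasta_to_raw(fasta: str) -> str:
    """Shim: drop FASTA header lines and join the sequence lines."""
    lines = (line.strip() for line in fasta.splitlines())
    # Header lines start with '>'; blank lines carry no sequence data.
    return "".join(line for line in lines if line and not line.startswith(">"))


def sequence_length_service(raw_sequence: str) -> int:
    """Hypothetical downstream service that only accepts a bare sequence."""
    if not raw_sequence.isalpha():
        raise ValueError("expected a bare sequence of letters")
    return len(raw_sequence)
```

The shim changes only the representation, not the science, which is what makes it "experimentally neutral" and reusable between unrelated workflows: `sequence_length_service(fasta_to_raw(record))` works where passing the FASTA record directly would be rejected.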


Workflow development and enactment
- Freefluo workflow enactment engine
  - Processor & event observer plugin support
- Taverna development and execution environment
  - Workbench, workflow editor, tool plug-in support
- Simple conceptual unified flow language (XScufl) wraps up units of activity
  - More user friendly, more abstract, more directly in user terms
- "Tethered" programme: own open source development community

[Workbench screenshot. Callouts:]
- Tree structure explorer
- Graphical diagram
- Results in enactor invocation window
- Service palette shows a range of operations which can be used in the composition of a workflow

Workflow environment
- Taverna API acts as an intermediate layer between user level applications and workflow enactors such as FreeFluo
- Includes object models using a standard MVC design for both workflow definitions and data objects within a workflow
- Features:
  - Implicit iteration and data flow
  - Data sets and nested flows
  - Configurable failure handling
  - Life Science ID resolution
  - Plug-in framework
  - Event notification
  - Provenance and status reporting
  - Permissive type management
  - Graphical display
  - Data entry wizard

Scufl-Taverna-FreeFluo
- SCUFL: Simple Conceptual Unified Flow Language
- Started with WSFL …
- SCUFL provides a much higher level view on workflows, and is therefore simpler and more user-focused
- Simple: relies upon an inherently connected environment to reduce the quantity of information explicitly stated in the workflow definition
  - No port definitions in XScufl
  - Processor metadata intelligently gathered from underlying sources, i.e. WSDL, Soaplab
  - Allows optional typing information; can specify as little or as much as is available

Scufl
- Conceptual: one Processor in a SCUFL workflow maps as far as is possible to one conceptual operation as viewed by a non-expert user
  - Wrap up stateful service interactions into custom Processor implementations
  - Lowers the barrier preventing experts in other domains such as bioinformatics from entering or using e-Science
[Diagram. Labels: Taverna Workbench; Scufl language parser; Freefluo Workflow Enactor Core; Enactor; Processors: Web Service, Soaplab, Local App, BioMOBY]

Scufl
- Unified Flow Language: SCUFL does not dictate how the workflow is to be enacted; it is inherently declarative in intent
- Can potentially be translated to other workflow languages
- Can be arbitrarily abstract; any given workflow engine may require further definition of the language before it can be enacted
[Diagram. Labels: Taverna Workbench; Scufl language parser; Freefluo Workflow Enactor Core; Enactor; Processors: Web Service, Soaplab, Local App, BioMOBY]

[Example workflow diagram] One input, three outputs and eight processors. All the processors are labelled top to bottom with input ports, processor name and output ports. All the processors here are standard WSDL-described web services, except for "Pepstats", which is a Soaplab processor. All the links are data links except for two coordination links on the right hand side. The links are labelled with syntactic type information: "l(text/plain)" indicates a list of plain text strings.
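A list-typed link such as "l(text/plain)" feeding a processor that handles one string at a time is where implicit iteration comes in. The sketch below shows the general idea only, assuming that lists are mapped element by element and nested lists are handled recursively; it is not the enactor's actual iteration strategy, and `invoke` is a hypothetical helper.

```python
from typing import Any, Callable


def invoke(processor: Callable[[Any], Any], value: Any) -> Any:
    """Apply a single-item processor to a value or to every item of a list."""
    if isinstance(value, list):
        # Implicit iteration: preserve the collection structure in the output.
        return [invoke(processor, item) for item in value]
    return processor(value)
```

With this helper, `invoke(str.upper, ["acgt", "ttga"])` calls the processor once per string and returns a list of the same shape, so the workflow author never writes an explicit loop.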

Workflow In and Outs
[Diagram. Inputs to the Enactor: workflow script; failure policy; alternates list; metadata template. Connected components: MIR Data Store (LSID + data); MIR Metadata Store (LSIDs + metadata); External Data Store (LSID, data); Services (invocation + data); Event Notification Service (events); Service Discovery]

Fault tolerance
- Failure of workflow engine
  - P2P architecture
  - XML serialisation
  - Checkpointing
- Failure of services or network
  - User defined retry policy
  - Alternate replicas
  - Alternate list
- Automatic choices for domain services are undesired by users
[Screenshot labels: Alternate Processor; Retry, delay and backoff configuration]
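A user-defined retry policy with delay, backoff and an alternates list might look like the sketch below. The function name, parameters and defaults are assumptions made for this example rather than Taverna's configuration options, and services are modelled as zero-argument callables.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def invoke_with_policy(
    candidates: Sequence[Callable[[], T]],
    max_retries: int = 2,
    delay: float = 0.5,
    backoff: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try the primary service, then each alternate, retrying with backoff."""
    last_error: Optional[Exception] = None
    for service in candidates:
        wait = delay
        for attempt in range(max_retries + 1):
            try:
                return service()
            except Exception as exc:
                last_error = exc
                if attempt < max_retries:
                    # Wait before retrying the same service, then lengthen the wait.
                    sleep(wait)
                    wait *= backoff
        # Retries exhausted: fall through to the next alternate in the list.
    raise RuntimeError("all candidate services failed") from last_error
```

The alternates are supplied by the caller as an explicit ordered list, in line with the point that users do not want domain services chosen for them automatically.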

Fault tolerance
[Processor state diagram. States and transitions shown: scheduled and waiting for data; data ready; types match; data mismatch; can iterate; invoking; success; error; timeout; retries left; waiting to retry; alternate available; creating alternate processor; service failure; aborted; instantiation error; complete; constructing iterator; invoking with implicit iteration; adding item to result data set; done iterating; allow partials]

Status reporting

Whither BPEL?
- Focus: scripting simple request/response services vs. choreographing business processes
- Complexity: Scufl is simple enough for bioinformaticians to develop workflows
- Generality: extensible processor support vs. Web Services only
- Provenance generation

What needs to be done
- Free-standing web service
- Long-running workflows
  - Computationally-intensive services
  - Access to a reliable high performance BLAST service that reflects NCBI BLAST: NCBioGrid?
- Scalability
  - Large documents: data staging
- Debugging environment: services / workflows are brittle
- Interactivity
  - Version 1 had user proxy as an actor
  - The original process split into 3 steps:
    - Identification of candidate overlapping nucleotide sequences
    - Characterisation of nucleotide sequence
    - Characterisation of any gene product in the sequence

OGSA-DQP
- Used in the Graves' disease work
- Uses OGSA-DAI data access services to access individual data resources
- A single query can access and join data from more than one OGSA-DAI wrapped data resource
- Supports orchestration of computational as well as data access services
- Interactive interface for integrating resources and executing requests
- Implicit, pipelined and partitioned parallelism

Publications
- T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Senger, M. Greenwood, T. Carver, K. Glover, M. R. Pocock, A. Wipat, P. Li. Taverna: A tool for the composition and enactment of bioinformatics workflows. Accepted for Bioinformatics Journal, 16 June 2004.
- T. Oinn, M. Addis, J. Ferris, D. Marvin, M. Greenwood, C. Goble, A. Wipat, P. Li, T. Carver. Delivering Web Service Coordination Capability to Users. In Thirteenth International World Wide Web Conference (WWW2004), New York, May 2004.
- M. Addis, J. Ferris, M. Greenwood, D. Marvin, P. Li, T. Oinn and A. Wipat. Experiences with eScience workflow specification and enactment in bioinformatics. Proceedings of UK e-Science All Hands Meeting 2003.
- M.N. Alpdemir, A. Mukherjee, N.W. Paton, P. Watson, A.A.A. Fernandes, A. Gounaris and J. Smith. Service-based Distributed Querying on the Grid. In Proceedings of the First International Conference on Service Oriented Computing, 15-18 December 2003, Trento, Italy. Springer.
- J. Smith, A. Gounaris, P. Watson, N.W. Paton, A.A.A. Fernandes and R. Sakellariou. Distributed Query Processing on the Grid. International Journal of High Performance Computing Applications, Volume 17, Issue 04, November.
- N. Sharman, N. Alpdemir, J. Ferris, M. Greenwood, P. Li, C. Wroe. The myGrid Information Model. Proc. UK e-Science 2nd All Hands Meeting, Nottingham, UK, 1-3 Sept 2004.