Provenance Challenge: myGrid
David De Roure, University of Southampton
Jun Zhao, Carole Goble and Daniele Turi, University of Manchester

Outline
Short team introduction
Workflow implementation
Provenance schema and storage
Provenance queries
Suggestions
Reflection
Acknowledgement

Provenance Challenge Overview
Given an abstract workflow
Implement this workflow in your system
Collect provenance from runs of this workflow
Present the implemented workflow and collected provenance
Answer a list of provenance questions and present these answers

Taverna and myGrid
A UK e-Science project to build middleware for in silico experiments by individual life scientists, stuck in under-resourced labs, who use other people’s applications.
Sequence analysis, microarray analysis, proteomics, chemoinformatics, image processing, rendering Dilbert cartoons

Scufl
Data links
Control links: limited support
Failure tolerance: retry and alternative services
Implicit iterations: cross/dot iterations
Nested workflows
Semantic metadata annotations
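The cross and dot iteration strategies can be illustrated with a minimal Python sketch. This is not Taverna or Scufl code: the names `cross_iterate`, `dot_iterate` and `align` are hypothetical, and the sketch assumes that "cross" means one call per combination of list elements while "dot" means one call per position, pairing elements index by index.

```python
from itertools import product


def cross_iterate(process, *input_lists):
    """Call process once for every combination of elements (cross product)."""
    return [process(*combo) for combo in product(*input_lists)]


def dot_iterate(process, *input_lists):
    """Call process once per position, pairing elements index by index.

    Assumption: iteration stops at the shortest list, which is how zip behaves.
    """
    return [process(*combo) for combo in zip(*input_lists)]


def align(image, reference):
    # Hypothetical stand-in for a service that accepts two single values.
    return f"{image}->{reference}"


images = ["img1", "img2"]
references = ["refA", "refB"]

cross_results = cross_iterate(align, images, references)  # four calls
dot_results = dot_iterate(align, images, references)      # two calls
```

The service itself only ever sees single values; the iteration strategy decides how many times it is invoked when lists arrive on its inputs.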

What has to be done
Design the workflow using Scufl in Taverna
Build services (Web services, Soaplab services, local Java, or BeanShell scripts) to implement each process
Gather and process the real data products

Doing it properly
Wrap each procedure as a service
Process the real data as a real experiment
Use iterations, nested workflows or interactive workflows supported by Taverna
Real examples:
–Chimatica supports high-throughput workflows using Taverna 1.X
–MIAS-Grid uses myGrid to build medical image processing workflows

What we actually did
Realized each procedure as a BeanShell script, to avoid real service implementation and deployment
Passed pseudo data products rather than real image data products
But kept the metadata about data products along with provenance, to answer semantic questions

Implemented Scufl workflow in Taverna

Provenance schema
Four aspects
–Workflow provenance
–Data provenance
–Organization provenance
–Knowledge provenance
Provenance ontology
–RDFS
–OWL Lite

Provenance Pyramid Model (diagram)
Levels: Knowledge Level, Organization Level, Data Level, Workflow Level
Example nodes: serviceInvocation1, serviceInvocation2, data1, data2, data3, data4, WSDL, Genomic Project, similarData

Provenance model (diagram)
Organisation provenance: Experimenter, Organisation
Workflow provenance: Workflow, Workflow run, Process, ProcessRun, ProcessIteration
Data/knowledge provenance: Data, Atomic Data, Data Collection, Knowledge statements (e.g. similar_sequence_to)
Relations shown: runsWorkflow, launchedBy, belongsTo, hasInput, executesProcessRun, workflowOutput, derivedFrom, createdBy, containsData, isA, runsProcess, hasProcesses
Examples shown: web service invocation of NCBI; iteration

Workflow provenance ontology

Data provenance ontology

Organization & Knowledge provenance ontology
userPredicate
–Semantic concept about a data product or a service, e.g. nucleotide_sequence
–Semantic (knowledge) relationships between two data products, e.g. similar_sequence_to

Collected & stored provenance
LSIDs used to identify:
–data, workflows, workflow runs
–LSIDs are names of graphs
Named RDF graphs
–retrieve whole workflow runs
–implementation in Sesame2 native store
–scalable
–alpha release (bugs)
NG4J (Jena + MySQL)
–scalability issues
Future implementations: Oracle and Boca
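The idea of using an LSID as the name of an RDF graph, so that a whole workflow run can be retrieved in one lookup, can be sketched as follows. This is a toy in-memory structure, not the Sesame2 or NG4J API. The class `NamedGraphStore`, the `example.org` authority and the direction of each property are assumptions made for illustration; only the `urn:lsid:authority:namespace:object` shape follows the LSID syntax.

```python
from collections import defaultdict


class NamedGraphStore:
    """Toy quad store: every triple lives in a graph named by an identifier."""

    def __init__(self):
        self._graphs = defaultdict(set)

    def add(self, graph_name, subject, predicate, obj):
        self._graphs[graph_name].add((subject, predicate, obj))

    def graph(self, graph_name):
        """Return a copy of every triple recorded under one graph name."""
        return set(self._graphs.get(graph_name, set()))


# Hypothetical LSID-style identifiers, invented for this example.
WORKFLOW = "urn:lsid:example.org:workflow:challenge"
RUN_1 = "urn:lsid:example.org:workflowrun:1"
RUN_2 = "urn:lsid:example.org:workflowrun:2"
DATA_A = "urn:lsid:example.org:data:a"
DATA_B = "urn:lsid:example.org:data:b"

store = NamedGraphStore()
store.add(RUN_1, RUN_1, "runsWorkflow", WORKFLOW)
store.add(RUN_1, DATA_B, "derivedFrom", DATA_A)
store.add(RUN_2, RUN_2, "runsWorkflow", WORKFLOW)

# Retrieving a whole workflow run is a single lookup by graph name.
run_1_triples = store.graph(RUN_1)
```

Keeping one graph per run means runs can be fetched, compared or discarded as units without scanning every triple in the store.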

Answer matrix
Process provenance
1. Find the process that led to d0 (Atlas X Graphic)
2. Find the process that led to d0 (Atlas X Graphic), excluding everything prior to d1 (the averaging of images with softmean)
3. Find the Stage 3, 4 and 5 details of the process that led to d0 (Atlas X Graphic)
4. Find all invocations of procedure align_warp using p0 (a twelfth order nonlinear 1365 parameter model)
Data provenance
5. Find all Atlas Graphic images outputted from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095
–Find all the d0 that are derived from d1 where value(d1) =
6. Find all output averaged images of softmean, where the warped images taken as input were align_warped using a twelfth order nonlinear 1365 parameter model
–Find all the d0 that are derived from d1 where derivedFrom(d1) = d2
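Queries 1 and 2 amount to walking backwards through the recorded derivations. The following Python sketch shows one way to do that over a hand-written trace. The `PRODUCED_BY` table and the data names are invented and heavily simplified, the function `lineage` is hypothetical, and this is not how the myGrid provenance store is queried.

```python
# Hypothetical provenance records: output -> (process that produced it, inputs).
# Procedure names follow the challenge workflow; the trace itself is invented.
PRODUCED_BY = {
    "warp1": ("align_warp", ["anatomy1", "reference"]),
    "resliced1": ("reslice", ["warp1"]),
    "atlas": ("softmean", ["resliced1"]),
    "atlas_x_slice": ("slicer", ["atlas"]),
    "atlas_x_graphic": ("convert", ["atlas_x_slice"]),
}


def lineage(target, stop_at=None):
    """Return (process, output) steps that led to target, walking backwards.

    If stop_at is given, the step that produced stop_at is still reported,
    but nothing earlier than it is visited (one reading of query 2).
    """
    steps, seen, stack = [], set(), [target]
    while stack:
        item = stack.pop()
        if item in seen or item not in PRODUCED_BY:
            continue  # already visited, or a raw workflow input
        seen.add(item)
        process, inputs = PRODUCED_BY[item]
        steps.append((process, item))
        if item != stop_at:
            stack.extend(inputs)
    return steps


# Query 1: everything that led to the Atlas X Graphic.
full = lineage("atlas_x_graphic")
# Query 2: the same, excluding everything prior to the softmean averaging.
after_softmean = lineage("atlas_x_graphic", stop_at="atlas")
```

The generalised form in the matrix (d0 derived from d1) maps onto the `target` and `stop_at` arguments.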

Answer matrix
Provenance across runs
7. A user has run the workflow twice, in the second instance replacing each procedure (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs.
Knowledge provenance
8. Find the outputs of align_warp where the inputs are annotated with center=UChicago.
9. Find all the graphical atlas sets that have metadata annotation studyModality with values speech, visual or audio, and return all other annotations to these files.
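Queries 7 and 8 can be sketched in a few lines of Python over hand-written records. Everything here is hypothetical: the record layout, the function names, and the choice that an invocation matches when any one of its inputs carries the annotation. It illustrates the shape of the questions, not the queries actually run against the provenance store.

```python
def diff_runs(run_a, run_b):
    """Report procedures that appear in only one of two runs."""
    set_a, set_b = set(run_a), set(run_b)
    return {
        "only_in_first": sorted(set_a - set_b),
        "only_in_second": sorted(set_b - set_a),
    }


def outputs_with_annotated_inputs(invocations, annotations, procedure, key, value):
    """Return outputs of `procedure` where some input is annotated key=value."""
    found = []
    for inv in invocations:
        if inv["procedure"] != procedure:
            continue
        if any(annotations.get(i, {}).get(key) == value for i in inv["inputs"]):
            found.extend(inv["outputs"])
    return found


# Query 7: the second run replaces convert with pgmtoppm followed by pnmtojpeg.
run_1 = ["align_warp", "reslice", "softmean", "slicer", "convert"]
run_2 = ["align_warp", "reslice", "softmean", "slicer", "pgmtoppm", "pnmtojpeg"]
differences = diff_runs(run_1, run_2)

# Query 8: outputs of align_warp whose inputs are annotated center=UChicago.
invocations = [
    {"procedure": "align_warp", "inputs": ["anatomy1"], "outputs": ["warp1"]},
    {"procedure": "align_warp", "inputs": ["anatomy2"], "outputs": ["warp2"]},
    {"procedure": "reslice", "inputs": ["warp1"], "outputs": ["resliced1"]},
]
annotations = {
    "anatomy1": {"center": "UChicago"},
    "anatomy2": {"center": "elsewhere"},
}
uchicago_outputs = outputs_with_annotated_inputs(
    invocations, annotations, "align_warp", "center", "UChicago"
)
```

Query 7 is answered from workflow provenance alone, while query 8 needs the annotations kept alongside it, which is why the pseudo data products still carried their metadata.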

Suggested Workflow Variants Implicit iterations

Suggested Workflow Variants Nested workflow runs

Suggested Workflow Variants User interactions

Suggested Queries
Compare, merge and union provenance from different workflow runs
Explain why different outputs were produced in repeated workflow runs
Replay a workflow run

Categorisation of queries
Four levels:
1. queries to support the provenance browser
2. semantic queries
3. integration queries
4. pre-canned queries to support provenance usage scenarios

Live systems
Taverna: Provenance plugin and browser beta release, bundled with the Taverna 1.4 release
Provenance ontology:
System requirements:
–Windows, Linux, Mac
–Java 5.0
–MySQL database (optional)

Reflection
A systematic provenance query framework is needed
Separate data and provenance metadata
–Better storage scalability
–Avoid archiving duplicate data products
A consensus on provenance models is needed
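One possible way to keep data products apart from provenance metadata, and to avoid archiving the same product twice, is to store each product under a hash of its content and let provenance records refer to that key. The slides do not say how the separation should be done, so content hashing is a swapped-in technique here, and `DataStore`, `record_output` and the record fields are hypothetical names.

```python
import hashlib


class DataStore:
    """Stores each distinct data product once, keyed by its SHA-256 digest."""

    def __init__(self):
        self._blobs = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha256(content).hexdigest()
        self._blobs.setdefault(key, content)  # identical content is not stored again
        return key

    def __len__(self):
        return len(self._blobs)


data_store = DataStore()
provenance = []  # metadata only: which run and process produced which key


def record_output(run_id, process, content):
    """Archive the data product and log a provenance record pointing to it."""
    key = data_store.put(content)
    provenance.append({"run": run_id, "process": process, "data": key})
    return key


# Two runs that produce byte-identical output share a single stored copy.
key_1 = record_output("run1", "softmean", b"atlas image bytes")
key_2 = record_output("run2", "softmean", b"atlas image bytes")
```

The provenance list grows with every run, while the data store grows only when a genuinely new product appears.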

Acknowledgement
The myGrid Taverna team: Tom Oinn, Stuart Owen, Stian Soiland, David Withers, Katy Wolstencroft and June Finch
Daniele Turi: provenance plugin
Matthew Gamble: Taverna provenance browser
Chris Wroe from the original myGrid project