Provenance Challenge: myGrid
David De Roure, University of Southampton
Jun Zhao, Carole Goble and Daniele Turi, University of Manchester
Outline
Short team introduction
Workflow implementation
Provenance schema and storage
Provenance queries
Suggestions
Reflection
Acknowledgement
Provenance Challenge Overview
Given an abstract workflow
Implement this workflow in your system
Collect provenance from runs of this workflow
Present the implemented workflow and collected provenance
Answer a list of provenance questions and present these answers
Taverna and myGrid
A UK e-Science project to build middleware for in silico experiments by individual life scientists, stuck in under-resourced labs, who use other people's applications.
Sequence analysis, microarray analysis, proteomics, chemoinformatics, image processing, rendering Dilbert cartoons
Scufl
Data links
Control links: limited support
Failure tolerance: retry and alternative services
Implicit iterations: cross/dot iterations
Nested workflows
Semantic metadata annotations
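The two implicit iteration strategies listed above can be illustrated with a minimal Python sketch. This is not Scufl or Taverna code. It assumes that "cross" pairs every element of one input list with every element of the other, and that "dot" pairs elements by position. The function names `cross_iterate`, `dot_iterate` and `align` are hypothetical, and stopping at the shorter list in the dot case is a property of this sketch only.

```python
from itertools import product


def cross_iterate(process, xs, ys):
    """Apply process to every combination of one element from xs and one from ys."""
    return [process(x, y) for x, y in product(xs, ys)]


def dot_iterate(process, xs, ys):
    """Apply process to elements paired by position (stops at the shorter list)."""
    return [process(x, y) for x, y in zip(xs, ys)]


def align(image, header):
    # Hypothetical stand-in for a workflow process taking two inputs.
    return f"align({image},{header})"


images = ["img1", "img2"]
headers = ["hdr1", "hdr2"]

cross_result = cross_iterate(align, images, headers)  # 2 x 2 invocations
dot_result = dot_iterate(align, images, headers)      # 2 invocations
```

With two inputs of length two, the cross strategy yields four invocations and the dot strategy yields two, which is why the choice of strategy affects how many process runs end up in the provenance record.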
What has to be done
Design the workflow using Scufl in Taverna
Build services (Web services, Soaplab services, local Java, or BeanShell scripts) to implement each process
Gather and process the real data products
Doing it properly
Wrap each procedure as a service
Process the real data as a real experiment
Use iterations, nested workflows or interactive workflows supported by Taverna
Real examples:
  – Chimatica: supports high-throughput workflows using Taverna 1.X
  – MIAS-Grid: uses myGrid to build medical image processing workflows
What we actually did
Realize each procedure as a BeanShell script, to avoid real service implementation and deployment
Pass pseudo data products rather than real image data products
But keep the metadata about data products along with provenance to answer semantic questions
Implemented Scufl workflow in Taverna
Provenance schema
Four aspects
  – Workflow provenance
  – Data provenance
  – Organization provenance
  – Knowledge provenance
Provenance ontology
  – RDFS
  – OWL Lite
Provenance Pyramid Model
[Figure: pyramid with four levels: Knowledge Level, Organization Level, Data Level, Workflow Level. Example labels: serviceInvocation1, serviceInvocation2, data1, data2, data3, data4, WSDL, Genomic Project, similarData.]
[Figure: schema diagram spanning organisation provenance, workflow provenance and data/knowledge provenance.
Classes: Experimenter, Organisation, Workflow, Workflow run, Process, ProcessRun, ProcessIteration, Data, Atomic Data, Data Collection.
Properties: runsWorkflow, launchedBy, belongsTo, hasInput, executesProcessRun, iteration, workflowOutput, derivedFrom, createdBy, containsData, isA, runsProcess, hasProcesses.
Example labels: "web service invocation of NCBI", "NCBI"; knowledge statements, e.g. similar_sequence_to.]
Workflow provenance ontology
Data provenance ontology
Organization & Knowledge provenance ontology
userPredicate
  – Semantic concept about a data product or a service, e.g. nucleotide_sequence
  – Semantic (knowledge) relationships between two data products, e.g. similar_sequence_to
Collected & stored provenance
LSIDs used to identify:
  – data, workflows, workflow runs
  – LSIDs are names of graphs
Named RDF graphs
  – retrieve whole workflow runs
  – implementation in Sesame2 native store
  – scalable
  – alpha release (bugs)
NG4J (Jena + MySQL)
  – scalability issues
Future implementations: Oracle and Boca
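The idea of using an LSID as the name of a graph, so that a whole workflow run can be retrieved in one step, can be sketched in plain Python. This is an illustration only and does not use Sesame2 or NG4J. The class `NamedGraphStore`, the authority `example.org`, the namespaces `wfrun` and `data`, and the predicate strings are all hypothetical; only the general `urn:lsid:authority:namespace:object` shape follows the LSID naming scheme.

```python
from collections import defaultdict


class NamedGraphStore:
    """Toy store: each graph name maps to a set of (subject, predicate, object) triples."""

    def __init__(self):
        self._graphs = defaultdict(set)

    def add(self, graph_name, subject, predicate, obj):
        self._graphs[graph_name].add((subject, predicate, obj))

    def graph(self, graph_name):
        """Return a copy of every triple stored under one graph name."""
        return set(self._graphs.get(graph_name, set()))

    def graph_names(self):
        return sorted(self._graphs)


# Hypothetical LSIDs: one names a workflow run (and its graph), two name data items.
RUN_1 = "urn:lsid:example.org:wfrun:1"
RUN_2 = "urn:lsid:example.org:wfrun:2"
DATA_A = "urn:lsid:example.org:data:a"
DATA_B = "urn:lsid:example.org:data:b"

store = NamedGraphStore()
store.add(RUN_1, DATA_B, "derivedFrom", DATA_A)
store.add(RUN_1, RUN_1, "workflowOutput", DATA_B)
store.add(RUN_2, RUN_2, "workflowOutput", DATA_A)

# Retrieving a whole workflow run is a single lookup by graph name.
run_1_triples = store.graph(RUN_1)
```

Keeping one graph per run means the provenance of a run can be fetched, compared or deleted as a unit without scanning triples that belong to other runs.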
Answer matrix
1. Find the process that led to d0 (Atlas X Graphic)
2. Find the process that led to d0 (Atlas X Graphic), excluding everything prior to d1 (the averaging of images with softmean)
3. Find the Stage 3, 4 and 5 details of the process that led to d0 (Atlas X Graphic)
4. Find all invocations of procedure align_warp using p0 (a twelfth order nonlinear 1365 parameter model)
5. Find all Atlas Graphic images outputted from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095
  – Find all the d0 that are derived from d1 where value(d1) =
6. Find all output averaged images of softmean, where the warped images taken as input were align_warped using a twelfth order nonlinear 1365 parameter model
  – Find all the d0 that are derived from d1 where derivedFrom(d1) = d2
Categories covered: Process provenance; Data provenance
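Queries 1 and 2 both amount to walking derivation links backwards from a data product. The following Python sketch shows one way to do that over an in-memory table; it is not the query mechanism used in the myGrid provenance store. The data names (`atlas_x_graphic`, `warp_1`, and so on) are hypothetical, the stage names `reslice` and `slicer` are taken from the challenge workflow definition rather than from these slides, and the `stop_at` parameter is an assumption about how "excluding everything prior to d1" could be expressed.

```python
def lineage(target, produced_by, stop_at=frozenset()):
    """Return the process names that led to target, nearest first.

    produced_by maps a data name to (process name, list of input data names).
    Data named in stop_at still report the process that made them,
    but nothing upstream of them is followed.
    """
    seen = set()
    processes = []
    stack = [target]
    while stack:
        item = stack.pop()
        if item in seen or item not in produced_by:
            continue  # already visited, or a raw input with no producer
        seen.add(item)
        process, inputs = produced_by[item]
        if process not in processes:
            processes.append(process)
        if item not in stop_at:
            stack.extend(inputs)
    return processes


# Hypothetical, simplified record of one run with two input images.
produced_by = {
    "atlas_x_graphic": ("convert", ["atlas_x_slice"]),
    "atlas_x_slice": ("slicer", ["atlas_image"]),
    "atlas_image": ("softmean", ["resliced_1", "resliced_2"]),
    "resliced_1": ("reslice", ["warp_1"]),
    "resliced_2": ("reslice", ["warp_2"]),
    "warp_1": ("align_warp", ["anatomy_1", "reference"]),
    "warp_2": ("align_warp", ["anatomy_2", "reference"]),
}

full_lineage = lineage("atlas_x_graphic", produced_by)
from_softmean = lineage("atlas_x_graphic", produced_by, stop_at={"atlas_image"})
```

Here `full_lineage` corresponds to query 1 and `from_softmean` to query 2, with `atlas_image` playing the role of d1.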
Answer matrix
7. A user has run the workflow twice, in the second instance replacing each procedure (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs.
8. Find the outputs of align_warp where the inputs are annotated with center=UChicago.
9. Find all the graphical atlas sets that have metadata annotation studyModality with values speech, visual or audio, and return all other annotations to these files.
Categories covered: Cross-run provenance; Knowledge provenance
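Query 8 combines process provenance with annotations on data products. A minimal Python sketch of that join follows, assuming invocations and annotations are held as plain dictionaries. The record layout, the function name `outputs_where_input_annotated`, the data names and the second center value are hypothetical, and treating "the inputs are annotated" as "at least one input is annotated" is an assumption of this sketch.

```python
def outputs_where_input_annotated(invocations, annotations, process, key, value):
    """Collect outputs of the named process whose inputs carry key=value.

    An invocation matches if at least one of its inputs has the annotation.
    """
    matches = []
    for invocation in invocations:
        if invocation["process"] != process:
            continue
        if any(annotations.get(name, {}).get(key) == value
               for name in invocation["inputs"]):
            matches.extend(invocation["outputs"])
    return matches


# Hypothetical process-run records.
invocations = [
    {"process": "align_warp", "inputs": ["anatomy_1"], "outputs": ["warp_1"]},
    {"process": "align_warp", "inputs": ["anatomy_2"], "outputs": ["warp_2"]},
    {"process": "reslice", "inputs": ["warp_1"], "outputs": ["resliced_1"]},
]

# Hypothetical annotations attached to data products.
annotations = {
    "anatomy_1": {"center": "UChicago"},
    "anatomy_2": {"center": "ElsewhereLab"},
}

uchicago_outputs = outputs_where_input_annotated(
    invocations, annotations, "align_warp", "center", "UChicago"
)
```

The same function shape would cover other annotation keys, since the process name, key and value are all parameters.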
Suggested Workflow Variants: Implicit iterations
Suggested Workflow Variants: Nested workflow runs
Suggested Workflow Variants: User interactions
Suggested Queries
Compare, merge and union provenance from different workflow runs
Explain why different outputs were produced in repeated workflow runs
Replay a workflow run
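If each run's provenance is held as a set of triples, comparing and taking the union of runs reduces to set operations. The Python sketch below illustrates this under that assumption. The predicate `feeds`, the function names and the two small run graphs are hypothetical; the test data borrows the challenge scenario in which convert is replaced by pgmtoppm followed by pnmtojpeg.

```python
def compare_runs(first, second):
    """Split two runs' triples into those unique to each run and those shared."""
    return {
        "only_in_first": first - second,
        "only_in_second": second - first,
        "shared": first & second,
    }


def union_runs(*runs):
    """Merge any number of runs into one set of triples."""
    merged = set()
    for run in runs:
        merged |= run
    return merged


# Hypothetical process-level graphs for two runs of the workflow.
run_1 = {
    ("softmean", "feeds", "slicer"),
    ("slicer", "feeds", "convert"),
}
run_2 = {
    ("softmean", "feeds", "slicer"),
    ("slicer", "feeds", "pgmtoppm"),
    ("pgmtoppm", "feeds", "pnmtojpeg"),
}

diff = compare_runs(run_1, run_2)
merged = union_runs(run_1, run_2)
```

This only works when both runs describe processes with comparable identifiers; triples that mention run-specific identifiers would first need to be mapped to a shared vocabulary.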
Categorisation of queries
Four levels:
1. Queries to support the provenance browser
2. Semantic queries
3. Integration queries
4. Pre-canned queries to support provenance usage scenarios
Live systems
Taverna: Provenance plugin and browser beta release: bundled with the Taverna release 1.4.
Provenance ontology:
System requirements:
  – Windows, Linux, Mac
  – Java 5.0
  – MySQL database (optional)
Reflection
A systematic provenance query framework is needed
Separate data and provenance metadata
  – Better storage scalability
  – Avoid archiving duplicate data products
A consensus on provenance models is needed
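One possible way to separate data from provenance metadata and avoid archiving duplicates is to store each data product once under a key derived from its content, and let provenance records refer to that key. The slides do not say how the separation would be done, so the Python sketch below is an assumption throughout: the class `DataStore`, the record fields and the use of SHA-256 as the key are all illustrative choices.

```python
import hashlib


class DataStore:
    """Toy content-addressed store: identical content is kept only once."""

    def __init__(self):
        self._blobs = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha256(content).hexdigest()
        self._blobs.setdefault(key, content)  # no-op if already stored
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]

    def __len__(self):
        return len(self._blobs)


data_store = DataStore()
provenance_records = []  # metadata only; data is referenced by key

# Two hypothetical runs that happen to produce the same output bytes.
for run_name in ("run-1", "run-2"):
    key = data_store.put(b"pseudo atlas graphic")
    provenance_records.append({"run": run_name, "output": key})
```

After both runs, the provenance list holds two records while the data store holds a single copy of the output, which is the storage saving the separation is meant to deliver.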
Acknowledgement
The myGrid Taverna team: Tom Oinn, Stuart Owen, Stian Soiland, David Withers, Katy Wolstencroft and June Finch
Daniele Turi: provenance plugin
Matthew Gamble: Taverna provenance browser
Chris Wroe from the original myGrid project