UTPB: A Benchmark for Scientific Workflow Provenance Storage and Querying Systems
Artem Chebotko
Joint work with E. De Hoyos, C. Gomez, A. Kashlev, X. Lian, and C. Reilly
Department of Computer Science, University of Texas - Pan American
6th IEEE International Workshop on Scientific Workflows, June 24, 2012
Provenance in eScience
- Metadata that captures the history of an experiment
  - Problem diagnosis
  - Result interpretation
  - Experiment reproducibility
- Scientific Workflow Community Provenance Challenges
  - 2006: understanding and sharing information about provenance representations and capabilities
  - 2006: interoperability of different provenance systems
  - 2009: evaluating various aspects of OPM
  - 2010: showcasing OPM in the context of novel applications
- Open Provenance Model
- W3C Provenance Working Group
UTPB – University of Texas Provenance Benchmark
SWFMS and Provenance
- Systems: Taverna, Kepler, View, VisTrails, Pegasus, Swift, Galaxy, Triana, OPMProv, Karma, RDFProv, etc.
- Support provenance collection
- Use proprietary or third-party systems to manage provenance
- Differ in provenance models, provenance vocabularies, inference support, and query languages
Provenance Management Requirements
- Non-functional
  - Data storage and querying efficiency and scalability
  - Inference soundness and completeness
- Functional
  - Support of a particular provenance model, provenance vocabulary, query type, inference feature, visualization, and analysis
- No standard way to evaluate provenance systems with respect to these requirements
Provenance System Benchmarking
- Challenges
  - Well-documented and easy-to-understand datasets
  - Provenance data in a range of sizes
  - Provenance data with predefined inferred results that are known to be correct and complete
  - Test queries
  - Performance metrics
  - Result interpretation
- Existing empirical studies of provenance systems use ad-hoc benchmarks or benchmarks developed in other research domains (see the paper for details)
Our Contributions
- University of Texas Provenance Benchmark (UTPB)
  - Focus on scalability and inference
  - Flexible data generator
  - 27 provenance templates
    - 3 virtual workflows
    - 3 workflow execution scenarios
    - 3 provenance vocabularies
  - 27 test queries in 11 categories
  - 5 performance metrics
Talk Outline
- University of Texas Provenance Benchmark
  - UTPB Architecture
  - Provenance Templates
  - Provenance Generation
  - UTPB Queries
  - Performance Metrics
  - Interpretation of Benchmark Results
- Summary and Future Work
UTPB Architecture
Provenance Templates
- A provenance template is a document that serializes the provenance of one workflow execution according to a particular provenance model and provenance vocabulary.
- Provenance templates make the benchmark extensible and thus adaptable to the changing requirements of the field.
- UTPB currently supports:
  - 1 provenance model (OPM)
  - 3 virtual workflows
  - 3 provenance vocabularies (OPMV, OPMO, OPMX)
  - 3 workflow execution scenarios
  - 1 x 3 x 3 x 3 = 27 provenance templates
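The count of 27 templates is the product of the four dimensions listed on this slide. A minimal Python sketch of that enumeration follows. The short labels for workflows and scenarios and the underscore-joined naming scheme are assumptions made for illustration; they are not identifiers defined by UTPB.

```python
from itertools import product

# Dimensions named on the slides. The label strings below are shorthand
# invented for this sketch, not UTPB's own template names.
MODELS = ["OPM"]
WORKFLOWS = ["database_experiment", "jeans_manufacturing", "french_press_coffee"]
VOCABULARIES = ["OPMV", "OPMO", "OPMX"]
SCENARIOS = ["success", "error", "success_with_inferences"]


def template_names():
    """Return one hypothetical name per (model, workflow, vocabulary, scenario)."""
    return [
        "_".join(combo)
        for combo in product(MODELS, WORKFLOWS, VOCABULARIES, SCENARIOS)
    ]


if __name__ == "__main__":
    names = template_names()
    print(len(names))  # 1 x 3 x 3 x 3 combinations
    print(names[0])
```

Adding a second provenance model or a fourth vocabulary to the lists would grow the template set multiplicatively, which is the sense in which templates make the benchmark extensible.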
Virtual Workflow 1: Database Experiment
- Processes: 7
- Artifacts: 14
- Accounts: 2
- Agents: 1
Virtual Workflow 2: Jeans Manufacturing
- Processes: 13
- Artifacts: 18
- Accounts: 3
- Agents: 2
- Several processes use and generate the same artifacts and are “executed” in parallel
Virtual Workflow 3: French Press Coffee
- Processes: 15
- Artifacts: 15
- Accounts: 4
- Agents: 0
- Several branches with multiple processes are “executed” in parallel
- Several processes trigger each other without a record of using or generating artifacts
Provenance Vocabularies
- Almost every existing scientific workflow management system defines its own proprietary model for provenance.
- Each model is serialized in some format, such as RDF, XML, or relational data, according to one or more predefined vocabularies or schemas.
- Open Provenance Model (OPM): a layer of interoperability
  - OPM Vocabulary (OPMV)
  - OPM Ontology (OPMO)
  - OPM XML Schema (OPMX)
Workflow Execution Scenarios
- Successful execution
- Incomplete execution with an error
- Successful execution with materialized provenance inferences
Provenance Generation (OPMV)
    # Named graph: utpb:.
    utpb:account_black_C0_T0 rdf:type.
    utpb:cuttingMachine_C0_T0 rdf:type opmv:Artifact.
    utpb:denim_C0_T0 rdfs:label "blue".
    utpb:andrey_C0_T0 rdf:type opmv:Agent.
    utpb:cutDenim_C0_T0 opmv:used utpb:cuttingMachine_C0_T0,
        utpb:cuttingPattern_C0_T0,
        utpb:denim_C0_T0.
    utpb:denimParts_C0_T0 opmv:wasGeneratedBy utpb:cutDenim_C0_T0.
    # Default graph.
Provenance Generation (OPMO)
    # Named graph: utpb:.
    utpb:account_black_C0_T0 rdf:type opmo:Account.
    utpb:cuttingMachine_C0_T0 rdf:type opmv:Artifact.
    utpb:propertyDenim_C0_T0 opmo:key utpb:keyDenimType_C0_T0 ;
        opmo:value "blue".
    utpb:andrey_C0_T0 rdf:type opmv:Agent.
    utpb:used1_C0_T0 rdf:type opmo:Used ;
        opmo:effect utpb:cutDenim_C0_T0 ;
        opmo:cause utpb:cuttingMachine_C0_T0 ;
        opmo:role utpb:roleMachine_C0_T0 ;
        opmo:pname utpb:_used1 ;
        opmo:account utpb:account_black_C0_T0.
    utpb:wgb1_C0_T0 rdf:type opmo:WasGeneratedBy ;
        opmo:effect utpb:denimParts_C0_T0 ;
        opmo:cause utpb:cutDenim_C0_T0 ;
        opmo:role utpb:roleDenim_C0_T0 ;
        opmo:pname utpb:_wgb1 ;
        opmo:account utpb:account_black_C0_T0.
    # Default graph.
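The OPMV and OPMO listings describe the same "used" dependency in two shapes: OPMV states it as one binary triple, while OPMO introduces an edge node that carries the effect, cause, role, and account. The Python sketch below illustrates that relationship for the `used` edge only. The function names and the mapping table are hypothetical and are not part of UTPB or its data generator, and the `opmo:pname` property is left out for brevity.

```python
# Sketch only: helper names are hypothetical, not UTPB code.
# Triples are plain (subject, predicate, object) tuples of prefixed names.

OPMV_TO_OPMO_CLASS = {
    "opmv:used": "opmo:Used",
}


def to_opmo_edge(edge_id, opmv_triple, role, account):
    """Expand a binary (effect, property, cause) statement into edge-node triples."""
    effect, prop, cause = opmv_triple
    return [
        (edge_id, "rdf:type", OPMV_TO_OPMO_CLASS[prop]),
        (edge_id, "opmo:effect", effect),
        (edge_id, "opmo:cause", cause),
        (edge_id, "opmo:role", role),
        (edge_id, "opmo:account", account),
    ]


def to_opmv_triple(edge_triples):
    """Collapse edge-node triples into one binary statement (role and account are dropped)."""
    props = {predicate: obj for _, predicate, obj in edge_triples}
    opmo_to_opmv = {cls: prop for prop, cls in OPMV_TO_OPMO_CLASS.items()}
    return (props["opmo:effect"], opmo_to_opmv[props["rdf:type"]], props["opmo:cause"])


if __name__ == "__main__":
    used = ("utpb:cutDenim_C0_T0", "opmv:used", "utpb:cuttingMachine_C0_T0")
    edge = to_opmo_edge(
        "utpb:used1_C0_T0", used,
        role="utpb:roleMachine_C0_T0",
        account="utpb:account_black_C0_T0",
    )
    for triple in edge:
        print(triple)
    print(to_opmv_triple(edge) == used)
```

The round trip loses the role and account, which is one reason the same query can be simpler to write against OPMV but more expressive against OPMO.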
Provenance Generation (OPMX)
[XML listing: a <utpb> root element with xmlns, xmlns:xsi, and xmlns:xsd declarations, containing the values "laser" and "Cutting machine"]
UTPB Queries
- 27 queries in 11 categories:
  - Graphs
  - Dependencies
  - Artifacts
  - Processes
  - Accounts
  - Agents
  - Roles
  - Values
  - Cross-Graph Queries
  - Inferences
  - Application-Specific
UTPB Queries: sample query by type and format

English:
    Find all artifact derivation dependencies in a particular provenance graph

SPARQL (OPMV):
    SELECT ?causeArtifact ?effectArtifact
    FROM NAMED
    WHERE {
      GRAPH utpb:opmGraph {
        ?effectArtifact opmv:wasDerivedFrom ?causeArtifact.
      }
    }

SPARQL (OPMO):
    SELECT ?causeArtifact ?effectArtifact
    FROM NAMED
    WHERE {
      GRAPH utpb:opmGraph {
        ?wdf rdf:type opmo:WasDerivedFrom.
        ?wdf opmo:cause ?causeArtifact.
        ?wdf opmo:effect ?effectArtifact.
      }
    }

XQuery (OPMX):
    declare default element namespace "
    { for $wdf in return {$wdf/effect}{$wdf/cause} }
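The two SPARQL versions differ in shape: the OPMV query matches a single triple pattern, while the OPMO query joins three patterns on the edge node `?wdf`. The Python sketch below mimics both shapes over in-memory triples to show that they return the same pairs. It is an illustration, not a SPARQL engine; the helper names and the edge identifier `utpb:wdf1_C0_T0` are hypothetical.

```python
# Sketch only: plain-Python stand-ins for the two query shapes.
# Triples are (subject, predicate, object) tuples of prefixed names.

def derivations_opmv(triples):
    """One pattern: ?effectArtifact opmv:wasDerivedFrom ?causeArtifact."""
    return sorted((s, o) for s, p, o in triples if p == "opmv:wasDerivedFrom")


def derivations_opmo(triples):
    """Three patterns joined on the edge node (?wdf in the SPARQL query)."""
    edges = {s for s, p, o in triples
             if p == "rdf:type" and o == "opmo:WasDerivedFrom"}
    cause = {s: o for s, p, o in triples if p == "opmo:cause"}
    effect = {s: o for s, p, o in triples if p == "opmo:effect"}
    return sorted((effect[e], cause[e]) for e in edges
                  if e in cause and e in effect)


OPMV_DATA = [
    ("utpb:denimParts_C0_T0", "opmv:wasDerivedFrom", "utpb:denim_C0_T0"),
]

OPMO_DATA = [
    # utpb:wdf1_C0_T0 is a made-up edge identifier for this example.
    ("utpb:wdf1_C0_T0", "rdf:type", "opmo:WasDerivedFrom"),
    ("utpb:wdf1_C0_T0", "opmo:effect", "utpb:denimParts_C0_T0"),
    ("utpb:wdf1_C0_T0", "opmo:cause", "utpb:denim_C0_T0"),
]

if __name__ == "__main__":
    print(derivations_opmv(OPMV_DATA))
    print(derivations_opmo(OPMO_DATA))
```

Both functions return (effect, cause) pairs, so the results can be compared directly across vocabularies.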
UTPB Queries: sample query results (OPMV)

effectArtifact                 causeArtifact
utpb:denimParts_C0_T0          utpb:denim_C0_T0
utpb:rawJeans_C0_T0            utpb:denimParts_C0_T0
utpb:rawJeans_C0_T0            utpb:sewingThread_C0_T0
utpb:washedJeans_C0_T0         utpb:rawJeans_C0_T0
utpb:inspectedJeans_C0_T0      utpb:washedJeans_C0_T0
utpb:buttonedJeans_C0_T0       utpb:inspectedJeans_C0_T0
utpb:buttonedJeans_C0_T0       utpb:buttons_C0_T0
utpb:qualityJeans_C0_T0        utpb:buttonedJeans_C0_T0
utpb:jeans_C0_T0               utpb:qualityJeans_C0_T0
utpb:jeans_C0_T0               utpb:labels_C0_T0
utpb:inspectedJeans_C0_T0      utpb:washedJeans_C0_T0
utpb:qualityJeans_C0_T0        utpb:buttonedJeans_C0_T0
UTPB Queries: sample query results (OPMO)

effectArtifact                 causeArtifact
utpb:denimParts_C0_T0          utpb:denim_C0_T0
utpb:rawJeans_C0_T0            utpb:denimParts_C0_T0
utpb:rawJeans_C0_T0            utpb:sewingThread_C0_T0
utpb:washedJeans_C0_T0         utpb:rawJeans_C0_T0
utpb:inspectedJeans_C0_T0      utpb:washedJeans_C0_T0
utpb:buttonedJeans_C0_T0       utpb:inspectedJeans_C0_T0
utpb:buttonedJeans_C0_T0       utpb:buttons_C0_T0
utpb:qualityJeans_C0_T0        utpb:buttonedJeans_C0_T0
utpb:jeans_C0_T0               utpb:qualityJeans_C0_T0
utpb:jeans_C0_T0               utpb:labels_C0_T0
utpb:inspectedJeans_C0_T0      utpb:washedJeans_C0_T0
utpb:qualityJeans_C0_T0        utpb:buttonedJeans_C0_T0
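The rows above are one-step derivations. OPM also defines a multi-step wasDerivedFrom relation as the transitive closure of the one-step relation, which is the kind of result an inference query has to return soundly and completely. The Python sketch below computes that closure over the distinct result rows. The function name is hypothetical, and the code illustrates the inference rule rather than how UTPB materializes inferences.

```python
# Sketch only: transitive closure of one-step derivations.
# Pairs are (effectArtifact, causeArtifact), taken from the result rows.

ONE_STEP = {
    ("utpb:denimParts_C0_T0", "utpb:denim_C0_T0"),
    ("utpb:rawJeans_C0_T0", "utpb:denimParts_C0_T0"),
    ("utpb:rawJeans_C0_T0", "utpb:sewingThread_C0_T0"),
    ("utpb:washedJeans_C0_T0", "utpb:rawJeans_C0_T0"),
    ("utpb:inspectedJeans_C0_T0", "utpb:washedJeans_C0_T0"),
    ("utpb:buttonedJeans_C0_T0", "utpb:inspectedJeans_C0_T0"),
    ("utpb:buttonedJeans_C0_T0", "utpb:buttons_C0_T0"),
    ("utpb:qualityJeans_C0_T0", "utpb:buttonedJeans_C0_T0"),
    ("utpb:jeans_C0_T0", "utpb:qualityJeans_C0_T0"),
    ("utpb:jeans_C0_T0", "utpb:labels_C0_T0"),
}


def multi_step_derivations(pairs):
    """Return every (effect, cause) pair reachable through one or more steps."""
    causes = {}
    for effect, cause in pairs:
        causes.setdefault(effect, set()).add(cause)

    closure = set()
    for start in causes:
        stack = list(causes[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:  # guards against revisiting shared ancestors
                continue
            seen.add(node)
            closure.add((start, node))
            stack.extend(causes.get(node, ()))
    return closure


if __name__ == "__main__":
    inferred = multi_step_derivations(ONE_STEP)
    # The finished jeans are derived, through several steps, from the denim.
    print(("utpb:jeans_C0_T0", "utpb:denim_C0_T0") in inferred)
```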
UTPB Queries: sample query results (OPMX)
[XML listing: a <result> root element with xmlns, xmlns:xsd, and xmlns:xsi declarations …]
Performance Metrics
- Data loading time
- Repository size
- Query response time
- Query soundness
- Query completeness
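The slide names the metrics without giving formulas. The Python sketch below shows one plausible way to compute three of them. It assumes that soundness is the fraction of returned answers that are correct and completeness is the fraction of expected answers that were returned; these definitions, and all function names, are assumptions made for illustration and may differ from the definitions in the paper.

```python
import time

# Sketch only: assumed metric definitions, not taken from the slides.

def soundness(returned, expected):
    """Fraction of returned answers that also appear in the expected answers."""
    returned, expected = set(returned), set(expected)
    if not returned:
        return 1.0  # nothing returned, so nothing incorrect was returned
    return len(returned & expected) / len(returned)


def completeness(returned, expected):
    """Fraction of expected answers that were actually returned."""
    returned, expected = set(returned), set(expected)
    if not expected:
        return 1.0  # nothing expected, so nothing is missing
    return len(returned & expected) / len(expected)


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds); usable for load or query time."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    expected = {"a", "b", "c"}
    returned, seconds = timed(lambda: {"a", "b", "x", "y"})
    print(soundness(returned, expected))     # 2 of 4 returned answers are correct
    print(completeness(returned, expected))  # 2 of 3 expected answers were found
    print(seconds >= 0.0)
```

Under these assumed definitions, a system that skips inference would tend to stay sound but lose completeness, while a system that over-infers would lose soundness.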
Interpretation of Benchmark Results
- Comparison across datasets of varying sizes
- Comparison using a fixed dataset
- Comparison across data serialized with different vocabularies (e.g., OPMV vs. OPMO)
- Comparison across data managed using different technologies (e.g., RDF vs. XML)
- Comparison across data of different provenance models (e.g., OPM vs. PROV-DM), planned for the future
Summary and Future Work
- UTPB: a first formal benchmark for scientific workflow provenance management systems
  - Extensible with new provenance templates
  - Flexible data generation
  - Large selection of test queries
  - Well-defined performance metrics
- Future work
  - Benchmarking existing systems using UTPB
  - Extending UTPB (functional requirements, PROV-DM, new metrics such as query expressiveness)
THANK YOU! Questions?
UTPB website:
My contact information: Artem Chebotko, Department of Computer Science, University of Texas – Pan American