UTPB: A Benchmark for Scientific Workflow Provenance Storage and Querying Systems Artem Chebotko Joint work with E. De Hoyos, C. Gomez, A. Kashlev, X.


1 UTPB: A Benchmark for Scientific Workflow Provenance Storage and Querying Systems
Artem Chebotko
Joint work with E. De Hoyos, C. Gomez, A. Kashlev, X. Lian, and C. Reilly
Department of Computer Science, University of Texas - Pan American
6th IEEE International Workshop on Scientific Workflows, June 24, 2012

2 Provenance in eScience
• Metadata that captures the history of an experiment
  • Problem diagnosis
  • Result interpretation
  • Experiment reproducibility
• Scientific Workflow Community Provenance Challenges
  • 2006: understanding and sharing information about provenance representations and capabilities
  • 2006: interoperability of different provenance systems
  • 2009: evaluating various aspects of OPM
  • 2010: showcasing OPM in the context of novel applications
• Open Provenance Model
• W3C Provenance Working Group
UTPB – University of Texas Provenance Benchmark

3 SWFMS and Provenance
• Systems: Taverna, Kepler, View, VisTrails, Pegasus, Swift, Galaxy, Triana, OPMProv, Karma, RDFProv, etc.
• Support provenance collection
• Use proprietary or third-party systems to manage provenance
• Differ in provenance models, provenance vocabularies, inference support, and query languages

4 Provenance Management Requirements
• Non-functional
  • Data storage and querying efficiency and scalability
  • Inference soundness and completeness
• Functional
  • Support of a particular provenance model, provenance vocabulary, query type, inference feature, visualization, and analysis
• No standard way to evaluate provenance systems with respect to these requirements

5 Provenance System Benchmarking Challenges
• Well-documented and easy-to-understand datasets
• Provenance data in a range of sizes
• Provenance data with predefined inferred results that are known to be correct and complete
• Test queries
• Performance metrics
• Result interpretation
• Existing empirical studies of provenance systems use ad hoc benchmarks or benchmarks developed in other research domains (see the paper for details)

6 Our Contributions
• University of Texas Provenance Benchmark (UTPB)
  • http://faculty.utpa.edu/chebotkoa/utpb/
• Focus on scalability and inference
• Flexible data generator
• 27 provenance templates
  • 3 virtual workflows
  • 3 workflow execution scenarios
  • 3 provenance vocabularies
• 27 test queries in 11 categories
• 5 performance metrics

7 Talk Outline
• University of Texas Provenance Benchmark
  • UTPB Architecture
  • Provenance Templates
  • Provenance Generation
  • UTPB Queries
  • Performance Metrics
  • Interpretation of Benchmark Results
• Summary and Future Work

8 UTPB Architecture

9 UTPB Architecture

10 Provenance Templates

11 Provenance Templates
• A provenance template is a document that serializes the provenance of one workflow execution according to a particular provenance model and provenance vocabulary.
• Provenance templates make the benchmark extensible and thus adaptable to the changing requirements of the field.
• UTPB currently supports:
  • 1 provenance model (OPM)
  • 3 virtual workflows
  • 3 provenance vocabularies (OPMV, OPMO, OPMX)
  • 3 workflow execution scenarios
  • 1 x 3 x 3 x 3 = 27 provenance templates
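The template count above is the product of the four dimensions. A minimal Python sketch of that enumeration follows; the dimension values are taken from the slides, while the dictionary layout and the function name `enumerate_templates` are assumptions made for illustration and are not part of UTPB.

```python
from itertools import product

# Dimension values as named in the slides.
MODELS = ["OPM"]
WORKFLOWS = ["Database Experiment", "Jeans Manufacturing", "French Press Coffee"]
VOCABULARIES = ["OPMV", "OPMO", "OPMX"]
SCENARIOS = [
    "successful execution",
    "incomplete execution with an error",
    "successful execution with materialized provenance inferences",
]


def enumerate_templates():
    """Return one record per (model, workflow, vocabulary, scenario) combination.

    The record layout is hypothetical; UTPB itself stores each template as a
    serialized provenance document.
    """
    return [
        {"model": m, "workflow": w, "vocabulary": v, "scenario": s}
        for m, w, v, s in product(MODELS, WORKFLOWS, VOCABULARIES, SCENARIOS)
    ]


if __name__ == "__main__":
    templates = enumerate_templates()
    print(len(templates))  # 1 x 3 x 3 x 3 = 27
```

Adding a second provenance model or a fourth vocabulary to the corresponding list would grow the template set multiplicatively, which is the sense in which templates make the benchmark extensible.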

12 Virtual Workflow 1
• Database Experiment
• Processes: 7
• Artifacts: 14
• Accounts: 2
• Agents: 1

13 Virtual Workflow 2
• Jeans Manufacturing
• Processes: 13
• Artifacts: 18
• Accounts: 3
• Agents: 2
• Several processes use and generate the same artifacts and are “executed” in parallel

14 Virtual Workflow 3
• French Press Coffee
• Processes: 15
• Artifacts: 15
• Accounts: 4
• Agents: 0
• Several branches with multiple processes are “executed” in parallel
• Several processes trigger each other without a record of using or generating artifacts

15 Provenance Vocabularies
• Almost every existing scientific workflow management system defines its own proprietary model for provenance.
• Each model is serialized in some format, such as RDF, XML, or relational data, according to one or more predefined vocabularies or schemas.
• Open Provenance Model (OPM): a layer of interoperability
  • OPM Vocabulary
  • OPM Ontology
  • OPM XML Schema

16 Workflow Execution Scenarios
• Successful execution
• Incomplete execution with an error
• Successful execution with materialized provenance inferences

17 Provenance Generation

18 Provenance Generation

19 Provenance Generation

20 Provenance Generation
OPMV

# Named graph: http://cs.panam.edu/utpb#opmGraph_C0_T0
@prefix opmv:.
@prefix rdf:.
@prefix rdfs:.
@prefix utpb:.

utpb:account_black_C0_T0 rdf:type.
utpb:cuttingMachine_C0_T0 rdf:type opmv:Artifact.
utpb:denim_C0_T0 rdfs:label "blue".
utpb:andrey_C0_T0 rdf:type opmv:Agent.
utpb:cutDenim_C0_T0 opmv:used utpb:cuttingMachine_C0_T0,
                              utpb:cuttingPattern_C0_T0,
                              utpb:denim_C0_T0.
utpb:denimParts_C0_T0 opmv:wasGeneratedBy utpb:cutDenim_C0_T0.

# Default graph.

21 Provenance Generation
OPMO

# Named graph: http://cs.panam.edu/utpb#opmGraph_C0_T0
@prefix opmo:.
@prefix opmv:.
@prefix rdf:.
@prefix rdfs:.
@prefix owl:.
@prefix utpb:.

utpb:account_black_C0_T0 rdf:type opmo:Account.
utpb:cuttingMachine_C0_T0 rdf:type opmv:Artifact.
utpb:propertyDenim_C0_T0 opmo:key utpb:keyDenimType_C0_T0 ;
                         opmo:value "blue".
utpb:andrey_C0_T0 rdf:type opmv:Agent.
utpb:used1_C0_T0 rdf:type opmo:Used ;
                 opmo:effect utpb:cutDenim_C0_T0 ;
                 opmo:cause utpb:cuttingMachine_C0_T0 ;
                 opmo:role utpb:roleMachine_C0_T0 ;
                 opmo:pname utpb:_used1 ;
                 opmo:account utpb:account_black_C0_T0.
utpb:wgb1_C0_T0 rdf:type opmo:WasGeneratedBy ;
                opmo:effect utpb:cutDenim_C0_T0 ;
                opmo:cause utpb:denimParts_C0_T0 ;
                opmo:role utpb:roleDenim_C0_T0 ;
                opmo:pname utpb:_wgb1 ;
                opmo:account utpb:account_black_C0_T0.

# Default graph.

22 Provenance Generation
OPMX

<utpb xmlns="http://openprovenance.org/model/opmx#"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:xsd="http://www.w3.org/2001/XMLSchema">
laser Cutting machine

23 UTPB Queries

24 UTPB Queries
• 27 queries
• 11 categories
  • Graphs
  • Dependencies
  • Artifacts
  • Processes
  • Accounts
  • Agents
  • Roles
  • Values
  • Cross-Graph Queries
  • Inferences
  • Application-Specific

25 UTPB Queries

26 UTPB Queries

27 UTPB Queries
Sample query, by type and format

Type: English
  Find all artifact derivation dependencies in a particular provenance graph

Type: SPARQL, Format: OPMV
  SELECT ?causeArtifact ?effectArtifact
  FROM NAMED
  WHERE {
    GRAPH utpb:opmGraph {
      ?effectArtifact opmv:wasDerivedFrom ?causeArtifact.
    }
  }

Type: SPARQL, Format: OPMO
  SELECT ?causeArtifact ?effectArtifact
  FROM NAMED
  WHERE {
    GRAPH utpb:opmGraph {
      ?wdf rdf:type opmo:WasDerivedFrom.
      ?wdf opmo:cause ?causeArtifact.
      ?wdf opmo:effect ?effectArtifact.
    }
  }

Type: XQuery, Format: OPMX
  declare default element namespace "http://openprovenance.org/model/opmx#";
  {
    for $wdf in /utpb/opmGraph[@id="opmGraph_C0_T0"]/dependencies/wasDerivedFrom
    return {$wdf/effect}{$wdf/cause}
  }

28 UTPB Queries
OPMV

effectArtifact             causeArtifact
---------------------------------------------------------
utpb:denimParts_C0_T0      utpb:denim_C0_T0
utpb:rawJeans_C0_T0        utpb:denimParts_C0_T0
utpb:rawJeans_C0_T0        utpb:sewingThread_C0_T0
utpb:washedJeans_C0_T0     utpb:rawJeans_C0_T0
utpb:inspectedJeans_C0_T0  utpb:washedJeans_C0_T0
utpb:buttonedJeans_C0_T0   utpb:inspectedJeans_C0_T0
utpb:buttonedJeans_C0_T0   utpb:buttons_C0_T0
utpb:qualityJeans_C0_T0    utpb:buttonedJeans_C0_T0
utpb:jeans_C0_T0           utpb:qualityJeans_C0_T0
utpb:jeans_C0_T0           utpb:labels_C0_T0
utpb:inspectedJeans_C0_T0  utpb:washedJeans_C0_T0
utpb:qualityJeans_C0_T0    utpb:buttonedJeans_C0_T0

29 UTPB Queries
OPMO

effectArtifact             causeArtifact
---------------------------------------------------------
utpb:denimParts_C0_T0      utpb:denim_C0_T0
utpb:rawJeans_C0_T0        utpb:denimParts_C0_T0
utpb:rawJeans_C0_T0        utpb:sewingThread_C0_T0
utpb:washedJeans_C0_T0     utpb:rawJeans_C0_T0
utpb:inspectedJeans_C0_T0  utpb:washedJeans_C0_T0
utpb:buttonedJeans_C0_T0   utpb:inspectedJeans_C0_T0
utpb:buttonedJeans_C0_T0   utpb:buttons_C0_T0
utpb:qualityJeans_C0_T0    utpb:buttonedJeans_C0_T0
utpb:jeans_C0_T0           utpb:qualityJeans_C0_T0
utpb:jeans_C0_T0           utpb:labels_C0_T0
utpb:inspectedJeans_C0_T0  utpb:washedJeans_C0_T0
utpb:qualityJeans_C0_T0    utpb:buttonedJeans_C0_T0

30 UTPB Queries
OPMX

<result xmlns="http://openprovenance.org/model/opmx#"
        xmlns:xsd="http://www.w3.org/2001/XMLSchema"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
…
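The sample results list one-step artifact derivations only. The Inferences query category and the scenario with materialized provenance inferences suggest that a system under test may also need multi-step derivations. The Python sketch below computes them as the transitive closure of the one-step pairs, which is how OPM defines multi-step wasDerivedFrom. It is an illustration rather than UTPB code: it ignores accounts, drops the `utpb:` prefix and `_C0_T0` suffix for brevity, and the function name `multi_step_derivations` is hypothetical.

```python
from collections import defaultdict

# Distinct one-step (effect, cause) pairs copied from the sample query result.
DIRECT = [
    ("denimParts", "denim"),
    ("rawJeans", "denimParts"),
    ("rawJeans", "sewingThread"),
    ("washedJeans", "rawJeans"),
    ("inspectedJeans", "washedJeans"),
    ("buttonedJeans", "inspectedJeans"),
    ("buttonedJeans", "buttons"),
    ("qualityJeans", "buttonedJeans"),
    ("jeans", "qualityJeans"),
    ("jeans", "labels"),
]


def multi_step_derivations(direct):
    """Return every (effect, cause) pair reachable over one or more steps."""
    causes = defaultdict(set)
    for effect, cause in direct:
        causes[effect].add(cause)

    closure = set()
    for start in list(causes):
        # Depth-first walk over the causes of `start`.
        stack = list(causes[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            stack.extend(causes.get(node, ()))
    return closure


if __name__ == "__main__":
    inferred = multi_step_derivations(DIRECT)
    # The finished jeans are (indirectly) derived from the raw denim.
    print(("jeans", "denim") in inferred)
```

A benchmark dataset with materialized inferences would store such pairs explicitly, so a system without inference support can still be checked against the same expected results.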

31 Performance Metrics

32 Performance Metrics
• Data loading time
• Repository size
• Query response time
• Query soundness
• Query completeness
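The slides name the metrics without defining them. The Python sketch below assumes the common precision/recall-style reading: soundness is the fraction of returned answers that are correct, and completeness is the fraction of expected answers that were returned. These definitions, the function names, and the treatment of empty result sets are assumptions for illustration; the paper should be consulted for the exact formulas.

```python
def soundness(returned, expected):
    """Fraction of returned answers that appear in the expected answer set.

    Assumption: an empty result contains no wrong answers, so it scores 1.0.
    """
    returned, expected = set(returned), set(expected)
    if not returned:
        return 1.0
    return len(returned & expected) / len(returned)


def completeness(returned, expected):
    """Fraction of expected answers that the system actually returned.

    Assumption: if nothing is expected, any result is trivially complete.
    """
    returned, expected = set(returned), set(expected)
    if not expected:
        return 1.0
    return len(returned & expected) / len(expected)


if __name__ == "__main__":
    # Hypothetical answer sets for one query.
    expected = {"a", "b", "c", "d"}
    returned = {"a", "b", "x"}
    print(soundness(returned, expected))     # 2 of 3 returned answers are correct
    print(completeness(returned, expected))  # 2 of 4 expected answers were found
```

Because UTPB ships provenance data with predefined inferred results that are known to be correct and complete, the `expected` set can come from the benchmark itself rather than from a reference system.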

33 Interpretation of Benchmark Results

34 Interpretation of Benchmark Results
• Comparison across datasets of varying sizes
• Comparison using a fixed dataset
• Comparison across data serialized with different vocabularies (e.g., OPMV vs. OPMO)
• Comparison across data managed using different technologies (e.g., RDF vs. XML)
• Comparison across data of different provenance models (e.g., OPM vs. PROV-DM), planned for the future

35 Summary and Future Work

36 Summary and Future Work
• UTPB: a first formal benchmark for scientific workflow provenance management systems
  • Extensible with new provenance templates
  • Flexible data generation
  • Large selection of test queries
  • Well-defined performance metrics
• Future work
  • Benchmarking existing systems using UTPB
  • Extending UTPB (functional requirements, PROV-DM, new metrics such as query expressiveness)

37 THANK YOU! Questions?
• UTPB website: http://faculty.utpa.edu/chebotkoa/utpb/
• My contact information:
  • Artem Chebotko, Department of Computer Science, University of Texas – Pan American
  • chebotkoa@utpa.edu
  • http://www.cs.panam.edu/~artem

