UTPB: A Benchmark for Scientific Workflow Provenance Storage and Querying Systems Artem Chebotko Joint work with E. De Hoyos, C. Gomez, A. Kashlev, X.

Presentation transcript:

UTPB: A Benchmark for Scientific Workflow Provenance Storage and Querying Systems
Artem Chebotko
Joint work with E. De Hoyos, C. Gomez, A. Kashlev, X. Lian, and C. Reilly
Department of Computer Science, University of Texas - Pan American
6th IEEE International Workshop on Scientific Workflows, June 24, 2012

Provenance in eScience
- Metadata that captures the history of an experiment
  - Problem diagnosis
  - Result interpretation
  - Experiment reproducibility
- Scientific Workflow Community Provenance Challenges
  - 2006: understanding and sharing information about provenance representations and capabilities
  - 2006: interoperability of different provenance systems
  - 2009: evaluating various aspects of OPM
  - 2010: showcasing OPM in the context of novel applications
- Open Provenance Model
- W3C Provenance Working Group
UTPB – University of Texas Provenance Benchmark

SWFMS and Provenance
- Taverna, Kepler, View, VisTrails, Pegasus, Swift, Galaxy, Triana, OPMProv, Karma, RDFProv, etc.
- Support provenance collection
- Use proprietary or third-party systems to manage provenance
- Differ in provenance models, provenance vocabularies, inference support, and query languages

Provenance Management Requirements
- Non-functional
  - Data storage and querying efficiency and scalability
  - Inference soundness and completeness
- Functional
  - Support of a particular provenance model, provenance vocabulary, query type, inference feature, visualization and analysis
- No standard way to evaluate provenance systems with respect to these requirements

Provenance System Benchmarking Challenges
- Well-documented and easy-to-understand datasets
- Provenance data in a range of sizes
- Provenance data with predefined inferred results that are known to be correct and complete
- Test queries
- Performance metrics
- Result interpretation
- Existing empirical studies of provenance systems use ad-hoc benchmarks or benchmarks developed in other research domains (see the paper for details)

Our Contributions
- University of Texas Provenance Benchmark (UTPB)
  - Focus on scalability and inference
  - Flexible data generator
  - 27 provenance templates
    - 3 virtual workflows
    - 3 workflow execution scenarios
    - 3 provenance vocabularies
  - 27 test queries in 11 categories
  - 5 performance metrics

Talk Outline
- University of Texas Provenance Benchmark
  - UTPB Architecture
  - Provenance Templates
  - Provenance Generation
  - UTPB Queries
  - Performance Metrics
  - Interpretation of Benchmark Results
- Summary and Future Work

UTPB Architecture

Provenance Templates
- A provenance template is a document that serializes provenance of one workflow execution according to a particular provenance model and a provenance vocabulary.
- Provenance templates make the benchmark extensible and thus adaptable to the changing requirements of the field.
- UTPB currently supports:
  - 1 provenance model (OPM)
  - 3 virtual workflows
  - 3 provenance vocabularies (OPMV, OPMO, OPMX)
  - 3 workflow execution scenarios
- 1 x 3 x 3 x 3 = 27 provenance templates
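The template count follows from taking every combination of model, workflow, vocabulary, and execution scenario. A minimal Python sketch of that enumeration is shown here; the short labels for the workflows and scenarios are illustrative names chosen for this example, not identifiers used by UTPB.

```python
from itertools import product

# Dimensions listed on the slide; the short labels are illustrative only.
MODELS = ["OPM"]
WORKFLOWS = ["database_experiment", "jeans_manufacturing", "french_press_coffee"]
VOCABULARIES = ["OPMV", "OPMO", "OPMX"]
SCENARIOS = ["success", "error", "success_with_inferences"]


def enumerate_templates():
    """Return one (model, workflow, vocabulary, scenario) tuple per template."""
    return list(product(MODELS, WORKFLOWS, VOCABULARIES, SCENARIOS))


templates = enumerate_templates()
print(len(templates))  # 1 x 3 x 3 x 3 combinations
```

Adding a fourth vocabulary or a second provenance model to the lists would grow the set of templates multiplicatively, which is one way to read the extensibility claim on this slide.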

Virtual Workflow 1
- Database Experiment
- Processes: 7
- Artifacts: 14
- Accounts: 2
- Agents: 1

Virtual Workflow 2
- Jeans Manufacturing
- Processes: 13
- Artifacts: 18
- Accounts: 3
- Agents: 2
- Several processes use and generate the same artifacts and are "executed" in parallel

Virtual Workflow 3
- French Press Coffee
- Processes: 15
- Artifacts: 15
- Accounts: 4
- Agents: 0
- Several branches with multiple processes are "executed" in parallel
- Several processes trigger each other without the record of using or generating artifacts

Provenance Vocabularies
- Almost every existing scientific workflow management system defines its own proprietary model for provenance.
- Each model is serialized in some format, such as RDF, XML, or relational data, according to one or more predefined vocabularies or schemas.
- Open Provenance Model (OPM) – a layer of interoperability
  - OPM Vocabulary (OPMV)
  - OPM Ontology (OPMO)
  - OPM XML Schema (OPMX)

Workflow Execution Scenarios
- Successful execution
- Incomplete execution with an error
- Successful execution with materialized provenance inferences
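To make "materialized provenance inferences" concrete: in OPM, multi-step derivation is the transitive closure of one-step derivation edges, and materializing it means storing the inferred edges next to the asserted ones. The slides do not list which inferences UTPB materializes, so the Python sketch below only illustrates this one rule on a toy graph, with (effect, cause) pairs standing in for wasDerivedFrom edges.

```python
def transitive_closure(edges):
    """Return edges plus every pair reachable by chaining them (naive fixpoint)."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # a derived from b, and b derived from d, so a derived from d.
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Asserted one-step derivations as (effect, cause) pairs; names are shortened
# versions of artifacts from the jeans workflow.
asserted = {("rawJeans", "denimParts"), ("denimParts", "denim")}
materialized = transitive_closure(asserted)
inferred = materialized - asserted
print(sorted(inferred))
```

A system without inference support can still answer multi-step queries over the materialized data, which is presumably why this scenario is kept separate from plain successful execution.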

Provenance Generation (OPMV)
# Named graph: utpb:.
utpb:account_black_C0_T0 rdf:type.
utpb:cuttingMachine_C0_T0 rdf:type opmv:Artifact.
utpb:denim_C0_T0 rdfs:label "blue".
utpb:andrey_C0_T0 rdf:type opmv:Agent.
utpb:cutDenim_C0_T0 opmv:used utpb:cuttingMachine_C0_T0,
    utpb:cuttingPattern_C0_T0,
    utpb:denim_C0_T0.
utpb:denimParts_C0_T0 opmv:wasGeneratedBy utpb:cutDenim_C0_T0.
# Default graph.

Provenance Generation (OPMO)
# Named graph: utpb:.
utpb:account_black_C0_T0 rdf:type opmo:Account.
utpb:cuttingMachine_C0_T0 rdf:type opmv:Artifact.
utpb:propertyDenim_C0_T0 opmo:key utpb:keyDenimType_C0_T0 ;
    opmo:value "blue".
utpb:andrey_C0_T0 rdf:type opmv:Agent.
utpb:used1_C0_T0 rdf:type opmo:Used ;
    opmo:effect utpb:cutDenim_C0_T0 ;
    opmo:cause utpb:cuttingMachine_C0_T0 ;
    opmo:role utpb:roleMachine_C0_T0 ;
    opmo:pname utpb:_used1 ;
    opmo:account utpb:account_black_C0_T0.
utpb:wgb1_C0_T0 rdf:type opmo:WasGeneratedBy ;
    opmo:effect utpb:cutDenim_C0_T0 ;
    opmo:cause utpb:denimParts_C0_T0 ;
    opmo:role utpb:roleDenim_C0_T0 ;
    opmo:pname utpb:_wgb1 ;
    opmo:account utpb:account_black_C0_T0.
# Default graph.
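Every identifier in the generated excerpts carries a suffix such as `_C0_T0`. One plausible way for a generator to stamp out many graphs from a single template is to rewrite that suffix for each instance. The Python sketch below assumes this mechanism and treats the two numbers as opaque counters; the slides do not say how the UTPB generator works internally or what C and T stand for, and the `instantiate` and `generate` helpers are hypothetical.

```python
import re

# A few statements in the style of the OPMV excerpt, used as a toy template.
TEMPLATE = """\
utpb:cuttingMachine_C0_T0 rdf:type opmv:Artifact .
utpb:andrey_C0_T0 rdf:type opmv:Agent .
utpb:denimParts_C0_T0 opmv:wasGeneratedBy utpb:cutDenim_C0_T0 .
"""

# Matches the placeholder suffix at the end of an identifier.
SUFFIX = re.compile(r"_C0_T0\b")


def instantiate(template: str, c: int, t: int) -> str:
    """Return a copy of the template with every _C0_T0 suffix rewritten."""
    return SUFFIX.sub(f"_C{c}_T{t}", template)


def generate(template: str, count: int):
    """Yield `count` instances, numbering the second counter 0..count-1.

    The numbering scheme is an assumption made for this sketch.
    """
    for t in range(count):
        yield instantiate(template, 0, t)


graphs = list(generate(TEMPLATE, 3))
print(graphs[2])
```

Because each instance gets distinct identifiers, dataset size can be scaled simply by raising `count`, which fits the benchmark's stated focus on scalability.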

Provenance Generation (OPMX)
XML excerpt rooted at a utpb element; only the values "laser" and "Cutting machine" survive in this transcript.

UTPB Queries
- 27 Queries
- 11 Categories
  - Graphs
  - Dependencies
  - Artifacts
  - Processes
  - Accounts
  - Agents
  - Roles
  - Values
  - Cross-Graph Queries
  - Inferences
  - Application-Specific

UTPB Queries
Sample query, by type and format:
English:
    Find all artifact derivation dependencies in a particular provenance graph
SPARQL (OPMV):
    SELECT ?causeArtifact ?effectArtifact
    FROM NAMED
    WHERE { GRAPH utpb:opmGraph {
        ?effectArtifact opmv:wasDerivedFrom ?causeArtifact. } }
SPARQL (OPMO):
    SELECT ?causeArtifact ?effectArtifact
    FROM NAMED
    WHERE { GRAPH utpb:opmGraph {
        ?wdf rdf:type opmo:WasDerivedFrom.
        ?wdf opmo:cause ?causeArtifact.
        ?wdf opmo:effect ?effectArtifact. } }
XQuery (OPMX):
    declare default element namespace " { for $wdf in return {$wdf/effect}{$wdf/cause} }
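The two SPARQL variants differ because OPMV states a derivation as a single triple, while OPMO reifies it as a node with `opmo:cause` and `opmo:effect` properties. The Python sketch below mimics both graph patterns over plain tuples, without any RDF library, to show that they select the same (cause, effect) pairs. The `utpb:wdf1` and `utpb:wdf2` node names are invented for this example.

```python
# Triples as (subject, predicate, object) tuples; no RDF library is assumed.
OPMV_TRIPLES = [
    ("utpb:denimParts_C0_T0", "opmv:wasDerivedFrom", "utpb:denim_C0_T0"),
    ("utpb:rawJeans_C0_T0", "opmv:wasDerivedFrom", "utpb:denimParts_C0_T0"),
]

# The same two dependencies in reified form; wdf1/wdf2 are hypothetical names.
OPMO_TRIPLES = [
    ("utpb:wdf1", "rdf:type", "opmo:WasDerivedFrom"),
    ("utpb:wdf1", "opmo:cause", "utpb:denim_C0_T0"),
    ("utpb:wdf1", "opmo:effect", "utpb:denimParts_C0_T0"),
    ("utpb:wdf2", "rdf:type", "opmo:WasDerivedFrom"),
    ("utpb:wdf2", "opmo:cause", "utpb:denimParts_C0_T0"),
    ("utpb:wdf2", "opmo:effect", "utpb:rawJeans_C0_T0"),
]


def derivations_opmv(triples):
    """One triple per dependency: ?effect opmv:wasDerivedFrom ?cause."""
    return {(o, s) for s, p, o in triples if p == "opmv:wasDerivedFrom"}


def derivations_opmo(triples):
    """Three triples per dependency, joined on the reified edge node."""
    nodes = {s for s, p, o in triples
             if p == "rdf:type" and o == "opmo:WasDerivedFrom"}
    cause = {s: o for s, p, o in triples if p == "opmo:cause"}
    effect = {s: o for s, p, o in triples if p == "opmo:effect"}
    return {(cause[n], effect[n]) for n in nodes if n in cause and n in effect}


print(derivations_opmv(OPMV_TRIPLES) == derivations_opmo(OPMO_TRIPLES))
```

The OPMO form needs a three-way join per dependency where OPMV needs a single pattern match, which is the kind of difference a comparison across vocabularies is meant to expose.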

UTPB Queries (OPMV)
effectArtifact                 causeArtifact
utpb:denimParts_C0_T0          utpb:denim_C0_T0
utpb:rawJeans_C0_T0            utpb:denimParts_C0_T0
utpb:rawJeans_C0_T0            utpb:sewingThread_C0_T0
utpb:washedJeans_C0_T0         utpb:rawJeans_C0_T0
utpb:inspectedJeans_C0_T0      utpb:washedJeans_C0_T0
utpb:buttonedJeans_C0_T0       utpb:inspectedJeans_C0_T0
utpb:buttonedJeans_C0_T0       utpb:buttons_C0_T0
utpb:qualityJeans_C0_T0        utpb:buttonedJeans_C0_T0
utpb:jeans_C0_T0               utpb:qualityJeans_C0_T0
utpb:jeans_C0_T0               utpb:labels_C0_T0
utpb:inspectedJeans_C0_T0      utpb:washedJeans_C0_T0
utpb:qualityJeans_C0_T0        utpb:buttonedJeans_C0_T0

UTPB Queries (OPMO)
effectArtifact                 causeArtifact
utpb:denimParts_C0_T0          utpb:denim_C0_T0
utpb:rawJeans_C0_T0            utpb:denimParts_C0_T0
utpb:rawJeans_C0_T0            utpb:sewingThread_C0_T0
utpb:washedJeans_C0_T0         utpb:rawJeans_C0_T0
utpb:inspectedJeans_C0_T0      utpb:washedJeans_C0_T0
utpb:buttonedJeans_C0_T0       utpb:inspectedJeans_C0_T0
utpb:buttonedJeans_C0_T0       utpb:buttons_C0_T0
utpb:qualityJeans_C0_T0        utpb:buttonedJeans_C0_T0
utpb:jeans_C0_T0               utpb:qualityJeans_C0_T0
utpb:jeans_C0_T0               utpb:labels_C0_T0
utpb:inspectedJeans_C0_T0      utpb:washedJeans_C0_T0
utpb:qualityJeans_C0_T0        utpb:buttonedJeans_C0_T0

UTPB Queries (OPMX)
XML result document rooted at a result element; its content is elided in this transcript.

Performance Metrics
- Data loading time
- Repository size
- Query response time
- Query soundness
- Query completeness
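The slides name the metrics without defining them. The Python sketch below shows one common reading, stated here as an assumption: soundness is the share of returned rows that belong to the expected answer, completeness is the share of expected rows that were returned, and response time is wall-clock time around the query call. The `timed` helper and the toy rows are illustrative only.

```python
import time


def soundness(returned, expected):
    """Fraction of returned rows that are correct (assumed definition)."""
    returned, expected = set(returned), set(expected)
    if not returned:
        return 1.0  # nothing returned, so nothing incorrect was returned
    return len(returned & expected) / len(returned)


def completeness(returned, expected):
    """Fraction of expected rows that were returned (assumed definition)."""
    returned, expected = set(returned), set(expected)
    if not expected:
        return 1.0
    return len(returned & expected) / len(expected)


def timed(query_fn):
    """Run a zero-argument query function and return (rows, seconds)."""
    start = time.perf_counter()
    rows = query_fn()
    return rows, time.perf_counter() - start


# Toy data: the system returns one wrong row and misses one expected row.
expected = {("a", "b"), ("b", "c"), ("a", "c")}
rows, seconds = timed(lambda: [("a", "b"), ("b", "c"), ("x", "y")])
print(soundness(rows, expected), completeness(rows, expected), seconds)
```

Under these definitions the predefined inferred results mentioned earlier play the role of `expected`: a system without inference support would score below 1.0 on completeness for inference queries while remaining fully sound.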

Interpretation of Benchmark Results
- Comparison across datasets of varying sizes
- Comparison using a fixed dataset
- Comparison across data serialized with different vocabularies (e.g., OPMV vs. OPMO)
- Comparison across data managed using different technologies (e.g., RDF vs. XML)
- Comparison across data of different provenance models (e.g., OPM vs. PROV-DM), in the future

Summary and Future Work
- UTPB: A first formal benchmark for scientific workflow provenance management systems
  - Extensible with new provenance templates
  - Flexible data generation
  - Large selection of test queries
  - Well-defined performance metrics
- Future work
  - Benchmarking existing systems using UTPB
  - Extending UTPB (functional requirements, PROV-DM, new metrics such as query expressiveness)

THANK YOU! Questions?
- UTPB website:
- My contact information: Artem Chebotko, Department of Computer Science, University of Texas – Pan American