Provenance: an open approach to experiment validation in e-Science. Professor Luc Moreau, University of Southampton


Provenance & PASOA Teams. University of Southampton: Luc Moreau, Paul Groth, Simon Miles, Victor Tan, Miguel Branco, Sofia Tsasakou, Sheng Jiang, Steve Munroe, Zheng Chen. IBM UK (EU Project Coordinator): John Ibbotson, Neil Hardman, Alexis Biller. University of Wales, Cardiff: Omer Rana, Arnaud Contes, Vikas Deora, Ian Wootten, Shrija Rajbhandari. Universitat Politècnica de Catalunya (UPC): Steven Willmott, Javier Vazquez. SZTAKI: Laszlo Varga, Arpad Andics, Tamas Kifor. German Aerospace: Andreas Schreiber, Guy Kloss, Frank Danneman.

Contents Motivation Provenance Concepts Provenance Architecture Standardisation Provenance Queries Conclusions

Motivation

Scientific Research Academic Peer Review

Audit & Business Regulations. Audit: Sarbanes-Oxley; Basel II; European Rec. R(97)5 (protection of medical data); … Domains: Accounting, Banking, Healthcare.

e-Science datasets: how to undertake peer reviewing and validation of e-Scientific results?

Sarbanes-Oxley. The American Competitiveness and Corporate Accountability Act of 2002, commonly known as the Sarbanes-Oxley Act, was signed into law on July 30, 2002. The law is intended to protect investors by improving the accuracy and reliability of corporate disclosures. Sarbanes-Oxley also defines a higher level of responsibility, accountability, and financial reporting transparency: changes that are intended to return confidence to investors, as well.

Food & Drug Administration

Basel II

Compliance with Regulations. The “next-compliance” problem: can we be certain that, by ensuring compliance with a new regulation, we do not break previous compliance?

Current Solutions: proprietary, monolithic silos, closed. They do not inter-operate with other applications and are not adaptable to new regulations.

Provenance. Oxford English Dictionary: (1) the fact of coming from some particular source or quarter; origin, derivation; (2) the history or pedigree of a work of art, manuscript, rare book, etc.; concretely, a record of the passage of an item through its various owners. Concept vs representation.

Provenance in Computer Systems. Our definition of provenance in the context of applications for which process matters to end users: the provenance of a piece of data is the process that led to that piece of data. Our aim is to conceive a computer-based representation of provenance that allows us to perform useful analysis and reasoning to support our use cases.

Our Approach: define core concepts pertaining to provenance; specify functionality required to become “provenance-aware”; define open data models and protocols that allow systems to inter-operate; standardise data models and protocols; provide a reference implementation; provide reasoning capability.

Context (1). Aerospace engineering: maintain a historical record of design processes, up to 99 years. Organ transplant management: tracking of previous decisions, crucial to maximise the efficiency in matching and the recovery rate of patients.

Context (2). High Energy Physics: tracking, analysing and verifying data sets in the ATLAS Experiment of the Large Hadron Collider (CERN). Bioinformatics: verification and auditing of “experiments” (e.g. for drug approval).

Provenance Concepts

Provenance “Lifecycle” (diagram): Application, Data, Results, Provenance Store. Core Interfaces to Provenance Store: Record Documentation of Execution; Query and Reason over Provenance of Data; Administer Store and its contents.

Nature of Documentation We represent the provenance of some data by documenting the process that led to the data: documentation can be complete or partial; it can be accurate or inaccurate; it can present conflicting or consensual views of the actors involved; it can provide operational details of execution or it can be abstract.

p-assertion. A given element of process documentation is referred to as a p-assertion: an assertion that is made by an actor and pertains to a process.

Service Oriented Architecture. Broad definition of a service as a component that takes some inputs and produces some outputs. Services are brought together to solve a given problem, typically via a workflow definition that specifies their composition. Interactions with services take place through messages that are constructed according to the services' interface specifications. The term actor denotes either a client or a service in a SOA. A process is defined as the execution of a workflow.

Process Documentation (1). Diagram: messages M1, M2, M3, M4 exchanged by Actor 1 and Actor 2. Actor 1: “I received M1, M4”; “I sent M2, M3”. Actor 2: “I received M3”; “I sent M4”. From these p-assertions, we can derive that M3 was sent by Actor 1 and received by Actor 2 (and likewise for M4). If actors are black boxes, these assertions are not very useful because we do not know the dependencies between messages.

Process Documentation (2). Diagram: messages M1, M2, M3, M4 exchanged by Actor 1 and Actor 2. Assertions: M2 is in reply to M1; M3 is caused by M1; M2 is caused by M4; M4 is in reply to M3. These assertions help identify the order of messages, but not how data was computed.

Process Documentation (3). Diagram: messages M1, M2, M3, M4 exchanged by Actor 1 and Actor 2, with functions f, f1, f2. Assertions: M3 = f1(M1); M2 = f2(M1, M4); M4 = f(M3). These assertions help identify how data is computed, but provide no information about non-functional characteristics of the computation (time, resources used, etc.).

Process Documentation (4). Diagram: messages M1, M2, M3, M4 exchanged by Actor 1 and Actor 2. Assertions: “I used 386 cluster”; “Request sat in queue for 6min”; “I used sparc processor”; “I used algorithm x version x.y.z”.

Types of p-assertions (1). Interaction p-assertion: an assertion of the contents of a message by an actor that has sent or received that message. Examples: “I received M1, M4”; “I sent M2, M3”.

Types of p-assertions (2). Relationship p-assertion: an assertion, made by an actor, that describes how the actor obtained an output message sent in an interaction by applying some function to input messages from other interactions (likewise for data). Examples: M2 is in reply to M1; M3 is caused by M1; M2 is caused by M4; M3 = f1(M1); M2 = f2(M1, M4).

Types of p-assertions (3). Actor state p-assertion: an assertion made by an actor about its internal state in the context of a specific interaction. Examples: “I used sparc processor”; “I used algorithm x version x.y.z”.
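The three kinds of p-assertion can be sketched as plain records. The following is an illustrative Python sketch only, not the project's data model: the class names, field names and the helper `derive_exchanges` are assumptions made for this example. It replays the derivation from the Process Documentation (1) slide, pairing each "I sent" assertion with the matching "I received" assertion.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple, Union


@dataclass(frozen=True)
class InteractionPAssertion:
    """An actor asserts that it sent or received a message."""
    asserter: str
    message: str
    view: str  # "sender" or "receiver"


@dataclass(frozen=True)
class RelationshipPAssertion:
    """An actor asserts how an output was obtained from some inputs."""
    asserter: str
    effect: str
    relation: str
    causes: Tuple[str, ...]


@dataclass(frozen=True)
class ActorStatePAssertion:
    """An actor asserts something about its internal state in an interaction."""
    asserter: str
    message: str
    state: str


PAssertion = Union[InteractionPAssertion, RelationshipPAssertion, ActorStatePAssertion]


def derive_exchanges(assertions: Iterable[PAssertion]) -> Dict[str, Tuple[str, str]]:
    """Map each message documented by both parties to (sender, receiver)."""
    senders: Dict[str, str] = {}
    receivers: Dict[str, str] = {}
    for a in assertions:
        if isinstance(a, InteractionPAssertion):
            target = senders if a.view == "sender" else receivers
            target[a.message] = a.asserter
    return {m: (senders[m], receivers[m]) for m in senders if m in receivers}


documentation = [
    InteractionPAssertion("Actor 1", "M1", "receiver"),
    InteractionPAssertion("Actor 1", "M4", "receiver"),
    InteractionPAssertion("Actor 1", "M2", "sender"),
    InteractionPAssertion("Actor 1", "M3", "sender"),
    InteractionPAssertion("Actor 2", "M3", "receiver"),
    InteractionPAssertion("Actor 2", "M4", "sender"),
    RelationshipPAssertion("Actor 1", "M3", "f1", ("M1",)),
    # Assumption: the slides do not say which actor made this state
    # assertion; Actor 2 is an arbitrary choice for illustration.
    ActorStatePAssertion("Actor 2", "M3", "I used sparc processor"),
]

exchanges = derive_exchanges(documentation)
print(exchanges)
```

M1 and M2 do not appear in the result because only one party documented them in this example.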

Data flow. Interaction p-assertions allow us to specify a flow of data between actors. Relationship p-assertions allow us to characterise the flow of data “inside” an actor. The overall data flow (internal + external) constitutes a DAG, which characterises the process that led to a result.
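As a rough illustration of the DAG, the relationships from the Process Documentation (3) slide (M3 = f1(M1), M4 = f(M3), M2 = f2(M1, M4)) can be stored as a mapping from each item to the items it was computed from. This is a sketch under assumptions: the names `DEPENDS_ON` and `ancestors` are hypothetical, and messages stand in for the data they carry. The standard-library `graphlib` module (Python 3.9+) raises `CycleError` if the documented flow is not acyclic.

```python
from graphlib import TopologicalSorter
from typing import Dict, Set

# effect -> the items it was directly computed from
DEPENDS_ON: Dict[str, Set[str]] = {
    "M3": {"M1"},        # M3 = f1(M1)
    "M4": {"M3"},        # M4 = f(M3)
    "M2": {"M1", "M4"},  # M2 = f2(M1, M4)
}


def ancestors(item: str, depends_on: Dict[str, Set[str]]) -> Set[str]:
    """Return every item that `item` transitively depends on."""
    found: Set[str] = set()
    stack = [item]
    while stack:
        for cause in depends_on.get(stack.pop(), ()):
            if cause not in found:
                found.add(cause)
                stack.append(cause)
    return found


# Causes are listed before their effects; a cycle would raise CycleError.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(order)
print(ancestors("M2", DEPENDS_ON))
```

The set returned by `ancestors("M2", ...)` is one simple reading of "the process that led to" M2.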

Provenance Architecture

Interfaces to Provenance Store (diagram): Application, Results, Provenance Store. Interfaces: Record Documentation of Execution; Query and Reason over Provenance of Data; Administer Store and its contents.

P-Assertion schemas

The p-structure (1). The p-structure is a common logical structure of the provenance store shared by all asserting and querying actors. It is hierarchical and indexed by interactions (one interaction = one message exchange), with a sender’s view and a receiver’s view.

The p-structure (2). Diagram labels: all p-assertions asserted by a given actor participating in an interaction; asserter identity.
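One way to picture the hierarchy is as nested mappings: interaction, then view, then asserter, then that asserter's p-assertions. The sketch below is only a minimal in-memory illustration. The class `PStructure` and its methods are hypothetical names, p-assertions are reduced to strings, and the real store exposes service interfaces rather than a Python object.

```python
from collections import defaultdict
from typing import Dict, List

VIEWS = ("sender", "receiver")


class PStructure:
    """Toy p-structure: interaction -> view -> asserter -> p-assertions."""

    def __init__(self) -> None:
        self._records: Dict[str, Dict[str, Dict[str, List[str]]]] = {}

    def record(self, interaction: str, view: str, asserter: str, p_assertion: str) -> None:
        if view not in VIEWS:
            raise ValueError(f"view must be one of {VIEWS}")
        views = self._records.setdefault(
            interaction, {v: defaultdict(list) for v in VIEWS}
        )
        views[view][asserter].append(p_assertion)

    def view(self, interaction: str, view: str) -> Dict[str, List[str]]:
        """Return a copy of one view of an interaction, keyed by asserter."""
        views = self._records.get(interaction)
        if views is None:
            return {}
        return {asserter: list(items) for asserter, items in views[view].items()}


store = PStructure()
store.record("M3", "sender", "Actor 1", "I sent M3")
store.record("M3", "sender", "Actor 1", "M3 = f1(M1)")
store.record("M3", "receiver", "Actor 2", "I received M3")
print(store.view("M3", "sender"))
```

Indexing by interaction keeps the sender's and the receiver's accounts of the same message exchange side by side, so they can be compared when the provenance is queried.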

Recording Protocol (Groth04-06). Abstract machines. DS properties: termination, liveness, safety, statelessness. Documentation properties: immutability, attribution, datatype safety. Foundation for adding the necessary cryptographic techniques.

Querying Functionality (Miles06). Process Documentation Query Interface: allows for “navigation” of the documentation of execution. It allows us to view the provenance store (i.e. the p-structure) as if it contained XML data structures, independently of the technology used for running the application and of the internal store representation. It offers seamless navigation of application-dependent and application-independent process documentation.

Querying Functionality (Miles06). Provenance Query Interface: allows us to obtain the provenance of some specific data. This recognises that there is not “one” provenance for a piece of data: there may be several, depending on the end-user’s interest. Hence, provenance is seen as the result of a query that (1) identifies a piece of data at a specific execution point and (2) gives the scope of the process of interest, filtering p-assertions in or out according to actors, process, types of relationships, etc.

Available Software. PReServ (Paul Groth & Simon Miles): offers recording and querying interfaces. Available from OGSA-DAI based version available from Is being used in a bioinformatics application (cf. hpdc’05, iswc’05).

Provenance Store Components (diagram; slide from John Ibbotson). Labels shown: Actor, CSL, ProvenanceStoreFactory (Factory), ProvenanceService (Destroy, Record, PQuery, XPath, XQuery, Iterate Resources), ProvenanceServiceResourceHome, ProvenanceServiceResource, ProvenanceStoreResource, PStoreDatabase, OGSA-DAI Client API, OGSA-DAI, eXist XML Database, Globus GT4 Container, External Security Services. Relations shown: Uses, Manages.

Provenance Store Security (diagram; slide from John Ibbotson). Labels shown: Actor, CSL, Provenance GT4 Container, ProvenanceStoreFactory (Factory), ProvenanceService (Destroy, Record, PQuery, XPath, XQuery, Iterate Resources), Policy Decision Point, ACL File (XML). Flow labels: Request, Approve, Deny.

Provenance Implementation. The Client Side Library exposes Provenance Store functionality and separates the Actor from alternative server-side implementations: the EU Provenance project implementation and PASOA PReServ. Security is being extended to allow federation using the Globus Community Authorization Service (CAS). Slide from John Ibbotson.

Standardisation

Standardisation Options. APIs: programmatic inter-op. Recording and querying interfaces: service inter-op. Provenance model: data inter-op.

Purpose of Standardisation. Allow for multiple applications to document their execution (Record Documentation of Execution) in provenance stores. Applications may be running in different institutions.

Purpose of Standardisation. Allow for multiple stores from multiple IT providers when applications record documentation of execution.

Purpose of Standardisation. Allow for multiple stores from multiple IT providers when querying the provenance of data.

Purpose of Standardisation. Allow for legacy, monolithic applications to expose their contents according to a standard schema (convert to a standard data format).

Purpose of Standardisation. Allow third parties to host provenance stores, which are trusted by application owners but also by auditors.

Compliance Oriented Architectures. Separate execution documentation from compliance verification. This allows for multiple compliance verifications, and allows validation to take place across multiple applications, possibly run by different institutions (in particular, it allows for outsourcing and subcontracting). The approach is suitable for e-scientific peer-reviewing and business compliance verification.

Standardisation Philosophy. A thin layer common between systems: an extensible data model and service interfaces. The model can be extended for specific technologies (WS, Web, …) or application domains (Bio, Healthcare, Desktop, …).

Proposed List of Specifications: WS-Prov-Intro; WS-Prov-DM; WS-Prov-Glo; WS-Prov-Rec; WS-Prov-Query; WS-Prov-DM-Link; WS-Prov-DM-Infer; WS-Prov-DM-DS; WS-Prov-SOAP; WS-Prov-DM-Sec; WS-Prov-WWW; WS-Prov-DM-Rel; WS-Prov-Primer. Groupings shown in the diagram: Generic Profiles, Domain Specific Profiles, Technology Bindings.

Provenance Queries (Miles’06)

Example Application. Services: GUI, Averager, Divider, Store. Messages: 1. average (7, 5); 2. divide (12, 2); 3. answer (6); 4. answer (6); 5. store (“6”, file1). Averager(in1,in2) { return (in1+in2)/2; } Averager delegates the division operation to the service Divider.

Example Application. Services: GUI, Averager, Divider, Store. Messages: 1. average (7, 5); 2. divide (12, 2); 3. answer (6); 4. answer (6); 5. store (“6”, file1). Relationships: 12 in msg 2 is sum of 7, 5 in msg 1; 6 in msg 3 is division of 12, 2 in msg 2; 6 in msg 4 is copy of 6 in msg 3; 6 in msg 4 is average of 7, 5 in msg 1; 6 in msg 6 is copy of 6 in msg 4. Tracers are used to demarcate activities (aka sets of services): a tracer is added by Averager in its call to Divider and returned by Divider in its response.

The data we want to find the provenance of. Identify the event where the entity is documented: in this case, the event is the receipt by Store of a request to store the data in the file named file1. Identify the data entity within that message: in this case, the data of interest is the “6” stored in file1.

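Provenance Graph (diagram). Nodes, each labelled with the receiver and sender of the message carrying the data: “6” (Store, GUI); “6” (GUI, Averager); “6” (Averager, Divider); “12” (Divider, Averager); “2” (Divider, Averager); “5” (Averager, GUI); “7” (Averager, GUI). Edge labels: Copy of; Division of (Dividend, Divisor); Sum of; Average of.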

Scoped Provenance Graph (the provenance graph with a filter applied). Filter to exclude “Average of” relationships. This allows us to ignore the high-level structure of the computation and to focus on the actual operations, e.g. it allows us to establish what a given provider actually does.

Scoped Provenance Graph (the provenance graph with a filter applied). Filter to exclude messages containing the tracer. This is equivalent to hiding the internal operation of Averager. It allows us to consider a given service (and all its inferior invocations) as a black box, giving a high-level account of provenance, e.g. when no detail should be provided about the internals of Averager.

Scoped Provenance Graph (the provenance graph with a filter applied). Filter to exclude Divisor parameters. This allows us to scope the provenance graph according to types of data or operations, e.g. looking at the restorations of a painting rather than its various owners.
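The three scopes can be mimicked over a small in-memory copy of the example relationships. This is a hedged sketch, not the Provenance Query Interface itself: the `Relationship` tuple, the `item@message` identifiers, the relation names and the `provenance` function are all assumptions made for illustration. It also assumes that the final "copy of" relationship refers to the store request (message 5 in the message list, labelled msg 6 in the relationship list), and that the tracer is carried by messages 2 and 3.

```python
from typing import FrozenSet, List, NamedTuple, Set


class Relationship(NamedTuple):
    effect: str     # data item that was produced
    relation: str   # how it was produced
    cause: str      # data item it was produced from
    role: str = ""  # role of the cause, where the slides name one


RELATIONSHIPS = [
    Relationship("6@msg5", "copyOf", "6@msg4"),
    Relationship("6@msg4", "copyOf", "6@msg3"),
    Relationship("6@msg4", "averageOf", "7@msg1"),
    Relationship("6@msg4", "averageOf", "5@msg1"),
    Relationship("6@msg3", "divisionOf", "12@msg2", "dividend"),
    Relationship("6@msg3", "divisionOf", "2@msg2", "divisor"),
    Relationship("12@msg2", "sumOf", "7@msg1"),
    Relationship("12@msg2", "sumOf", "5@msg1"),
]

# Assumption: the Averager's tracer appears in its call to Divider and the reply.
TRACED_MESSAGES: FrozenSet[str] = frozenset({"msg2", "msg3"})


def provenance(
    start: str,
    exclude_relations: FrozenSet[str] = frozenset(),
    exclude_roles: FrozenSet[str] = frozenset(),
    exclude_traced: bool = False,
) -> List[Relationship]:
    """Walk back from `start`, keeping only relationships inside the scope."""
    kept: List[Relationship] = []
    seen: Set[str] = {start}
    frontier = [start]
    while frontier:
        item = frontier.pop()
        for rel in RELATIONSHIPS:
            if rel.effect != item:
                continue
            if rel.relation in exclude_relations or rel.role in exclude_roles:
                continue
            if exclude_traced and rel.cause.split("@")[1] in TRACED_MESSAGES:
                continue
            kept.append(rel)
            if rel.cause not in seen:
                seen.add(rel.cause)
                frontier.append(rel.cause)
    return kept


def causes(rels: List[Relationship]) -> Set[str]:
    return {rel.cause for rel in rels}


unscoped = provenance("6@msg5")
no_average = provenance("6@msg5", exclude_relations=frozenset({"averageOf"}))
black_box = provenance("6@msg5", exclude_traced=True)
no_divisor = provenance("6@msg5", exclude_roles=frozenset({"divisor"}))
print(sorted(causes(black_box)))
```

With the tracer filter, the walk stops at the Averager's reply and its inputs, which matches the "black box" reading; with the "Average of" filter, the same inputs are still reached, but only through the sum and division steps.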

Provenance Query

Practically … Event and Data Identification. Event identification: //ps:interactionRecord [ps:interactionKey/ps:messageSink/ wsa:EndpointReference/ wsa:Address=" (the interaction record in which the receiver (messageSink) has address) and //ps:interactionPAssertion [ex:envelope/ex:store/ex:location="/home/sm/data/file1"]. Data identification: //ex:envelope/ex:store/ex:data

Practically … The scope of the provenance query. Unscoped query: / Exclude ‘averageOf’ relation: /pq:relationshipTarget[ps:relation!= " Exclude tracer introduced by Averager: /pq:relationshipTarget/ps:interactionPAssertion [not(ex:envelope/ph:pheader/ ph:interactionMetaData [ph:tracer="process://sub/1"])]

Provenance of Donor Diagnosis Request (diagram). Labels shown: Diagnose Request, Decision Maker, Donor Data Collector, Brain Death Manager, Testing Lab, Data Collection Complete, Healthcare Record Manager, Patient (in Brain Death Notification), User Interface, Test Results, Donor Data Collection, Data Collection Request, EHCR, EHCRS, EHCR Request. Edge labels: Was Caused By; Is Diagnosis Request For Patient; Includes Data; Is Response To.

Conclusions

To Sum Up: standardising the documentation of business processes. Provenance Store: Record, Query. Uses: compliance check, rerun/reproduce, analyse. Provenance Architecture and Methodology apply to: Healthcare, Distribution, Finance, Aerospace, Automobile, Pharmaceutical. Slide from John Ibbotson.

Conclusions. Crucial topic for many applications. Full architectural specification. An implementation available for download. Methodology to make applications provenance-aware.

twiki.ipaw.info Provenance Challenge

Publications
1. Paul Groth, Simon Miles, Weijian Fang, Sylvia C. Wong, Klaus-Peter Zauner, and Luc Moreau. Recording and Using Provenance in a Protein Compressibility Experiment. In Proceedings of the 14th IEEE International Symposium on High Performance Distributed Computing (HPDC'05), July
2. Paul Groth, Michael Luck, and Luc Moreau. A protocol for recording provenance in service-oriented Grids. In Proceedings of the 8th International Conference on Principles of Distributed Systems (OPODIS'04), Grenoble, France, December
3. Paul Groth, Michael Luck, and Luc Moreau. Formalising a protocol for recording provenance in Grids. In Proceedings of the UK OST e-Science second All Hands Meeting 2004 (AHM'04), Nottingham, UK, September
4. Simon Miles, Paul Groth, Miguel Branco, and Luc Moreau. The requirements of recording and using provenance in e-Science experiments. Technical report, University of Southampton,
5. Luc Moreau, Syd Chapman, Andreas Schreiber, Rolf Hempel, Omer Rana, Laszlo Varga, Ulises Cortes, and Steven Willmott. Provenance-based Trust for Grid Computing --- Position Paper. In,
6. Paul Townend, Paul Groth, and Jie Xu. A Provenance-Aware Weighted Fault Tolerance Scheme for Service-Based Applications. In Proc. of the 8th IEEE International Symposium on Object-oriented Real-time distributed Computing (ISORC 2005), May
7. Paul Groth, Simon Miles, Victor Tan, and Luc Moreau. Architecture for Provenance Systems. Technical report, University of Southampton, October 2005.

Questions