Provenance: an open approach to experiment validation in e-Science


1 Provenance: an open approach to experiment validation in e-Science
Professor Luc Moreau University of Southampton

2 Provenance & PASOA Teams
University of Southampton: Luc Moreau, Paul Groth, Simon Miles, Victor Tan, Miguel Branco, Sofia Tsasakou, Sheng Jiang, Steve Munroe, Zheng Chen
IBM UK (EU Project Coordinator): John Ibbotson, Neil Hardman, Alexis Biller
University of Wales, Cardiff: Omer Rana, Arnaud Contes, Vikas Deora, Ian Wootten, Shrija Rajbhandari
Universitat Politècnica de Catalunya (UPC): Steven Willmott, Javier Vazquez
SZTAKI: Laszlo Varga, Arpad Andics, Tamas Kifor
German Aerospace: Andreas Schreiber, Guy Kloss, Frank Danneman

3 Contents
Motivation
Provenance Concepts
Provenance Architecture
Standardisation
Provenance Queries
Conclusions

4 Motivation

5 Scientific Research Academic Peer Review

6 Business Regulations
Accounting: Audit (Sarbanes-Oxley)
Banking: Audit (Basel II)

7 Accounting Audit (Sarbanes-Oxley)

8 Banking Audit (Basel II)

9 Health Care Management
European Recommendation R(97)5: on the protection of medical data

10 e-Science datasets How to undertake peer-reviewing and validation of e-Scientific results?

11 Sarbanes-Oxley The American Competitiveness and Corporate Accountability Act of 2002, commonly known as the Sarbanes-Oxley Act, was signed into law on July 30, 2002. The law is intended to protect investors by improving the accuracy and reliability of corporate disclosures. Sarbanes-Oxley also defines a higher level of responsibility, accountability, and financial reporting transparency - changes that are intended to return confidence to investors, as well.

12 Food & Drug Administration

13 Basel II

14 Compliance to Regulations
The “next-compliance” problem: can we be certain that, by ensuring compliance with a new regulation, we do not break previous compliance?

15 Current Solutions
Proprietary, Monolithic Silos, Closed
Do not inter-operate with other applications
Not adaptable to new regulations

16 Provenance
Oxford English Dictionary:
the fact of coming from some particular source or quarter; origin, derivation
the history or pedigree of a work of art, manuscript, rare book, etc.; concretely, a record of the passage of an item through its various owners
Concept vs representation

17 Provenance in Computer Systems
Our definition of provenance in the context of applications for which process matters to end users: the provenance of a piece of data is the process that led to that piece of data.
Our aim is to conceive a computer-based representation of provenance that allows us to perform useful analysis and reasoning to support our use cases.

18 Our Approach
Define core concepts pertaining to provenance
Specify functionality required to become “provenance-aware”
Define open data models and protocols that allow systems to inter-operate
Standardise data models and protocols
Provide a reference implementation
Provide reasoning capability

19 Context (1)
Aerospace engineering: maintain a historical record of design processes, up to 99 years.
Organ transplant management: tracking of previous decisions, crucial to maximise the efficiency in matching and recovery rate of patients.

20 Context (2)
Bioinformatics: verification and auditing of “experiments” (e.g. for drug approval).
High Energy Physics: tracking, analysing, verifying data sets in the ATLAS Experiment of the Large Hadron Collider (CERN).

21 Provenance Concepts

22 Provenance “Lifecycle”
[Diagram: Application, Data, Results, Provenance Store]
Core Interfaces to Provenance Store:
Record Documentation of Execution
Query and Reason over Provenance of Data
Administer Store and its contents

23 Nature of Documentation
We represent the provenance of some data by documenting the process that led to the data: documentation can be complete or partial; it can be accurate or inaccurate; it can present conflicting or consensual views of the actors involved; it can provide operational details of execution or it can be abstract.

24 p-assertion
A given element of process documentation will be referred to as a p-assertion.
p-assertion: an assertion that is made by an actor and pertains to a process.

25 Service Oriented Architecture
Broad definition of a service: a component that takes some inputs and produces some outputs. Services are brought together to solve a given problem, typically via a workflow definition that specifies their composition. Interactions with services take place through messages that are constructed according to the service's interface specification. The term actor denotes either a client or a service in a SOA. A process is defined as the execution of a workflow.

26 Process Documentation (1)
[Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3, M4]
Actor 1 asserts: I received M1, M4; I sent M2, M3
Actor 2 asserts: I received M3; I sent M4
From these p-assertions, we can derive that M3 was sent by Actor 1 and received by Actor 2 (and likewise for M4).
If actors are black boxes, these assertions are not very useful because we do not know dependencies between messages.
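
The derivation on this slide can be sketched in a few lines of Python. This is only an illustration, not the PASOA recording protocol: the tuple layout and the `pair_up` helper are assumptions introduced here, and only the six assertions themselves come from the slide.

```python
from collections import defaultdict

# Interaction p-assertions from the slide, as (asserter, verb, message) tuples.
# The tuple layout is a hypothetical encoding chosen for this sketch.
assertions = [
    ("Actor 1", "received", "M1"),
    ("Actor 1", "received", "M4"),
    ("Actor 1", "sent", "M2"),
    ("Actor 1", "sent", "M3"),
    ("Actor 2", "received", "M3"),
    ("Actor 2", "sent", "M4"),
]

def pair_up(assertions):
    """Map each message to its asserted sender and receiver (None if undocumented)."""
    views = defaultdict(lambda: {"sender": None, "receiver": None})
    for actor, verb, message in assertions:
        role = "sender" if verb == "sent" else "receiver"
        views[message][role] = actor
    return dict(views)

exchanges = pair_up(assertions)
print(exchanges["M3"])  # {'sender': 'Actor 1', 'receiver': 'Actor 2'}
```

M1 and M2 end up with only one documented side, which matches the slide: their other endpoints lie outside the two actors shown.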

27 Process Documentation (2)
[Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3, M4]
Assertions: M2 is in reply to M1; M3 is caused by M1; M2 is caused by M4; M4 is in reply to M3
These assertions help identify order of messages, but not how data was computed.

28 Process Documentation (3)
[Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3, M4, with functions f1, f2, f]
Assertions: M3 = f1(M1); M2 = f2(M1,M4); M4 = f(M3)
These assertions help identify how data is computed, but provide no information about non-functional characteristics of the computation (time, resources used, etc).

29 Process Documentation (4)
[Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3, M4]
Assertions: I used 386 cluster; Request sat in queue for 6min; I used sparc processor; I used algorithm x version x.y.z

30 Types of p-assertions (1)
Interaction p-assertion: an assertion of the contents of a message by an actor that has sent or received that message.
Examples: I received M1, M4; I sent M2, M3

31 Types of p-assertions (2)
Relationship p-assertion: an assertion, made by an actor, that describes how the actor obtained an output message sent in an interaction by applying some function to input messages from other interactions (likewise for data).
Examples: M2 is in reply to M1; M3 is caused by M1; M2 is caused by M4; M3 = f1(M1); M2 = f2(M1,M4)

32 Types of p-assertions (3)
Actor state p-assertion: an assertion made by an actor about its internal state in the context of a specific interaction.
Examples: I used sparc processor; I used algorithm x version x.y.z
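
To make the three kinds of p-assertion concrete, here is a minimal Python sketch. The class and field names are hypothetical and do not reproduce the PASOA schemas; the attribution of the "sparc processor" statement to Actor 2 is also an assumption, since the slides do not say which actor made it.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class InteractionPAssertion:
    asserter: str      # actor making the assertion
    role: str          # "sender" or "receiver" of the message
    message_id: str
    content: str       # asserted contents of the message

@dataclass(frozen=True)
class RelationshipPAssertion:
    asserter: str
    effect: str                # output message (or data item)
    relation: str              # e.g. "f1", "is caused by"
    causes: Tuple[str, ...]    # input messages (or data items)

@dataclass(frozen=True)
class ActorStatePAssertion:
    asserter: str
    message_id: str    # interaction in whose context the state is asserted
    state: str

# Illustrative documentation; the asserter of the state assertion is assumed.
documentation = [
    InteractionPAssertion("Actor 1", "receiver", "M1", "<contents of M1>"),
    InteractionPAssertion("Actor 1", "sender", "M3", "<contents of M3>"),
    RelationshipPAssertion("Actor 1", "M3", "f1", ("M1",)),
    ActorStatePAssertion("Actor 2", "M3", "I used sparc processor"),
]

def by_kind(assertions, kind):
    """Select the p-assertions of one kind."""
    return [a for a in assertions if isinstance(a, kind)]

print(len(by_kind(documentation, InteractionPAssertion)))  # 2
```

Frozen dataclasses are used here because a p-assertion, once made, is not meant to change.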

33 Data flow
Interaction p-assertions allow us to specify a flow of data between actors.
Relationship p-assertions allow us to characterise the flow of data “inside” an actor.
Overall data flow (internal + external) constitutes a DAG, which characterises the process that led to a result.
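
As a rough illustration of the DAG idea, the sketch below encodes the relationship p-assertions from the Process Documentation (3) slide and walks them backwards from a result. The dictionary encoding and the `provenance` function are assumptions made for this example; messages serve as nodes, so the external flow between actors is implicit in a message being sent by one actor and received by the other.

```python
# Relationship p-assertions from the Process Documentation (3) slide:
# M3 = f1(M1), M2 = f2(M1, M4), M4 = f(M3)
derived_from = {
    "M3": ["M1"],
    "M2": ["M1", "M4"],
    "M4": ["M3"],
}

def provenance(message, derived_from):
    """Return every message that `message` transitively depends on."""
    seen = set()
    stack = [message]
    while stack:
        current = stack.pop()
        for parent in derived_from.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(provenance("M2", derived_from)))  # ['M1', 'M3', 'M4']
```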

34 Provenance Architecture

35 Interfaces to Provenance Store
[Diagram: Application, Results, Provenance Store]
Record Documentation of Execution
Query and Reason over Provenance of Data
Administer Store and its contents

36

37 P-Assertion schemas

38 The p-structure (1)
The p-structure is a common logical structure of the provenance store shared by all asserting and querying actors.
Hierarchical
Indexed by interactions (interaction = 1 message exchange)
Sender’s view
Receiver’s view

39 The p-structure (2)
Asserter identity
All p-assertions asserted by a given actor participating in an interaction
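
One way to picture the hierarchy described on these two slides is a nested index: interaction, then view, then asserter, then that asserter's p-assertions. The sketch below assumes exactly that layering; the function names and the use of plain strings for p-assertions are illustrative only and are not the PReServ data model.

```python
from collections import defaultdict

def new_p_structure():
    """p_structure[interaction_key][view][asserter] -> list of p-assertions."""
    return defaultdict(
        lambda: {"sender": defaultdict(list), "receiver": defaultdict(list)}
    )

def record(p_structure, interaction_key, view, asserter, p_assertion):
    """File a p-assertion under one interaction, one view and one asserter."""
    if view not in ("sender", "receiver"):
        raise ValueError("view must be 'sender' or 'receiver'")
    p_structure[interaction_key][view][asserter].append(p_assertion)

store = new_p_structure()
record(store, "M3", "sender", "Actor 1", "I sent M3")
record(store, "M3", "sender", "Actor 1", "M3 = f1(M1)")
record(store, "M3", "receiver", "Actor 2", "I received M3")

print(store["M3"]["sender"]["Actor 1"])  # ['I sent M3', 'M3 = f1(M1)']
```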

40 Recording Protocol (Groth04-06)
Abstract machines
DS Properties: Termination, Liveness, Safety, Statelessness
Documentation Properties: Immutability, Attribution, Datatype safety
Foundation for adding necessary cryptographic techniques

41 Querying Functionality (Miles06)
Process Documentation Query Interface: allows for “navigation” of the documentation of execution
Allows us to view the provenance store (i.e. the p-structure) as if containing XML data structures
Independent of technology used for running application and internal store representation
Seamless navigation of application dependent and application independent process documentation

42 Querying Functionality (Miles06)
Provenance Query Interface: allows us to obtain the provenance of some specific data
A recognition that there is not “one” provenance for a piece of data, but there may be different ones, depending on the end-user’s interest
Hence, provenance is seen as the result of a query:
Identify a piece of data at a specific execution point
Scope of the process of interest: filter in/out p-assertions according to actors, process, types of relationships, etc

43 Available Software PReServ (Paul Groth & Simon Miles)
Offers recording and querying interfaces
Available from [link not preserved in transcript]
An OGSA-DAI based version will soon be available from [link not preserved in transcript]
Is being used in a bioinformatics application (cf. HPDC'05, ISWC'05)

44 Standardisation

45 Standardisation Options

46 Purpose of Standardisation
[Diagram: several Applications, Record Documentation of Execution, Provenance Stores]
Allow for multiple applications to document their execution. Applications may be running in different institutions.

47 Purpose of Standardisation
[Diagram: one Application, Record Documentation of Execution, several Provenance Stores]
Allow for multiple stores from multiple IT providers

48 Purpose of Standardisation
[Diagram: several Provenance Stores, Query Provenance of Data]
Allow for multiple stores from multiple IT providers

49 Purpose of Standardisation
[Diagram: Convert in standard data format]
Allow for legacy, monolithic applications to expose their contents (according to standard schema)

50 Purpose of Standardisation
[Diagram: Provenance Store, Application]
Allow third parties to host provenance stores, which are trusted by application owners but also auditors

51 Compliance Oriented Architectures
Separate execution documentation from compliance verification
Allows for multiple compliance verifications
Allows for validation to take place across multiple applications, possibly run by different institutions (in particular, allows for outsourcing and subcontracting)
Approach is suitable for e-scientific peer-reviewing and business compliance verification

52 Provenance Queries (Miles’06)

53 Example Application
[Diagram: GUI, Averager, Divider, Store]
Averager(in1,in2) { return (in1+in2)/2; }
Averager delegates the division operation to the service Divider
Messages shown: 3. answer (6); 4. answer (6); 5. store (“6”, file1)

54 Example Application
[Diagram: GUI, Averager, Divider, Store]
Messages shown: 3. answer (6); 4. answer (6); 5. store (“6”, file1)
Relationships:
12 in msg 2 is sum of 7, 5 in msg 1
6 in msg 3 is division of 12, 2 in msg 2
6 in msg 4 is copy of 6 in msg 3
6 in msg 4 is average of 7, 5 in msg 1
6 in msg 6 is copy of 6 in msg 4
Tracers are used to demarcate activities (aka sets of services): added by Averager in call to Divider, returned by Divider in response
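
The relationships listed on this slide could be produced by services that record a relationship p-assertion each time they compute something. The Python sketch below is a toy version of that idea: the `relationships` list stands in for a provenance store, the tuple layout is hypothetical, and integer division is used only so that the result matches the "6" on the slide.

```python
# Each entry: ((value, message), relation, [(value, message), ...]).
# This flat list is a stand-in for a provenance store, not the real interface.
relationships = []

def divider(dividend, divisor):
    """Divider service: records how its answer (msg 3) relates to its inputs (msg 2)."""
    result = dividend // divisor  # integer division, to match the slide's "6"
    relationships.append(
        ((result, "msg 3"), "division of", [(dividend, "msg 2"), (divisor, "msg 2")])
    )
    return result

def averager(in1, in2):
    """Averager service: sums its inputs and delegates the division to Divider."""
    total = in1 + in2
    relationships.append(
        ((total, "msg 2"), "sum of", [(in1, "msg 1"), (in2, "msg 1")])
    )
    quotient = divider(total, 2)
    relationships.append(((quotient, "msg 4"), "copy of", [(quotient, "msg 3")]))
    relationships.append(
        ((quotient, "msg 4"), "average of", [(in1, "msg 1"), (in2, "msg 1")])
    )
    return quotient

answer = averager(7, 5)
print(answer)  # 6
for effect, relation, causes in relationships:
    print(effect, relation, causes)
```

Averager records both the fine-grained "copy of" relationship and the high-level "average of" relationship, which is what later lets a query choose between the two levels of detail.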

55 The data we want to find the provenance of
Identify the event where the entity is documented. In this case, the event is the receipt of a request to store the data in the file named file1.
Identify the data entity within that message. In this case, the data of interest is the “6” stored in file1.

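56 Provenance Graph
“7”, “5” (GUI to Averager)
“12”, “2” (Averager to Divider): “12” is Sum of “7”, “5”; “12” is the Dividend, “2” is the Divisor
“6” (Divider to Averager): Division of “12”, “2”
“6” (Averager to GUI): Copy of the “6” from Divider; Average of “7”, “5”
“6” (GUI to Store): Copy of the “6” from Averager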

57 Scoped Provenance Graph
[Diagram: provenance graph of slide 56]
Filter to exclude “Average of” relationships
Allows us to ignore the high level structure of the computation and to focus on the actual operations, e.g. allows us to establish what a given provider actually does

58 Scoped Provenance Graph
[Diagram: provenance graph of slide 56]
Filter to exclude messages containing tracer. This is equivalent to hiding the internal operation of Averager.
Allows us to consider a given service (and all its inferior invocations) as a black box: high level account of provenance, e.g. no detail should be provided about the internals of Averager

59 Scoped Provenance Graph
[Diagram: provenance graph of slide 56]
Filter to exclude Divisor parameters
Allows us to scope the provenance graph according to types of data or operations, e.g. looking at the restorations of a painting rather than its various owners
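
Scoping can be read as a backward graph traversal that skips some edges. The sketch below assumes the example graph is stored as (effect, relation, cause) triples; the node labels such as `6@store` and the `provenance_graph` function are hypothetical, and only the relation-type filter (the "Average of" case) is shown.

```python
# Edges of the example provenance graph as (effect, relation, cause) triples.
# Node labels are hypothetical shorthand for "value in message".
edges = [
    ("12@msg2", "sum of", "7@msg1"),
    ("12@msg2", "sum of", "5@msg1"),
    ("6@msg3", "division of", "12@msg2"),
    ("6@msg3", "division of", "2@msg2"),
    ("6@msg4", "copy of", "6@msg3"),
    ("6@msg4", "average of", "7@msg1"),
    ("6@msg4", "average of", "5@msg1"),
    ("6@store", "copy of", "6@msg4"),
]

def provenance_graph(start, edges, excluded_relations=()):
    """Collect edges reachable backwards from `start`, skipping excluded relation types."""
    result, frontier, visited = [], [start], {start}
    while frontier:
        node = frontier.pop()
        for effect, relation, cause in edges:
            if effect == node and relation not in excluded_relations:
                result.append((effect, relation, cause))
                if cause not in visited:
                    visited.add(cause)
                    frontier.append(cause)
    return result

full = provenance_graph("6@store", edges)
scoped = provenance_graph("6@store", edges, excluded_relations={"average of"})
print(len(full), len(scoped))  # 8 6
```

With "average of" excluded, the inputs 7 and 5 are still reached, but only through the sum and division steps, which is the "actual operations" view described on slide 57.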

60 Provenance Query

61 Practically … Event and Data Identification
Event identification: the interaction record in which the receiver (messageSink) has address …
//ps:interactionRecord[ps:interactionKey/ps:messageSink/wsa:EndpointReference/wsa:Address="…"]
//ps:interactionPAssertion[ex:envelope/ex:store/ex:location="/home/sm/data/file1"]
Data identification:
//ex:envelope/ex:store/ex:data

62 Practically … The scope of the provenance query
Unscoped query:
/
Exclude ‘averageOf’ relation:
/pq:relationshipTarget[ps:relation!="…"]
Exclude tracer introduced by Averager:
/pq:relationshipTarget/ps:interactionPAssertion[not(ex:envelope/ph:pheader/ph:interactionMetaData[ph:tracer="process://sub/1"])]
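
For readers who want to experiment with the event and data identification step, the following Python sketch runs a similar selection over a small XML document using the standard library. Everything about the document is an assumption: the namespace URIs, the `ps:provenanceStore` root element and the second record are invented for the example, and the real store schema is not given in the slides. Because `xml.etree.ElementTree` supports only a subset of XPath, the location test is done in Python rather than in a path predicate.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs; the actual schema URIs are not given in the slides.
NS = {"ps": "urn:example:ps", "ex": "urn:example:ex"}

# Hypothetical store contents, shaped after the element names used in the queries.
DOC = """
<ps:provenanceStore xmlns:ps="urn:example:ps" xmlns:ex="urn:example:ex">
  <ps:interactionRecord>
    <ps:interactionPAssertion>
      <ex:envelope><ex:store>
        <ex:location>/home/sm/data/file1</ex:location>
        <ex:data>6</ex:data>
      </ex:store></ex:envelope>
    </ps:interactionPAssertion>
  </ps:interactionRecord>
  <ps:interactionRecord>
    <ps:interactionPAssertion>
      <ex:envelope><ex:store>
        <ex:location>/home/sm/data/file2</ex:location>
        <ex:data>9</ex:data>
      </ex:store></ex:envelope>
    </ps:interactionPAssertion>
  </ps:interactionRecord>
</ps:provenanceStore>
"""

def find_stored_data(xml_text, location):
    """Return the data of every store request whose location matches."""
    root = ET.fromstring(xml_text.strip())
    found = []
    # Event identification: interaction p-assertions storing to `location`.
    for assertion in root.iterfind(".//ps:interactionPAssertion", NS):
        loc = assertion.findtext("ex:envelope/ex:store/ex:location", namespaces=NS)
        if loc == location:
            # Data identification: the data element inside that message.
            found.append(
                assertion.findtext("ex:envelope/ex:store/ex:data", namespaces=NS)
            )
    return found

print(find_stored_data(DOC, "/home/sm/data/file1"))  # ['6']
```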

63 Provenance of Donor Diagnosis Request
[Diagram: provenance graph. Messages: Data Collection Request, EHCR Request, EHCR, Data Collection Complete, Patient (in Brain Death Notification), Test Results, Diagnose Request. Actors: Donor Data Collector, Healthcare Record Manager, EHCRS, User Interface, Brain Death Manager, Testing Lab, Decision Maker. Relationships: Was Caused By, Is Response To, Includes Data, Is Diagnosis Request For.]
Starting from the bottom, the diagnosis request is made by the Donor Data Collector to the Decision Maker, triggered by the Brain Death Manager determining that adequate data is available to start diagnosis. The Brain Death Manager makes this decision on the basis of three sources of information (see the relationship with three input parameters in the middle of the figure): (1) the notification of a patient's brain death from a doctor (via User Interface), (2) the Donor Data Collector having obtained the most up to date EHCR for the donor, and (3) the test results on the patient's blood having been completed by the Testing Lab. The EHCR data collection by the Donor Data Collector is triggered by the request from the doctor sometime prior to brain death, i.e. when they think the patient is going to die soon (see relationship at top of slide).
The process of retrieving the Electronic HealthCare Record (EHCR) is expanded to make a more interesting graph: the initial request to collect donor data leads to a request to the EHCR System, which responds with the EHCR for the patient. Two tracers are shown. The red tracer demarcates all the processing of the Donor Data Collector actor. The yellow tracer demarcates the processing of the Healthcare Record Manager actor in gathering the EHCR.

64 Conclusions

65 To Sum Up
Standardising the documentation of Business Processes
[Diagram labels. Sectors: Finance, Distribution, Aerospace, Healthcare, Automobile, Pharmaceutical. Steps: Apply Provenance Architecture and Methodology; Record (Provenance Store); Query; Analyse; Compliance check; Rerun/Reproduce.]
Slide from John Ibbotson

66 Conclusions
Crucial topic for many applications
Full architectural specification
An implementation available for download
Methodology to make application provenance-aware

67

68 Publications
Paul Groth, Simon Miles, Weijian Fang, Sylvia C. Wong, Klaus-Peter Zauner, and Luc Moreau. Recording and Using Provenance in a Protein Compressibility Experiment. In Proceedings of the 14th IEEE International Symposium on High Performance Distributed Computing (HPDC'05), July 2005.
Paul Groth, Michael Luck, and Luc Moreau. A protocol for recording provenance in service-oriented Grids. In Proceedings of the 8th International Conference on Principles of Distributed Systems (OPODIS'04), Grenoble, France, December 2004.
Paul Groth, Michael Luck, and Luc Moreau. Formalising a protocol for recording provenance in Grids. In Proceedings of the UK OST e-Science second All Hands Meeting 2004 (AHM'04), Nottingham, UK, September 2004.
Simon Miles, Paul Groth, Miguel Branco, and Luc Moreau. The requirements of recording and using provenance in e-Science experiments. Technical report, University of Southampton, 2005.
Luc Moreau, Syd Chapman, Andreas Schreiber, Rolf Hempel, Omer Rana, Laszlo Varga, Ulises Cortes, and Steven Willmott. Provenance-based Trust for Grid Computing --- Position Paper. In , 2003.
Paul Townend, Paul Groth, and Jie Xu. A Provenance-Aware Weighted Fault Tolerance Scheme for Service-Based Applications. In Proc. of the 8th IEEE International Symposium on Object-oriented Real-time distributed Computing (ISORC 2005), May 2005.
Paul Groth, Simon Miles, Victor Tan, and Luc Moreau. Architecture for Provenance Systems. Technical report, University of Southampton, October 2005.

69 Questions

70 OTM Application

