
1 July 27, 2005, High Performance Distributed Computing 05
Recording and Using Provenance in a Protein Compressibility Experiment
Paul Groth, Simon Miles, Weijian Fang, Sylvia C. Wong, Klaus-Peter Zauner and Luc Moreau
University of Southampton

2 Outline
- Biology
- The Workflow
- Use Cases
- Provenance
- Implementation
- Evaluation
- Conclusion

3 Biology
- How do protein sequences (chains of amino acids) fold into a 3D structure?
- Which part of the DNA translates into one protein sequence?
- The structure of protein sequences may help to answer these questions.
- Structure can be quantified by textual compressibility.
- Which amino acid groupings maximize compressibility?

4 The Workflow
- Get sequences
- Make a sample
- Recode the sample
- Compress and measure
- Shuffle the sample
- Compress and measure each permutation
- Collate all measures
- Produce the average compressibility
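The workflow steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the experiment's actual code: the grouping in `GROUPS`, the use of zlib as the compressor, the number of permutations, and the final ratio are all hypothetical choices made for the example.

```python
import random
import zlib

# Hypothetical grouping (assumption): one symbol per group of amino acids.
GROUPS = {"a": "AVLIMFWC", "b": "STNQYGP", "c": "DEKRH"}


def recode(sample, groups):
    """Replace each amino acid letter with the symbol of its group."""
    table = {aa: symbol for symbol, members in groups.items() for aa in members}
    # Letters outside every group are kept visible as "?".
    return "".join(table.get(aa, "?") for aa in sample)


def compressed_size(text):
    """Compress and measure: size in bytes of the zlib-compressed text."""
    return len(zlib.compress(text.encode("ascii"), 9))


def measure(sample, groups, permutations=20, seed=0):
    """Run recode, compress, shuffle, compress, collate on one sample."""
    rng = random.Random(seed)
    recoded = recode(sample, groups)
    original_size = compressed_size(recoded)

    # Shuffle the recoded sample and measure each permutation.
    shuffled_sizes = []
    for _ in range(permutations):
        symbols = list(recoded)
        rng.shuffle(symbols)
        shuffled_sizes.append(compressed_size("".join(symbols)))

    # Collate all measures and produce the average.
    average_shuffled = sum(shuffled_sizes) / len(shuffled_sizes)
    return {
        "original": original_size,
        "average_shuffled": average_shuffled,
        # Assumption: report the original size relative to the shuffled average.
        "ratio": original_size / average_shuffled,
    }
```

A ratio below 1 would suggest that the recoded sample carries structure that the shuffled permutations have lost.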

5 Use Case (1)
- A bioinformatician, A, downloads sequence data of microbial proteins from the RefSeq database and runs the compressibility experiment.
- A later performs the same experiment on the same sequence data, again downloaded from RefSeq.
- A compares the two experiment results and notices a difference.
- A determines whether the difference was caused by a change in the algorithms.

6 Use Case (2)
- A bioinformatician performs an experiment on a FASTA sequence encoding a protein.
- A reviewer later determines whether the sequence was in fact processed by a service that meaningfully processes only protein sequences.
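One way the reviewer's check in Use Case (2) could be expressed is sketched below. It assumes that each recorded interaction names the service and the type of its input, and that a separate table lists which input types each service meaningfully accepts. The `Interaction` record, the `ACCEPTED_TYPES` table, and the service names are hypothetical and are not taken from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """Hypothetical record of one service invocation."""
    service: str
    input_type: str  # e.g. "protein" or "nucleotide"


# Assumption: which input types each service can meaningfully process.
ACCEPTED_TYPES = {
    "recode": {"protein"},
    "compress": {"protein", "nucleotide"},
}


def invalid_interactions(interactions, accepted):
    """Return the interactions whose input type the service does not accept."""
    return [
        i for i in interactions
        if i.input_type not in accepted.get(i.service, set())
    ]
```

An empty result means that every recorded step received an input type its service is declared to handle.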

7 Provenance
- The use cases relate to process.
- Provenance definition: the provenance of a result is the process that led to that result.
  - This is a conceptual definition.

8 Documentation of Process
- Conceive a computer-based representation of provenance.
- We represent the provenance of some data by documenting the process that led to the data. Documentation:
  - can be complete or partial;
  - can be accurate or inaccurate;
  - can present conflicting or consensual views of the actors involved;
  - can provide operational details of execution or can be abstract.

9 Heterogeneity
- This is a heterogeneous application.
  - It has shell scripts, Java programs, and web services.
- Heterogeneity is common in Grid-based applications.
  - LCG Atlas: Athena and VDT coexist.
- Support for plugging in different execution environments.

10 Provenance “Lifecycle”
[Diagram: Application; Results; Provenance Store. Arrows: record documentation of process; query to retrieve the provenance of a result.]
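The record and query halves of this lifecycle can be illustrated with a small in-memory store. The sketch below is an assumption made for illustration and does not reflect the PReServ interface: the `ProcessRecord` fields and the `ProvenanceStore` methods are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessRecord:
    """Hypothetical documentation of one step: a service turned inputs into an output."""
    service: str
    inputs: tuple
    output: str


class ProvenanceStore:
    """Minimal in-memory store: record documentation, then query it by result."""

    def __init__(self):
        self._by_output = {}

    def record(self, rec):
        self._by_output[rec.output] = rec

    def provenance_of(self, result):
        """Return the records that led to `result`, latest step first."""
        trace, pending, seen = [], [result], set()
        while pending:
            item = pending.pop()
            rec = self._by_output.get(item)
            if rec is None or item in seen:
                continue  # raw input with no recorded process, or already visited
            seen.add(item)
            trace.append(rec)
            pending.extend(rec.inputs)
        return trace
```

For example, after recording the steps that produce a sample, a recoded sample, and a compressed size, `provenance_of` on the size walks back through all three steps.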

11 Use Case 1: Do services differ between experiments?
- Retrieve documentation of the experiments from the provenance store.
- Highlight differences in services between experiments.
[Diagram: documentation of each experiment, listing Service A and its details, retrieved from the Provenance Store and compared.]
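The comparison step could look like the following sketch. It assumes that the documentation retrieved for each experiment can be reduced to a mapping from service name to some fingerprint of the service, such as a hash of its script. That reduction, the `fingerprint` helper, and the function names are hypothetical and are not described in the slides.

```python
import hashlib


def fingerprint(script_bytes):
    """Hypothetical service fingerprint: SHA-256 of the script contents."""
    return hashlib.sha256(script_bytes).hexdigest()


def service_differences(run_a, run_b):
    """Map each service that differs between two runs to its pair of fingerprints.

    A service missing from one run appears with None on that side.
    """
    differences = {}
    for name in sorted(set(run_a) | set(run_b)):
        a, b = run_a.get(name), run_b.get(name)
        if a != b:
            differences[name] = (a, b)
    return differences
```

An empty result would indicate that the same services were used in both experiments, so a difference in results would have to come from elsewhere.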

12 Implementation
- Implemented as a VDT workflow
- Scheduled by Condor
- Each service, script, and command records process documentation into a provenance store.
- Uses PReServ, a web services implementation of a provenance store

13 PReServ Implementation Diagram
[Diagram labels: Axis Handler; Provenance Service; Backend Store Interface; Backend Stores (Database Store, In-Memory Store, …); PS Client Side Library; Web Service; WS Client; Query Actor. Connections are marked as WS calls or Java calls.]
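The idea of a backend store interface with interchangeable in-memory and database stores can be sketched as follows. PReServ itself is a Java web service, so this Python version is only an analogy: the class and method names are hypothetical, and SQLite stands in for the database backend purely for illustration.

```python
import sqlite3
from abc import ABC, abstractmethod


class BackendStore(ABC):
    """Hypothetical interface that the provenance service codes against."""

    @abstractmethod
    def put(self, key, document):
        """Store a document under a key."""

    @abstractmethod
    def get(self, key):
        """Return the document for a key, or None if absent."""


class InMemoryStore(BackendStore):
    def __init__(self):
        self._docs = {}

    def put(self, key, document):
        self._docs[key] = document

    def get(self, key):
        return self._docs.get(key)


class SqliteStore(BackendStore):
    """Database-backed store; SQLite is an assumption made for this example."""

    def __init__(self, path=":memory:"):
        self._db = sqlite3.connect(path)
        self._db.execute(
            "CREATE TABLE IF NOT EXISTS docs (key TEXT PRIMARY KEY, document TEXT)"
        )

    def put(self, key, document):
        with self._db:  # commits on success
            self._db.execute(
                "INSERT OR REPLACE INTO docs VALUES (?, ?)", (key, document)
            )

    def get(self, key):
        row = self._db.execute(
            "SELECT document FROM docs WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None
```

Because both stores satisfy the same interface, the code that records and queries documentation does not need to know which backend is in use.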

14 Evaluation: Deployment
- Runs on VMware
  - deployment consistency
  - ease of development
- Workflow is executed on one machine
- PReServ runs on another machine

15 Recording Performance

16 Query Performance

17 Conclusion
- Both recording and query times are linear.
- 10% overhead for asynchronous recording.
- Our provenance concept and system are grounded in a number of use cases.
- The experiment is ready to be moved to a cluster or a grid.
  - Southampton Cluster
  - A Grid
  - Will allow us to test scalability
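Asynchronous recording, in the general sense of not blocking the application while documentation is submitted, can be sketched with a queue and a background thread. This is an assumption about what asynchronous means here and is not a description of how PReServ's client library works. `AsyncRecorder` and its `submit` callback are hypothetical names.

```python
import queue
import threading


class AsyncRecorder:
    """Queue documentation locally and submit it from a background thread."""

    def __init__(self, submit):
        self._submit = submit  # callable that sends one document to the store
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, document):
        """Return immediately; the document is submitted later."""
        self._queue.put(document)

    def _drain(self):
        while True:
            document = self._queue.get()
            try:
                if document is not None:
                    self._submit(document)
            finally:
                self._queue.task_done()
            if document is None:  # sentinel: stop the worker
                return

    def close(self):
        """Flush everything queued so far, then stop the worker."""
        self._queue.put(None)
        self._queue.join()
        self._worker.join()
```

The application pays only the cost of a queue insertion per record, and the submission cost is moved off its critical path.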

18 Contact Info
Paul Groth
pg03r@ecs.soton.ac.uk
www.pasoa.org
- use case descriptions
- papers
- PReServ software

19 Configuration
- Red Hat Linux 9.1 on VMware on Windows XP
- Pentium P4 2.8 GHz, 1.5 GB RAM
- PReServ on another machine
- Database backend: Berkeley JDB
- 100 Mb local Ethernet

