VisTrails Second Provenance Challenge Tommy Ellkvist David Koop Juliana Freire Joint work with: Erik Andersen, Steven P. Callahan, Emanuele Santos, Carlos.

Presentation transcript:

VisTrails Second Provenance Challenge Tommy Ellkvist David Koop Juliana Freire Joint work with: Erik Andersen, Steven P. Callahan, Emanuele Santos, Carlos E. Scheidegger, Cláudio Silva, and Huy T. Vo

Outline VisTrails Introduction VisTrails Demo Provenance Model and API Challenge Results Issues and Future Work

VisTrails Comprehensive provenance infrastructure for computational tasks Support for exploratory tasks such as visualization and data mining Workflows are iteratively refined as users generate and test hypotheses New change-based provenance model Uniformly captures data and workflow provenance

Change-based Provenance Provenance is stored as a tree of actions add module add connection

Provenance: Storing Actions Each change writes new actions to the tree
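The action tree described above can be sketched in a few lines of Python. This is a minimal illustration, not the VisTrails implementation: the class names (Action, Workflow, VersionTree), the two action kinds, and the module names are all assumptions made for the example. It shows how a workflow version is never stored directly but is rebuilt by replaying the actions on the path from the root to that version.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple


@dataclass
class Action:
    """One node of the version tree: a single change to a workflow."""
    id: int
    parent: Optional[int]  # None for the first action (empty workflow before it)
    kind: str              # "add_module" or "add_connection" (illustrative kinds)
    payload: tuple


@dataclass
class Workflow:
    modules: Set[str] = field(default_factory=set)
    connections: Set[Tuple[str, str]] = field(default_factory=set)


class VersionTree:
    """Hypothetical store: every change appends a new action node."""

    def __init__(self) -> None:
        self.actions: Dict[int, Action] = {}

    def add_action(self, parent: Optional[int], kind: str, payload: tuple) -> int:
        new_id = len(self.actions) + 1
        self.actions[new_id] = Action(new_id, parent, kind, payload)
        return new_id

    def materialize(self, version: int) -> Workflow:
        # Walk from the requested version up to the root, collecting actions.
        path = []
        current: Optional[int] = version
        while current is not None:
            action = self.actions[current]
            path.append(action)
            current = action.parent
        # Replay the actions in root-to-version order.
        workflow = Workflow()
        for action in reversed(path):
            if action.kind == "add_module":
                workflow.modules.add(action.payload[0])
            elif action.kind == "add_connection":
                workflow.connections.add(action.payload)
        return workflow


tree = VersionTree()
v1 = tree.add_action(None, "add_module", ("reader",))
v2 = tree.add_action(v1, "add_module", ("isosurface",))
v3 = tree.add_action(v2, "add_connection", ("reader", "isosurface"))
v4 = tree.add_action(v2, "add_module", ("histogram",))  # a branch from v2
```

Here v3 and v4 are sibling versions that share the history up to v2, so the tree records both the workflow and how it evolved.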

Change-based Provenance Data provenance: where does a specific data product come from? Workflow evolution: how has workflow structure changed over time? Treat workflow versions as data: store provenance of workflows

Layered Provenance

VisTrails Provenance Normalized information: no redundancy! Each layer provides more specific information but refers to parent layers Workflow Evolution → Workflow → Execution Extensible storage options Support for both relational and XML Flexible annotation framework: users can specify application-specific provenance information

Provenance for Reproducibility and Beyond Infrastructure for querying and reusing provenance Query workflows by example Create workflows by analogy Collaborative exploration Scalable derivation of data products

VisTrails Demo

Supporting Different Provenance Backends VisTrails has powerful tools to query and reuse provenance information There are many powerful workflow systems that produce such information Problem: How to integrate different provenance backends? Our approach: A mediation-based approach to provenance interoperability

Mediator Architecture Mapping from global schema to data source specific schema

Mediated Provenance Mapping from general model to engine-specific model

Combining Provenance Establish model Produce an API for this model Wrap provenance access for each system so that queries become native over their provenance data
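The three steps above (establish a model, produce an API, wrap each system) might look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the class names, the in-memory stand-in for a real backend, and the "vt"/"tav" prefixes are hypothetical and do not come from the slides. A real wrapper would translate each call into a native query (XPath, SPARQL, SQL) over that system's own provenance store.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ProvenanceWrapper(ABC):
    """Common API that every system-specific wrapper implements."""

    @abstractmethod
    def get_executed_modules(self, wf_exec: str) -> List[str]:
        """Return the modules that ran in the given workflow execution."""


class InMemoryWrapper(ProvenanceWrapper):
    """Stand-in backend for the sketch; holds a dict instead of a real store."""

    def __init__(self, log: Dict[str, List[str]]) -> None:
        self._log = log

    def get_executed_modules(self, wf_exec: str) -> List[str]:
        return list(self._log.get(wf_exec, []))


class Mediator:
    """Routes a query to the wrapper that owns the identifier's prefix."""

    def __init__(self) -> None:
        self._wrappers: Dict[str, ProvenanceWrapper] = {}

    def register(self, prefix: str, wrapper: ProvenanceWrapper) -> None:
        self._wrappers[prefix] = wrapper

    def get_executed_modules(self, qualified_id: str) -> List[str]:
        # Identifiers are assumed to look like "<system prefix>:<local id>".
        prefix, _, local_id = qualified_id.partition(":")
        wrapper = self._wrappers[prefix]
        # Re-qualify results so they stay unambiguous across systems.
        return [f"{prefix}:{m}" for m in wrapper.get_executed_modules(local_id)]


mediator = Mediator()
mediator.register("vt", InMemoryWrapper({"exec1": ["reader", "isosurface"]}))
mediator.register("tav", InMemoryWrapper({"run9": ["convert"]}))
```

The caller only ever sees the common API; which backend answers a query is decided by the identifier prefix.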

Provenance Model Follows the layered architecture Versions map to workflows Workflows are modeled as graphs Parameters capture module state User-defined annotations are available at each layer of the model Module Definition stores information about the computational pieces

Provenance Model

Provenance API
Implements common access queries and operations over the provenance model
Examples:
  getParent(module)
  getChildren(module)
  getUpstream(module)
  getDownstream(module)
  getAnnotations(module | workflow | …)
  getDataItems(module_exec)
  getParameters(module)
  getVersion(time)
  getExecutedModules(workflow)
  getConnection(data_item)
  getPorts(connection)
  findModulesByParameter(search_params)
  findModulesByAnnotation(search_params)
  findExecutionsByAnnotation(search_params)
  findVersionsByModules(search_params)
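Because workflows are modeled as graphs, calls such as getUpstream and getDownstream amount to graph traversals. The sketch below is one plausible reading, not the API's actual implementation: the function names are Python-style renderings of the slide's names, and representing connections as (source, target) pairs is an assumption. It treats "upstream" as every module reachable by following connections backwards.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Connection = Tuple[str, str]  # (source module, target module), assumed layout


def _reachable(start: str, neighbors: Dict[str, List[str]]) -> Set[str]:
    """Breadth-first search; returns every node reachable from start."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def get_upstream(connections: Iterable[Connection], module: str) -> Set[str]:
    """All modules whose output can flow into `module`."""
    parents: Dict[str, List[str]] = {}
    for src, dst in connections:
        parents.setdefault(dst, []).append(src)
    return _reachable(module, parents)


def get_downstream(connections: Iterable[Connection], module: str) -> Set[str]:
    """All modules that can receive data derived from `module`."""
    children: Dict[str, List[str]] = {}
    for src, dst in connections:
        children.setdefault(src, []).append(dst)
    return _reachable(module, children)


# Illustrative workflow; the module names are made up for the example.
connections = [
    ("reader", "align"),
    ("align", "reslice"),
    ("reslice", "softmean"),
    ("reader", "histogram"),
]
```

With this workflow, the modules upstream of "softmean" are "reslice", "align", and "reader", while "histogram" sits on a separate branch.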

Provenance API Example
getExecutedModules(wf_exec)
VisTrails (XPath)
    def getExecutedModules(self, wf_exec):
        newdataitems = []
        q = + wf_exec.pid.key +
        dataitems = self.logcontext.xpathEval(q)
Pasoa (XPath)
    def getExecutedModules(self, wf_exec):
        q = "//ps:relationshipPAssertion[ps:localPAssertionId='" + wf_exec.pid.key + "']/ps:relation"
        dataitems = self.context.xpathEval(q)
Taverna (SPARQL)
    def getExecutedModules(self, wf_exec):
        " "
        q = ''' SELECT ?mi FROM WHERE { ?mi } '''
        return self.processQueryAsList(q, pModuleInstance)
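The query strings in the transcript above are partly lost, so the following is a self-contained stand-in rather than a reconstruction. It uses Python's standard xml.etree.ElementTree instead of the xpathEval call shown in the slide, and the log layout (workflowExec and moduleExec elements, their id and name attributes) is entirely hypothetical; it is not the VisTrails, Pasoa, or Taverna schema. It only illustrates the shape of an XPath-backed getExecutedModules wrapper.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical execution log; element and attribute names are assumptions.
LOG_XML = """
<log>
  <workflowExec id="7">
    <moduleExec moduleId="1" name="reader"/>
    <moduleExec moduleId="2" name="align_warp"/>
  </workflowExec>
  <workflowExec id="8">
    <moduleExec moduleId="3" name="softmean"/>
  </workflowExec>
</log>
"""


def get_executed_modules(root: ET.Element, wf_exec_id: str) -> List[str]:
    """Return the names of the modules recorded for one workflow execution."""
    # ElementTree supports a limited XPath subset, including [@attr='value'].
    query = f"./workflowExec[@id='{wf_exec_id}']/moduleExec"
    return [element.get("name") for element in root.findall(query)]


root = ET.fromstring(LOG_XML.strip())
executed = get_executed_modules(root, "7")
```

Each backend would supply its own query text, while the function signature stays the same for callers.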

Provenance API Results
Implemented queries for each system and a combination of all three
Annotation issues for a couple of queries
Example: Query 1 Results
  vt3:4 --> vt3:7
  vt3:1 --> vt3:4
  vt3:0 --> vt3:1
  pas2: --> vt3:0
  myg1:urn: --> pas2:
  myg1:urn: --> pas2:
  myg1:urn: --> pas2:
  myg1:urn: --> pas2:
  myg1:urn: --> myg1:urn:
  myg1:urn: --> myg1:urn:
  myg1:urn: --> myg1:urn:
  myg1:urn: --> myg1:urn:

Provenance API Integration Developed VisTrails Provenance Query Language for first challenge Plan to integrate API with query language Plan to integrate query language with VisTrails interfaces

Interoperability Issues Uniquely identifying intermediate results Intermediate file names were not specified and varied Tracing ids is difficult for users; this should be transparent A common query language should use concepts familiar to users Mediator vs. Warehousing approach

Performance Issues Redundant information can make queries inefficient What is the best storage backend? RDBMS vs. XML database? What is the best data model? XML vs. Relational vs. RDF? Need good benchmarks with large data!

Questions?

Mediated Provenance (diagram labels): User queries; Prov API; General Provenance Model; wrapper; Taverna, Pasoa…; mapping from generic provenance model into the models of different systems

Mediator Architecture (diagram labels): User SQL/ODBC queries; Mediator; Global Schema; wrapper; three Data Sources; mapping from global schema into source schemas