IPAW'08 – Salt Lake City, Utah, June 2008 Data lineage model for Taverna workflows with lightweight annotation requirements Paolo Missier, Khalid Belhajjame,

Slides:



Advertisements
Similar presentations
OMII-UK Steven Newhouse, Director. © 2 OMII-UK aims to provide software and support to enable a sustained future for the UK e-Science community and its.
Advertisements

Querying Workflow Provenance Susan B. Davidson University of Pennsylvania Joint work with Zhuowei Bao, Xiaocheng Huang and Tova Milo.
Provenance GGF18 Kepler/COW+RWS, Kepler/COW+RWS, Bowers, McPhiilips et al. Provenance Management in a COllection-oriented Scientific Workflow.
June 22-23, 2005 Technology Infusion Team Committee1 High Performance Parallel Lucene search (for an OAI federation) K. Maly, and M. Zubair Department.
Object-Oriented Analysis and Design
Building Scientific Workflows with Taverna and BPEL: a Comparative Study in caGrid Wei Tan 1, Paolo Missier 2, Ravi Madduri 1, Ian Foster 1 1 University.
A First Attempt towards a Logical Model for the PBMS PANDA Meeting, Milano, 18 April 2002 National Technical University of Athens Patterns for Next-Generation.
Integrated Scientific Workflow Management for the Emulab Network Testbed Eric Eide, Leigh Stoller, Tim Stack, Juliana Freire, and Jay Lepreau and Jay Lepreau.
Components and Architecture CS 543 – Data Warehousing.
FREMA: e-Learning Framework Reference Model for Assessment Design Patterns for Wrapping Similar Legacy Systems with Common Service Interfaces Yvonne Howard.
1 Service Discovery using Diane Service Descriptions Ulrich Küster and Birgitta König-Ries University Jena Germany
1 ES 314 Advanced Programming Lec 2 Sept 3 Goals: Complete the discussion of problem Review of C++ Object-oriented design Arrays and pointers.
CIS607, Fall 2005 Semantic Information Integration Article Name: Clio Grows Up: From Research Prototype to Industrial Tool Name: DH(Dong Hwi) kwak Date:
Mapping Physical Formats to Logical Models to Extract Data and Metadata Tara Talbott IPAW ‘06.
Chapter 10 Architectural Design
Database Systems Group Department for Mathematics and Computer Science Lars Hamann, Martin Gogolla, Mirco Kuhlmann OCL-based Runtime Monitoring of JVM.
1 Yolanda Gil Information Sciences InstituteJanuary 10, 2010 Requirements for caBIG Infrastructure to Support Semantic Workflows Yolanda.
Deciding Semantic Matching of Stateless Services Duncan Hull †, Evgeny Zolin †, Andrey Bovykin ‡, Ian Horrocks †, Ulrike Sattler † and Robert Stevens †
Context Tailoring the DBMS –To support particular applications Beyond alphanumerical data Beyond retrieve + process –To support particular hardware New.
Architecture-Based Runtime Software Evolution Peyman Oreizy, Nenad Medvidovic & Richard N. Taylor.
Grant Number: IIS Institution of PI: Arizona State University PIs: Zoé Lacroix Title: Collaborative Research: Semantic Map of Biological Data.
Understanding the utility and fitness of Workflow Provenance for Experiment Reporting Pınar Alper, Supervisor: Carole A. Goble 1.
Introduction to MDA (Model Driven Architecture) CYT.
Applying the Semantic Web at UCHSC - Center for Computational Pharmacology Ian Wilson.
Usage of `provenance’: A Tower of Babel Luc Moreau.
Miguel Branco CERN/University of Southampton Enabling provenance on large-scale e-Science applications.
Taverna and my Grid Open Workflow for Life Sciences Tom Oinn
IPAW'08 – Salt Lake City, Utah, June 2008 Exploiting provenance to make sense of automated decisions in scientific workflows Paolo Missier, Suzanne Embury,
Towards Low Overhead Provenance Tracking in Near Real-Time Stream Filtering Nithya N. Vijayakumar, Beth Plale DDE Lab, Indiana University {nvijayak,
Querying Large Databases Rukmini Kaushik. Purpose Research for efficient algorithms and software architectures of query engines.
© DATAMAT S.p.A. – Giuseppe Avellino, Stefano Beco, Barbara Cantalupo, Andrea Cavallini A Semantic Workflow Authoring Tool for Programming Grids.
Aude Dufresne and Mohamed Rouatbi University of Montreal LICEF – CIRTA – MATI CANADA Learning Object Repositories Network (CRSNG) Ontologies, Applications.
SPARQL Query Graph Model (How to improve query evaluation?) Ralf Heese and Olaf Hartig Humboldt-Universität zu Berlin.
EGEE-II INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Provenance Challenge gLite Job Provenance.
Paolo Missier (1), Bertram Luda ̈ scher (2), Shawn Bowers (3), Saumen Dey (2), Anandarup Sarkar (3), Biva Shrestha (4), Ilkay Altintas (5), Manish Kumar.
The Functional Genomics Experiment Object Model (FuGE) Andrew Jones, School of Computer Science, University of Manchester MGED Society.
8 1 Chapter 8 Advanced SQL Database Systems: Design, Implementation, and Management, Seventh Edition, Rob and Coronel.
Quality views: capturing and exploiting the user perspective on data quality Paolo Missier, Suzanne Embury, Mark Greenwood School of Computer Science University.
Data Grid Research Group Dept. of Computer Science and Engineering The Ohio State University Columbus, Ohio 43210, USA David Chiu & Gagan Agrawal Enabling.
1. 2 Preface In the time since the 1986 edition of this book, the world of compiler design has changed significantly 3.
Database Concepts Track 3: Managing Information using Database.
REDUX – automatic capture, efficient storage Roger S. Barga Microsoft Research (MSR) Luciano Digiampietri University of Campinas, Sao Paolo, Brazil.
Recording the Context of Action for Process Documentation Ian Wootten Cardiff University, UK
Lineage Tracing for General Data Warehouse Transformations Yingwei Cui and Jennifer Widom Computer Science Department, Stanford University Presentation.
MyGrid/Taverna Provenance Daniele Turi University of Manchester OMII f2f Meeting, London, 19-20/4/06.
Gedae, Inc. Gedae: Auto Coding to a Virtual Machine Authors: William I. Lundgren, Kerry B. Barnes, James W. Steed HPEC 2004.
Generating Software Documentation in Use Case Maps from Filtered Execution Traces Edna Braun, Daniel Amyot, Timothy Lethbridge University of Ottawa, Canada.
Khalid Belhajjame 1, Paolo Missier 2, and Carole A. Goble 1 1 University of Manchester 2 University of Newcastle Detecting Duplicate Records in Scientific.
Faculty Advisor – Dr. Suraj Kothari Client – Jon Mathews Team Members – Chaz Beck Marcus Rosenow Shaun Brockhoff Jason Lackore.
Query Optimization CMPE 226 Database Systems By, Arjun Gangisetty
Knowledge Modeling and Discovery. About Thetus Thetus develops knowledge modeling and discovery infrastructure software for customers who: Have high-value.
DANIELA KOLAROVA INSTITUTE OF INFORMATION TECHNOLOGIES, BAS Multimedia Semantics and the Semantic Web.
Design-Directed Programming Martin Rinard Daniel Jackson MIT Laboratory for Computer Science.
Copyright 2007, Information Builders. Slide 1 iWay Web Services and WebFOCUS Consumption Michael Florkowski Information Builders.
Data Grid Research Group Dept. of Computer Science and Engineering The Ohio State University Columbus, Ohio 43210, USA David Chiu and Gagan Agrawal Enabling.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
Data Management: Data Processing Types of Data Processing at USGS There are several ways to classify Data Processing activities at USGS, and here are some.
Of 24 lecture 11: ontology – mediation, merging & aligning.
These exercises highlight the services that do not perform biological functions, but are vital for running life science workflows.
Alan Williams, Donal Fellows, Finn Bacall,
SONATA: Query-Driven Network Telemetry
Chapter 10: Process Implementation with Executable Models
Names, Binding, and Scope
Lecture 15 (Notes by P. N. Hilfinger and R. Bodik)
Multiple Aspect Modeling of the Synchronous Language Signal
NSDL Data Repository (NDR)
– JukeBox – transparency, flexibility, speed and comfort!
Probabilistic Databases
Introduction to Data Structure
Overview of Workflows: Why Use Them?
Presentation transcript:

IPAW'08 – Salt Lake City, Utah, June 2008 Data lineage model for Taverna workflows with lightweight annotation requirements Paolo Missier, Khalid Belhajjame, Jun Zhao, Carole Goble School of Computer Science The University of Manchester, UK

IPAW'08 – Salt Lake City, Utah, June 2008 Context and scope Ongoing work on a new provenance component for Taverna myGrid consortium Scope: capture raw provenance events –data transformations, data transfers store one lineage graph for each dataflow execution query over single or multiple lineage graphs

IPAW'08 – Salt Lake City, Utah, June 2008 Example (Taverna) dataflow QTL -> genes -> Kegg pathways

IPAW'08 – Salt Lake City, Utah, June 2008 Some user questions on lineage on a single workflow run: –find all genes that participate in some pathway p –find all pathways derived from Uniprot genes –describe the complete derivation of each pathway in which gene g is involved on a collection of runs: –find all distinct pathways produced by runs of a dataflow [over a period of time, produced by a member of my group,...]

IPAW'08 – Salt Lake City, Utah, June 2008 Shortcomings of lineage data Granularity –risk of returning trivial answers –“all outputs depend on all inputs” Semantics –Results not expressed in the language of the designer Abstraction level, noise – the “latent data model” –many processors are irrelevant – shims, mundane tasks

IPAW'08 – Salt Lake City, Utah, June 2008 The need for selective annotations As long as processors are black boxes, these remain difficult problems Adding annotations to processors is tempting Scope of this work: to explore the “gray box” region simple annotations with minimal semantics driving principle: justified by technical benefits –precision of query results –efficiency of query processing

IPAW'08 – Salt Lake City, Utah, June 2008 Test dataflow model P 1 extract query terms P 3 query1 P 4 query 2 P 6 merge results P 1 V I1 documents configuration merged results number of duplicates P 2 query prep P 5 post-proc P 7 sort P 1 V I2 P 1 V O1 P 2 V I1 P 4 V I1 P 2 V O1 P 4 V O1 P 3 V I1 P 5 V I1 P 3 V O1 P 5 V O1 P 6 V I1 P 6 V I2 P 6 V O1 P 6 V O2 P7VIP7VI P7VOP7VO

IPAW'08 – Salt Lake City, Utah, June 2008 Two main annotation types Focusing: processor selection some processors are more interesting than others “boring” annotations query-time user selection of interesting processors Precision: fine-grained lineage tracing goal: trace lineage of individual items within a collection

IPAW'08 – Salt Lake City, Utah, June 2008 Abstraction by modularization Lucene_query NERecognize extract diseases from OMIM shims

IPAW'08 – Salt Lake City, Utah, June 2008 Abstraction by selection select

IPAW'08 – Salt Lake City, Utah, June 2008 Abstraction by selection select

IPAW'08 – Salt Lake City, Utah, June 2008 Focusing – processor selection P 1 extract query terms P 3 query1 P 4 query 2 P 6 merge results P 1 V I1 P 2 query prep P 5 post-proc P 7 sort P 1 V I2 P 1 V O1 P 2 V I1 P 4 V I1 P 2 V O1 P 4 V O1 P 3 V I1 P 5 V I1 P 3 V O1 P 5 V O1 P 6 V I1 P 6 V I2 P 6 V O1 P 6 V O2 P7VIP7VI P7VOP7VO = b = a1= a2 = b = g P4 is the only interesting processor Assume all values atomic Query: lineage(P 7 V O,{P 4 })‏ Goal: avoid recursive queries on instance tables Idea: use recursion on static model to generate a targeted query execute query only once

IPAW'08 – Salt Lake City, Utah, June 2008 Precision: elements within collections Problem: xform() also applies to list values It may be impossible to trace individual elements –“which pathways (out) depend on which genes (in)”? Goal: extend the query generation idea just sketched to trace element-level lineage within collections Approach: exploit static typing of Taverna processors Goal: extend the query generation idea just sketched to trace element-level lineage within collections Approach: exploit static typing of Taverna processors P1P1 P 1 V o : l(s) = [a, b, c] P 2 V I : l(s) = [a, b, c] P2P2 P1P1 P 1 V o : l(s) = [a, b, c] P 2 V I : s [a, b, c] P2P2 Taverna resolves mismatches on nesting levels: (map P 2 [a,b,c])‏

IPAW'08 – Salt Lake City, Utah, June 2008 Loss of precision in transformations PV I : l(s) = [a, b, c] P PV O : s = x possible behaviours: selection of an element aggregation function f() useful annotation: lineage(PV O ) = f(PV I )‏ PV I : s = a P PV O : s = a' PV I : s = a P PV O : l(s) = [x, y, z] P PV I : l(s) = [a, b, c] PV O : l(s) = [x, y] PV O : l(s) = [a',b',c'] “lossless” transformations only useful annotation: P is index-preserving: PV O [i] = PV I [i] lineage(PV O [i]) = PV I [i] x  [a, b, c] lossy x  [a, b, c] y  [a, b, c]

IPAW'08 – Salt Lake City, Utah, June 2008 Cooperative processors PV I : l(s) = [a, b, c] P PV O : s = x P PV I : l(s) = [a, b, c] PV O : l(s) = [x, y] –Passive processors do not contribute explicit provenance info –Cooperative processors actively feed metadata to the lineage service Dynamic annotations: Static annotations: aggregation f()‏ PV O [i] = PV I [i] selection: x = PV I [i]‏ sorting: PV O =  (PV I )‏‏

IPAW'08 – Salt Lake City, Utah, June 2008 Other annotations Distinction between configuration and input data –PVI 3 is a configuration parameter –compare effect of different config. across multiple runs specific functional dependencies [ PV I1, PV I2 ]  PV O stateless processor –execute process  retrieve provenance PV I1 P PV O PV I2 PV I3 More evaluation needed on these

IPAW'08 – Salt Lake City, Utah, June 2008 Towards a 2 tier provenance model Taverna runtime P1 P2 P3 P4 P5 P6 P1 P2 P3 P4 P5 P6 P1 P2 P3 P4 P5 P6 dataflow topology + raw lineage events Lineage service lineage database (RDB)‏ structural annotations query semantic resource annotations “describe the derivation of each pathway through Kegg, in which gene g is involved” reference ontologies Semantic overlays

IPAW'08 – Salt Lake City, Utah, June 2008 Conclusions A data lineage model for Taverna workflows Raw lineage data has shortcomings A few, selected lightweight annotations added in a principled way –win-win: –helpful to users –and enable query optimization Form the base layer in a broader approach to efficient querying of semantic provenance for e- science Ongoing implementation