Towards a Provenance Architecture Karen Schuchardt PNNL.



Kepler Provenance Meeting, Jan 05

Outline
- Past and present work
- Use cases
- Thoughts on workflow provenance and architectures

Past and Present Provenance Work
- Ecce Chemistry Environment
- Electronic Laboratory Notebooks
- Collaboratory for Multi-Scale Chemical Science (CMCS)
- Scientific Annotation Middleware
- Towards a Semantic Data Grid for Systems Science
Timeline markers on the slide: mid 90s, late 90s

Ecce Chemistry Environment
- Chemistry-based calculation workflow
- Provenance captured as the user performs actions
  - W's (who, what, when)
  - Job submission/status info
  - Relationships (XLinks) between calculations, outputs, inputs, etc.
  - Linkbase for molecular dynamics multi-step processes
- WebDAV-based server captures all inputs, outputs, and metadata
- Provenance used to provide an at-a-glance summary of work performed, duplicate and rerun, search
- Bind rules based on types and relationships
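To make the "captured as the user performs actions" idea concrete, here is a minimal Python sketch. It is not Ecce code: the `record_action` and `derived_from` helpers, the in-memory list standing in for the WebDAV server, and the file names are all hypothetical.

```python
from datetime import datetime, timezone

# Stand-in for the WebDAV-backed metadata store (assumption for illustration).
provenance_log = []

def record_action(who, what, inputs=(), outputs=()):
    """Append one who/what/when entry plus input/output relationships."""
    entry = {
        "who": who,
        "what": what,
        "when": datetime.now(timezone.utc).isoformat(),
        "inputs": list(inputs),
        "outputs": list(outputs),
    }
    provenance_log.append(entry)
    return entry

def derived_from(resource):
    """Follow the output -> input relationship one step back."""
    return [i for e in provenance_log
            if resource in e["outputs"]
            for i in e["inputs"]]

# The application calls record_action at the moment the user acts.
record_action("kls", "geometry optimization",
              inputs=["h2o.xyz"], outputs=["h2o.out"])
sources = derived_from("h2o.out")
```

Because the relationships are stored alongside the who/what/when fields, summary, rerun, and search features can all read from the same log.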

Electronic Laboratory Notebooks
- Hierarchical, chronological: chapters/pages/notes
- File upload, sketch, text, equations, forms, image capture, ...
- Add/view/search notes
- Records functionality:
  - Non-repudiation: digital signatures and timestamps
  - Persistence/completeness: write-once/no deletions/audit trail
  - Standardized lifecycle: signing/witnessing policies, archiving, retention schedules, ...
- Now based on WebDAV
- Provenance
  - Structure of notebook
  - Records data
  - Mimetype-based functionality

Collaboratory for Multi-Scale Chemical Sciences (CMCS)
- Dublin Core for basic pedigree: title, creator, dates, publisher, is-referenced-by, references, replaces, is-replaced-by, has-version
- Dublin Core Element Set and Qualified Dublin Core
- Both XML and RDF to encode metadata values
- Use of XLink to express values of relationships
- CMCS properties for chemical science to enable searching: species name, CAS, chemical properties, and chemical formula
- CMCS properties for defining scientific data: has-inputs, has-outputs, and is-part-of-project
- CMCS properties for scientific publication and peer review annotations: is-sanctioned-by
- Flexible infrastructure for addition of new metadata
- As new metadata is added to the infrastructure, current apps will not break!
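The "current apps will not break" point can be illustrated with a small Python sketch. The two Dublin Core namespace URIs are the standard ones; the `urn:example:cmcs:` namespace, the record values, and the `basic_pedigree` helper are placeholders invented for this example, not the actual CMCS schema.

```python
DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core Element Set
DCTERMS = "http://purl.org/dc/terms/"     # Qualified Dublin Core terms
CMCS = "urn:example:cmcs:"                # placeholder, not the real CMCS namespace

# One resource's metadata as namespaced property -> value (values are invented).
record = {
    DC + "title": "Thermochemical data for OH",
    DC + "creator": "A. Researcher",
    DCTERMS + "isReferencedBy": ["urn:example:doc/mechanism-12"],
    CMCS + "has-inputs": ["urn:example:doc/input-deck-7"],
    CMCS + "chemical-formula": "OH",
}

def basic_pedigree(rec):
    """An older application that only understands the Dublin Core elements."""
    return {k[len(DC):]: v for k, v in rec.items() if k.startswith(DC)}

before = basic_pedigree(record)

# A new property is added to the infrastructure later on.
record[CMCS + "is-sanctioned-by"] = ["urn:example:panel/review-3"]

after = basic_pedigree(record)  # unchanged: unknown properties are ignored
```

Keeping each property in its own namespace is what lets readers skip properties they do not recognize.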

Scientific Annotation Middleware
- Provides a node plus metadata/relationship view of underlying data sources
- Support put/get/search/access control of arbitrary data/metadata
- Configurable metadata extraction from binary/ASCII/XML files
- Configurable data translation
- Semantic/graph queries
- RDF export
- Notebook services (page display, signatures, timestamps, ...)
- Pluggable security
- Direct connection between metadata and resource limits use as next generation provenance store

Towards a Semantic Data Grid
- Explore frameworks for advanced model-driven data integration capabilities
- Seamlessly integrate files, databases
- Automated scientific workflow mechanisms
- Capture, represent, and disseminate knowledge
- Identify changes via discovery mechanisms
- Internally funded 2 year project

Towards a Semantic Data Grid
Example question: What proteins in my organism(s) are both predicted and shown by experiment to interact with E. coli?
Resources required:
- Microarray spreadsheets
- NCBI data services
- BIND database
- DIP database
- Work-group specific databases
- Other data services
Other labels on the slide: Extraction, Translation, Merging, HPC Services, Public Web services, Discovery

Use Case - Personal Records
Capturing and organizing the display of provenance simplifies the job of keeping track of activities performed over the course of a long research process.
Example: A bioinformaticist performs data integration/analysis for many diverse projects. After 6 months, he/she can't remember what a particular result pertained to or how it was generated.

Use Case - Verifiability
Data generated from instruments/experiments undergoes numerous automatic processes before becoming available to researcher(s).
Example: Data from high-throughput biology experiments runs through several automated, and in some cases manual, processes before it becomes available to the bioinformaticist. The bioinformaticist often does not trust the data. They want to know who created it, what was done to it, when it was generated...

Use Case - Applicability
Increasingly, research problems span disciplines or scales. Though data needs to move across these boundaries, it is often a manual process involving personal communications.
Example: In the combustion multi-scale research environment, data generated at one scale (e.g. thermochemical data) serves as input to successive scales (e.g. mechanisms). But it's not that simple: we must be able to determine the applicability of available data. Are the theoretical underpinnings under which it was generated consistent with the intended use?

Use Case - Best Practices
By capturing and providing access to provenance of prior work, best practices can be shared.
Example: This is a little bit hypothetical but... best practices can be shared by sharing workflow definitions or by viewing provenance (and inputs) from instances of workflows.

Types of Provenance in Workflow Environment
- Interaction provenance: data that moves between services
- State provenance: data known only to the actor itself
- Observable provenance: start/completion times, error detection
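One way to picture the three kinds is as tags on the assertions a workflow engine records while running an actor. The following Python sketch is illustrative only: `Kind`, `Assertion`, and `run_actor` are hypothetical names, and the convention that an actor returns `(result, internal_state)` is an assumption made to keep the example short.

```python
import time
from dataclasses import dataclass
from enum import Enum

class Kind(Enum):
    INTERACTION = "interaction"  # data that moved between services
    STATE = "state"              # data known only to the actor itself
    OBSERVABLE = "observable"    # timing and errors the engine can see

@dataclass
class Assertion:
    kind: Kind
    actor: str
    content: dict

def run_actor(name, func, message, log):
    """Run one actor and record all three kinds of provenance."""
    log.append(Assertion(Kind.INTERACTION, name, {"input": message}))
    started = time.time()
    try:
        result, internal = func(message)
        error = None
    except Exception as exc:  # the engine observes the failure
        result, internal, error = None, {}, repr(exc)
    log.append(Assertion(Kind.OBSERVABLE, name,
                         {"started": started, "finished": time.time(),
                          "error": error}))
    if internal:  # only the actor can supply this part
        log.append(Assertion(Kind.STATE, name, internal))
    if result is not None:
        log.append(Assertion(Kind.INTERACTION, name, {"output": result}))
    return result

log = []
out = run_actor("scale", lambda m: (m * 2, {"factor": 2}), 21, log)
```

The engine can record interaction and observable provenance on its own; state provenance appears only if the actor volunteers it.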

Other Provenance
Other applications will record data:
- Pedigree/provenance
- Experiment metadata
- Project organization
- Categorization
- Detected features
- Instrument logs
- Digital signatures
- Endorsements
- Community annotations
- Other workflow engines

Logical Architecture
[Diagram; component labels only. Extracted from eScience Strawman - Moreau]
- Provenance Store(s): Query Interface, Submission Interface
- User Recording Tools: Portlets, Annotator, Notebooks, Science Applications
- Client Query Library, Client Submission Library
- Experiment Services: Workflow engine, Domain specific services
- Presentation Services: Visualizer/Browser, Difference Visualizer, Workflow construction
- Processing Services: Difference Analyzer, Quality Analyzer
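The split between a submission interface and a query interface can be sketched in a few lines of Python. Everything here is hypothetical: the `ProvenanceStore` class, the flat record shape (`run`, `name`, `value`), and the `difference` function standing in for a "Difference Analyzer" processing service are assumptions for illustration, not part of the strawman architecture itself.

```python
class ProvenanceStore:
    """In-memory stand-in for a provenance store with two interfaces."""

    def __init__(self):
        self._records = []

    def submit(self, record):
        """Submission interface: used by recording tools and workflow engines."""
        self._records.append(dict(record))

    def query(self, **criteria):
        """Query interface: used by presentation and processing services."""
        return [r for r in self._records
                if all(r.get(k) == v for k, v in criteria.items())]

def difference(store, run_a, run_b):
    """Processing service: report settings that differ between two runs."""
    a = {r["name"]: r["value"] for r in store.query(run=run_a)}
    b = {r["name"]: r["value"] for r in store.query(run=run_b)}
    return {k: (a.get(k), b.get(k))
            for k in sorted(a.keys() | b.keys())
            if a.get(k) != b.get(k)}

store = ProvenanceStore()
store.submit({"run": "r1", "name": "basis", "value": "6-31G"})
store.submit({"run": "r1", "name": "theory", "value": "DFT"})
store.submit({"run": "r2", "name": "basis", "value": "cc-pVDZ"})
store.submit({"run": "r2", "name": "theory", "value": "DFT"})
diff = difference(store, "r1", "r2")
```

The analyzer only depends on `query`, so recorders and consumers can evolve independently of each other.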

Components of Physical Architecture
- One or more RDF triple stores
- Global naming service
- Arbitrary data stores for data referenced by the provenance
- Security services (pluggable for scalability)
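A toy Python version of these components may help show how they fit together. This is a sketch under assumptions: `NamingService`, `TripleStore`, and `match_all` are hypothetical names, the `urn:uuid:` naming scheme and the `https://data.example.org/...` data URL are invented, and a real deployment would use an actual RDF store rather than a Python set.

```python
import uuid

class NamingService:
    """Mints globally unique names so separate stores can refer to the same thing."""

    def mint(self):
        return "urn:uuid:" + str(uuid.uuid4())

class TripleStore:
    """Minimal subject/predicate/object store with wildcard matching."""

    def __init__(self):
        self._triples = set()

    def add(self, s, p, o):
        self._triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        """Return matching triples; None acts as a wildcard."""
        return sorted(t for t in self._triples
                      if (s is None or t[0] == s)
                      and (p is None or t[1] == p)
                      and (o is None or t[2] == o))

def match_all(stores, **pattern):
    """Query 'one or more' stores and merge the answers."""
    found = set()
    for store in stores:
        found.update(store.match(**pattern))
    return sorted(found)

names = NamingService()
lab, facility = TripleStore(), TripleStore()
result_id = names.mint()
# The provenance only references the data; the bytes live in another data store.
lab.add(result_id, "has-location", "https://data.example.org/run42/output.dat")
facility.add(result_id, "has-inputs", "urn:example:instrument-log/7")
about_result = match_all([lab, facility], s=result_id)
```

The shared global name is what lets statements from the two stores be joined into one description of the result.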

Workflow and Provenance
- Requires binding to provenance service
- Need mechanism to associate provenance from workflow instance: ID? Links?
- Requires communication of service information or other mechanism for actors to contribute state provenance
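The slide leaves the association mechanism open (ID or links). As one possible reading, the Python sketch below takes the ID option: the engine creates a per-instance context that carries a run ID and a handle to the provenance service, and passes it to each actor. `RunContext`, `contribute`, and the `normalize` actor are hypothetical and not part of Kepler.

```python
import uuid

class RunContext:
    """Binds one workflow instance to a provenance sink under a single run ID."""

    def __init__(self, sink):
        self.run_id = "urn:uuid:" + str(uuid.uuid4())
        self._sink = sink  # stand-in for the provenance service binding

    def contribute(self, actor, **state):
        """Called by an actor to record state only it knows about."""
        self._sink.append({"run": self.run_id, "actor": actor, "state": state})

def normalize(values, ctx):
    """Example actor: scales values by their peak and reports the peak used."""
    peak = max(values)
    ctx.contribute("normalize", peak=peak)
    return [v / peak for v in values]

sink = []
ctx = RunContext(sink)
scaled = normalize([1, 2, 4], ctx)
```

Every record carries the same `run_id`, so provenance from different actors in one instance can be gathered with a single query.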

Summary
- We've done a lot of work on provenance but see value in moving to a more flexible architecture
- Workflow engines are just one component that can contribute to the provenance of research results
- Provenance capture should be thought of as a cross-cutting technology
- Models for provenance need to be flexible, allowing arbitrary content
- Provenance services need to be scalable:
  - low-footprint usages for individual applications
  - large experimental facilities
  - virtual organizations