Application of Provenance for Automated and Research Driven Workflows
Tara Gibson
June 17, 2008
Motivation
- Identify provenance models and architectures that will support a variety of real-world scientific research
- Promote collaboration and interoperability
- Review requirements identified by the community
- Identify new requirements from our own use case studies that span a number of domains
Methods
Use case studies
- Encountered two types of workflow:
- Automated (e.g. pipelines)
- User-driven, research-oriented (e.g. digital libraries, data lineage)
Use case type comparison
Sensor Analysis
- SOA-based runtime intrusion detection system to prevent attacks on sensitive systems
- Large-scale data streaming (~30 TB per day)
- Recording full provenance would quickly overwhelm the system, so only significant events are recorded
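The idea of recording only significant events can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Event` fields, the `SIGNIFICANT_KINDS` set, and the `SEVERITY_THRESHOLD` value are hypothetical and are not taken from the system described in the talk.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical sensor event; field names are illustrative only."""
    source: str
    kind: str
    severity: int  # assumed scale: 0 (routine) to 10 (critical)


# Assumed policy: these kinds are always worth recording as provenance.
SIGNIFICANT_KINDS = {"alert", "config_change"}
# Assumed cut-off above which any event is recorded.
SEVERITY_THRESHOLD = 7


def is_significant(event):
    """Return True if the event should be kept as provenance."""
    return event.kind in SIGNIFICANT_KINDS or event.severity >= SEVERITY_THRESHOLD


def record_significant(events):
    """Keep only the significant events, preserving their order."""
    return [e for e in events if is_significant(e)]


stream = [
    Event("sensor-a", "packet", 1),   # routine traffic, dropped
    Event("sensor-a", "alert", 5),    # kept because of its kind
    Event("sensor-b", "packet", 9),   # kept because of its severity
]
recorded = record_significant(stream)
```

In a real streaming system the filter would run incrementally rather than over a list, and the significance policy would be configured per deployment.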
Subsurface Modelling
- Understand how contaminants react and move through environments by simulating experiments that would not otherwise be feasible
- Research often follows many branches of investigation, with complex relationships between simulations
Archive, Data Mining
- Document data context and relationships to improve the effectiveness of the facility
- Use data extraction and harvesting to capture provenance and metadata
- Track relationships between experiments and computations
- Allows for better collaboration and understanding
Requirements Summary
- Record provenance about process, data, and relationships
- Group items together for comparison
- Record arbitrary metadata
- Standards-based search capability
- Examine the process and data that led to a result
- Identify the overall impact on a workflow of changes in process or data
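Several of these requirements (recording relationships, attaching arbitrary metadata, examining what led to a result, and assessing the impact of a change) can be sketched with a toy triple store. This is an illustrative sketch only, not the architecture from the talk: the `ProvenanceStore` class, the `derivedFrom` predicate, and all item names are hypothetical.

```python
from collections import defaultdict, deque


class ProvenanceStore:
    """Toy in-memory store of (subject, predicate, object) triples."""

    def __init__(self):
        self.triples = set()
        self.metadata = defaultdict(dict)

    def add(self, subject, predicate, obj):
        """Record one relationship between two items."""
        self.triples.add((subject, predicate, obj))

    def annotate(self, item, key, value):
        """Attach arbitrary metadata to an item."""
        self.metadata[item][key] = value

    def _neighbours(self, node, upstream):
        # A "derivedFrom" edge points from a result to one of its inputs.
        if upstream:
            return {o for s, p, o in self.triples if s == node and p == "derivedFrom"}
        return {s for s, p, o in self.triples if o == node and p == "derivedFrom"}

    def _reach(self, start, upstream):
        # Breadth-first traversal collecting every reachable item.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in self._neighbours(node, upstream):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def lineage(self, item):
        """Everything (process or data) that led to this item."""
        return self._reach(item, upstream=True)

    def impact(self, item):
        """Everything that would be affected if this item changed."""
        return self._reach(item, upstream=False)


store = ProvenanceStore()
store.add("plot.png", "derivedFrom", "simulation_run_2")
store.add("simulation_run_2", "derivedFrom", "input_deck_v2")
store.add("simulation_run_2", "derivedFrom", "solver_v1.3")
store.add("simulation_run_1", "derivedFrom", "solver_v1.3")
store.annotate("simulation_run_2", "owner", "example_user")
```

Calling `store.lineage("plot.png")` walks back to the inputs and code that produced the plot, while `store.impact("solver_v1.3")` lists every run and product that depends on that solver version. The linear scan in `_neighbours` is chosen for brevity; a store meant to scale would index triples by subject and object.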
Influences on Architecture
Challenges
- Multiple language bindings
- Information overload
- Scalability: should scale to billions of triples
- Augmentation: user annotation
- Filtering
- User/application-specific views
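One way to picture filtering and user-specific views is as a per-role selection of predicates over the same set of triples. The sketch below assumes that model; the role names, predicates, and sample triples are all hypothetical and chosen only to show the shape of the idea.

```python
# Hypothetical provenance triples, including one user annotation.
TRIPLES = [
    ("run_42", "used", "input.dat"),
    ("run_42", "executedOn", "node-17"),
    ("run_42", "startedAt", "2008-06-17T09:00"),
    ("run_42", "annotation", "rerun with finer grid"),
]

# Assumed mapping from a role to the predicates that role cares about.
VIEWS = {
    "scientist": {"used", "annotation"},
    "operator": {"executedOn", "startedAt"},
}


def view(triples, role):
    """Return only the triples whose predicate is visible to the given role."""
    allowed = VIEWS[role]
    return [t for t in triples if t[1] in allowed]


scientist_view = view(TRIPLES, "scientist")
operator_view = view(TRIPLES, "operator")
```

Each role sees a reduced slice of the same underlying record, which is one possible response to information overload; at the scale of billions of triples the selection would need to be pushed into the store's query layer rather than done in application code.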
Questions...