Data Reuse Fitness Assessment Using Provenance

Slides:

Advertisements

Similar presentations

Performance Assessment

Advertisements

Robert J. Mislevy & Min Liu University of Maryland Geneva Haertel SRI International Robert J. Mislevy & Min Liu University of Maryland Geneva Haertel SRI.

The Data Attribution Abdul Saboor PhD Research Student Model Base Development and Software Quality Assurance Research Group Freie.

School of Education, CASEwise: A Case-based Online Learning Environment for Teacher Professional Development Chrystalla.

ESIP Semantic Web Products and Services ‘triples’ “tutorial” aka sausage making ESIP SW Cluster, Jan ed.

Access and Query Task Force Status at F2F1 Simon Miles.

2.An overview of SDMX (What is SDMX? Part I) 1 Edward Cook Eurostat Unit B5: “Central data and metadata services” SDMX Basics course, October 2015.

4 way comparison of Data Citation Principles: Amsterdam Manifesto, CoData, Data Cite, Digital Curation Center FORCE11 Data Citation Synthesis Group.

Data Foundation IG DF Organizing Chairs: Gary Berg-Cross & Peter Wittenburg.

C. Bettini, X.S. Wang, S. Jajodia Secure Data Management (SDM 2005), pp , DOI: / _13. David MacDonald.

RDA, 5th Plenary, San Diego WDS Certification Objective: building trust in the usage of data & data services Michael Diepenbroek Rorie Edmunds Mustapha.

Certificate IV in Project Management Assessment Outline Course Number Qualification Code BSB41507.

SciDataCon 2014, WDS Forum, Dehli WDS Certification Objective: building trust in the usage of data & data services Michael Diepenbroek Rorie Edmunds Mustapha.

Advanced Higher Computing Science

Automatically Calculating the Adherence to License Requirements

Spelling and beyond – Curriculum

CESSDA SaW Training on Trust, Identifying Demand & Networking

Software Project Configuration Management

How to define what you are actually looking for…

DSA and FAIR: a perfect couple

Agreeing about agreements: modelling social contracts, people and data

RDA Data Fabric (DF) Interest Group Peter Wittenburg & Gary Berg-Cross

Classroom Discussions to Support Problem Solving

Justin Buck OceanSITES data Incentives for participation: Data citation & data services Justin Buck

Design Patterns (Chapter 6 of Text Book – Study just 8)

Project: Improving accessibility of digitally created archives

The Systems Engineering Context

QlikView Licensing.

Information Quality Cluster - Fostering Collaborations

Fitness for use: Users of the U. S

SSP4000 Introduction to the Research Process Wk9: Introduction to qualitative research, Part 2 The focus of week 9 is to introduce students to the characteristics.

Active Data Management in Space 20m DG

What can provenance do for me?

PID centric fabric constructed piece by piece

Entry Task #1 – Date Self-concept is a collection of facts and ideas about yourself. Describe yourself in your journal in a least three sentences. What.

World of work How do tasks bring the WoW into the classroom?

Introduction Helena Cousijn, Claire Austin & Michael Diepenbroek

Chapter 10: Process Implementation with Executable Models

Sophia Lafferty-hess | research data manager

The purposes of grading student work

Spelling and beyond Literacy Toolkit HGIOS

Using the EFQM Excellence Model to support the role of a trustee

Measuring Data Quality and Compilation of Metadata

SISAI STATISTICAL INFORMATION SYSTEMS ARCHITECTURE AND INTEGRATION

The Scientific Method scientific method noun

From Observational Data to Information (OD2I IG )

2. An overview of SDMX (What is SDMX? Part I)

2. An overview of SDMX (What is SDMX? Part I)

school self-evaluation and improvement toolkit

Repository Platforms for Research Data Interest Group: Requirements, Gaps, Capabilities, and Progress Robert R. Downs1, 1 NASA.

IS-ENES Cases Seven use cases are listed as data lifecycle steps A B C

SDMX Information Model: An Introduction

Analyzing Student Work Sample 2 Instructional Next Steps

HingX Project Overview

Module 9: “Top Twelve” LC RDA for NASIG - June 1, 2011

WDS/RDA Assessment of Data Fitness for Use Claire Austin, Helena Cousijn, Michael Diepenbroek, Jon Petters.

Applying Use Cases (Chapters 25,26)

Applying Use Cases (Chapters 25,26)

Policy Frameworks: building a firm foundation for your IR

eScience - FAIR Science

WORQ WORKSHOP How to Write International Quality Publications

A modest attempt at measuring and communicating about quality

World of work How do tasks bring the WoW into the classroom?

Making Thinking Visible

ONTOMERGE Ontology translations by merging ontologies Paper: Ontology Translation on the Semantic Web by Dejing Dou, Drew McDermott and Peishen Qi 2003.

Helena Cousijn, Claire Austin, Jonathan Petters & Michael Diepenbroek

Introduction to reference metadata and quality reporting

7. Introduction to the main SDMX objects for metadata exchange

Interoperability of metadata systems: Follow-up actions

Classifications and Linked Open Data Formalizing the structure and content of statistical classifications Item 9.1 Standards Working Group Luxembourg,

Presentation transcript:

Data Reuse Fitness Assessment Using Provenance By Nicholas Car, June 13th, 2017 For the ESIP Information Quality Cluster Teleconference ?

Outline Purpose of my SciDataCon2016 presentation Thoughts on DFRA since then

Outline Purpose of my SciDataCon2016 presentation The paper itself The reason for it The major claim as I now see it Thoughts on DFRA since then

Outline Purpose of my SciDataCon2016 presentation The paper itself The reason for it The major claim as I now see it Thoughts on DFRA since then Actually building graphs RDA’s provenance group

SciDataCon2016 – My Paper “Data Reuse Fitness Assessment Using Provenance” http://www.scidatacon.org/2016/sessions/53/paper/47/

SciDataCon2016 – My Paper “Data Reuse Fitness Assessment Using Provenance” http://www.scidatacon.org/2016/sessions/53/paper/47/ Premise: Assessing the fitness of data for reuse may require knowledge of how that data was produced - provenance

SciDataCon2016 – My Paper “Data Reuse Fitness Assessment Using Provenance” http://www.scidatacon.org/2016/sessions/53/paper/47/ Premise: Assessing the fitness of data for reuse may require knowledge of how that data was produced - provenance Posited solution: Use a standard for provenance (PROV) to enable automated assessments of data fitness

SciDataCon2016 – My Paper “Data Reuse Fitness Assessment Using Provenance” http://www.scidatacon.org/2016/sessions/53/paper/47/ Premise: Assessing the fitness of data for reuse may require knowledge of how that data was produced - provenance Posited solution: Use a standard for provenance (PROV) to enable automated assessments of data fitness Assessment scenarios: Properties of ancestor data Properties of agents involved in data production The methods used in data production

SciDataCon2016 – My Paper “Data Reuse Fitness Assessment Using Provenance” http://www.scidatacon.org/2016/sessions/53/paper/47/ Scenarios using ‘forward provenance’: Data esteem measured by reported reuse Properties of derived data Properties of agents involved in derived data production The methods used in derived data production Mechanics: Demo code at http://promsns.org/repo/prov-data-fitness

SciDataCon2016 – My Paper The repo: DOIs, Provenance & Vocabs

SciDataCon2016 – My Paper Scenario 2: Properties of agents involved in data production “…the function assess_min_drep_points() finds target data’s ancestor data and then finds the agents (people) associated with those ancestors. It then determines fitness based on owners’ “dataOwnerRepPoints”, an invented measure of a data owner’s reputation. It passes … when the minimum dataOwnerRepPoints for an ancestor is set to 3 and fails when set to 6 as the ancestors have either 5 or 10 points.”

SciDataCon2016 – My Paper Scenario 2: Properties of agents involved in data production “…the function assess_min_drep_points() finds target data’s ancestor data and then finds the agents (people) associated with those ancestors. It then determines fitness based on owners’ “dataOwnerRepPoints”, an invented measure of a data owner’s reputation. It passes … when the minimum dataOwnerRepPoints for an ancestor is set to 3 and fails when set to 6 as the ancestors have either 5 or 10 points.” Appeal to authority (might be fine) Potentially many ways to represent value of agents

SciDataCon2016 – My Paper Example data in the paper

SciDataCon2016 – My Paper Scenario 2: Properties of agents involved in data production 1 4 wasDerivedFrom 5 2 3 Derivation of Dataset 5 from 4 & 3, in turn from 1 & 2

SciDataCon2016 – My Paper Scenario 2: Properties of agents involved in data production Nick Car 1 4 wasAttributedTo 5 2 3 Attribution of Datasets to Agents Santa Claus

SciDataCon2016 – My Paper Scenario 2: Properties of agents involved in data production dataRepPoints 5 Nick Car 1 4 5 2 3 Attribution of Agents with “dataRepPoints” Santa Claus dataRepPoints 10

SciDataCon2016 – My Paper Scenario 2: Properties of agents involved in data production Reject Dataset 5 datasets where any ancestor was produced by an Agent with fewer than 5 dataRepPoints: ASK { :d5 prov:wasDerivedFrom+ ?ancestor . ?ancestor prov:wasAttributedTo ?agent . ?agent drep:dataRepPoints ?pnts . FILTER (?pnts < 5) . }

SciDataCon2016 – My Paper Scenario 4: The methods used in derived data production Only the function for ancestor methods given in paper’s code Principle the same for descendant's’ methods: graph following

SciDataCon2016 – My Paper Scenario 4: The methods used in derived data production used wasGeneratedBy Subj. Activity M Descendent used Method X This is how PROV would allow the association of a method with a dataset’s production

SciDataCon2016 – My Paper Scenario 4: The methods used in derived data production used wasGeneratedBy Subj. Activity M Descendent used Method X Class: Plan Indicated instructions, not ‘pure’ data

If we had methods catalogued… SciDataCon2016 – My Paper Scenario 4: The methods used in derived data production If we had methods catalogued… used wasGeneratedBy Subj. Activity M Descendent Collection used Method X Class: Plan

SciDataCon2016 – My Paper Scenario 4: The methods used in derived data production Only the function for ancestor methods given in my repo Principle the same for descendant's’ methods: graph following

SciDataCon2016 – My Paper Scenario 4: The methods used in derived data production Only the function for ancestor methods given in my repo Principle the same for descendant's’ methods: graph following Using PROV-O, this could be a SPARQL query: Was data generated using the subject dataset and Dataset x? ASK { ?activity_m prov:used ?dataset_subject ; prov:used <Method_X_URI> . }

This is how we at GA actually can store the relevant associations SciDataCon2016 – My Paper Scenario 4: The methods used in derived data production General Pattern Within GA’s Catalogue used wasGeneratedBy wasDerivedFrom Subj. Activity M Descendent Subj. Descendent used Method X Method X wasInfluencedBy This is how we at GA actually can store the relevant associations

SciDataCon2016 – My Paper Scenario 4: The methods used in derived data production Within GA’s Catalogue eCat wasDerivedFrom Subj. Descendent Code Repos Method X wasInfluencedBy Doc Repo Github, Bitbucket & document repos + catalogue used to store the relevant artefacts

SciDataCon2016 – the reason for my paper The session: Data Fitness http://www.scidatacon.org/2016/sessions/53/ “Data fitness is multifaceted and covers various aspects related to a dataset such as its level of annotation, curation, peer review, and citability.”

SciDataCon2016 – the reason for my paper The session: Data Fitness http://www.scidatacon.org/2016/sessions/53/ “Data fitness is multifaceted and covers various aspects related to a dataset such as its level of annotation, curation, peer review, and citability.” I was worried about a discussion of the properties of the dataset only I was concerned about Yet Another Metadata Model… We really assess dataset’s reuse potential in context I think of context as a graph

SciDataCon2016 – the reason for my paper The session: Data Fitness http://www.scidatacon.org/2016/sessions/53/ Other papers in that session: Ensuring and Improving Information Quality for Earth Science Data and Products – Role of the ESIP Information Quality Cluster All of you! (Rama, Ge Peng, David Moroni & Shie Chung-Lin) Discussion about the IQC, scientific, product, stewardship & service quality, how ESIP agencies are expressing quality, steps now being taken to improve.

SciDataCon2016 – the reason for my paper The session: Data Fitness http://www.scidatacon.org/2016/sessions/53/ Other papers in that session: DSA and FAIR: The best guarantee for the open and long term accessibility of research data. Ingrid Dillo, Rob Hooft Description of DSA & FAIR principles, applicability & a call for wider use

SciDataCon2016 – the reason for my paper The session: Data Fitness http://www.scidatacon.org/2016/sessions/53/ Other papers in that session: DSA and FAIR: The best guarantee for the open and long term accessibility of research data. Ingrid Dillo, Rob Hooft Description of DSA & FAIR principles, applicability & a call for wider use Sounds fine. I’ve just applied for the DSA for our repo

SciDataCon2016 – the reason for my paper The session: Data Fitness http://www.scidatacon.org/2016/sessions/53/ Other papers in that session: How do we define data usability? Michael Diepenbroek, Markus Stocker Inherent (measurable) & non-inherent (subjective) properties of data, application-specific quality. Different sorts of quality, as per ICQ ideas (access, management etc.), how to describe management quality.

SciDataCon2016 – the reason for my paper The session: Data Fitness http://www.scidatacon.org/2016/sessions/53/ Other papers in that session: How do we define data usability? Michael Diepenbroek, Markus Stocker Inherent (measurable) & non-inherent (subjective) properties of data, application-specific quality. Different sorts of quality, as per ICQ ideas (access, management etc.), how to describe management quality. I see Michael was your last speaker, I saw Markus in Australia 2 weeks ago I don’t dispute their goals but I do want to make specific background modelling (and by implication, system) choices in order to present best flexibility of particular assessment regimes in the future – use a graph!

SciDataCon2016 – the reason for my paper The session: Data Fitness http://www.scidatacon.org/2016/sessions/53/ Other papers in that session: Developing a badge system for data fitness Helena Cousijn Further promoting approaches taken by the DSA to rate data repos.

SciDataCon2016 – the reason for my paper The session: Data Fitness http://www.scidatacon.org/2016/sessions/53/ Other papers in that session: Developing a badge system for data fitness Helena Cousijn Further promoting approaches taken by the DSA to rate data repos. Sounds fine

SciDataCon2016 – the reason for my paper The session: Data Fitness http://www.scidatacon.org/2016/sessions/53/ Other papers in that session: Data Selection: Survival of the Fittest Pesant Stéphane Presentation of datasets in PANGEA. Key message: that DFRA can vary a lot so quality should be assessed at a very granular level and aggregated to higher levels in the form of stats.

SciDataCon2016 – the reason for my paper The session: Data Fitness http://www.scidatacon.org/2016/sessions/53/ Other papers in that session: Data Selection: Survival of the Fittest Pesant Stéphane Presentation of datasets in PANGEA. Key message: that DFRA can vary a lot so quality should be assessed at a very granular level and aggregated to higher levels in the form of stats. A graph layer approach may help with this as graph info can be aggregated/abstracted from

SciDataCon2016 – the reason for my paper The session: Data Fitness http://www.scidatacon.org/2016/sessions/53/ Reflection: I think the paper, as presented in 2016, was valuable in making the case for: Storing more than just properties of datasets Also associations between them, each other, and other things That upper data models can be used (PROV) even when there are domain-specific requirements like multiple possible DFRA objects & queries

Recent thoughts on DFRA There are many initiatives that can assist with DFRA: Creating citable objects (RDA, FORCE11) Creating attribution graphs (Scholix, RDA/TDWG) Generation of specific metadata standards Perhaps based on ISO 19157:2013 IQC …

Recent thoughts on DFRA There are many initiatives that can assist with DFRA I will be focussing on the ‘pipeline’ to help people expose all the information that could be useful for DFRA, in standardised ways

Recent thoughts on DFRA There are many initiatives that can assist with DFRA I will be focussing on the ‘pipeline’ to help people expose all the information that could be useful for DFRA, in standardised ways The more standardised the mechanisms, the more attention can be placed on specific methods of DFRA

Recent thoughts on DFRA There are many initiatives that can assist with DFRA I will be focussing on the ‘pipeline’ to help people expose all the information that could be useful for DFRA, in standardised ways The more standardised the mechanisms, the more attention can be placed on specific methods of DFRA Say hello to the Provenance Patterns WG!

Recent thoughts on DFRA Provenance Patterns WG https://www.rd-alliance.org/groups/provenance-patterns-wg Creating patterns for DFRA is a high priority as provenance is all about trust Activities: Common provenance Use Cases Provenance design patterns Sharing provenance Strategies for enterprise provenance management Tools for provenance Provenance data collections

Recent thoughts on DFRA Provenance Patterns WG https://www.rd-alliance.org/groups/provenance-patterns-wg Creating patterns for DFRA is a high priority as provenance is all about trust Activities: Common provenance Use Cases Provenance design patterns Sharing provenance Strategies for enterprise provenance management Tools for provenance Provenance data collections Many of these are relevant to DFRA, that’s no accident…

Recent thoughts on DFRA Provenance Patterns WG https://www.rd-alliance.org/groups/provenance-patterns-wg Creating patterns for DFRA is a high priority as provenance is all about trust Patterns output: Scenario 4 presented as a pattern: http://patterns.promsns.org/pattern/4

Recent thoughts on DFRA http://patterns.promsns.org/pattern/4

Recent thoughts on DFRA Provenance Patterns WG https://www.rd-alliance.org/groups/provenance-patterns-wg Creating patterns for DFRA is a high priority as provenance is all about trust Patterns output: Scenario 4 presented as a pattern: http://patterns.promsns.org/pattern/4 I will be able to formulate some reuse questions as provenance Use Cases… with your input!

Recent thoughts on DFRA There are many initiatives that can assist with DFRA I will be focussing on the ‘pipeline’ to help people expose all the information that could be useful for DFRA, in standardised ways I am seeking input from provenance users – you – in order to provide the best Patterns for you and others

Recent thoughts on DFRA There are many initiatives that can assist with DFRA I will be focussing on the ‘pipeline’ to help people expose all the information that could be useful for DFRA, in standardised ways I am seeking input from provenance users – you – in order to provide the best Patterns for you and others Please contribute to the PP WG via UC submission

Data Reuse Fitness Assessment Using Provenance Nicholas Car Data Architect Geoscience Australia nicholas.car@ga.gov.au Prov Patterns WG: https://www.rd-alliance.org/groups/provenance-patterns-wg Prov Patterns DB: http://patterns.promsns.org/