Data Reuse Fitness Assessment Using Provenance


1 Data Reuse Fitness Assessment Using Provenance
By Nicholas Car, June 13th, 2017. For the ESIP Information Quality Cluster Teleconference.

2 Outline Purpose of my SciDataCon2016 presentation
Thoughts on DFRA since then

3 Outline Purpose of my SciDataCon2016 presentation The paper itself
The reason for it The major claim as I now see it Thoughts on DFRA since then

4 Outline
- Purpose of my SciDataCon2016 presentation
  - The paper itself
  - The reason for it
  - The major claim as I now see it
- Thoughts on DFRA since then
  - Actually building graphs
  - RDA’s provenance group

5 SciDataCon2016 – My Paper “Data Reuse Fitness Assessment Using Provenance”

6 SciDataCon2016 – My Paper “Data Reuse Fitness Assessment Using Provenance” Premise: Assessing the fitness of data for reuse may require knowledge of how that data was produced - provenance

7 SciDataCon2016 – My Paper “Data Reuse Fitness Assessment Using Provenance” Premise: Assessing the fitness of data for reuse may require knowledge of how that data was produced - provenance Posited solution: Use a standard for provenance (PROV) to enable automated assessments of data fitness

8 SciDataCon2016 – My Paper “Data Reuse Fitness Assessment Using Provenance”
Premise: Assessing the fitness of data for reuse may require knowledge of how that data was produced - provenance
Posited solution: Use a standard for provenance (PROV) to enable automated assessments of data fitness
Assessment scenarios:
- Properties of ancestor data
- Properties of agents involved in data production
- The methods used in data production

9 SciDataCon2016 – My Paper “Data Reuse Fitness Assessment Using Provenance”
Scenarios using ‘forward provenance’:
- Data esteem measured by reported reuse
- Properties of derived data
- Properties of agents involved in derived data production
- The methods used in derived data production
Mechanics: Demo code at
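The following is not the paper's demo code. It is a minimal, independent sketch of the first ‘forward provenance’ scenario (data esteem measured by reported reuse), assuming derivations are available as a simple mapping from each derived dataset to its sources. The dataset names and the `reuse_count` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical derivation edges: derived dataset -> datasets it wasDerivedFrom.
WAS_DERIVED_FROM = {
    "d5": ["d4", "d3"],
    "d4": ["d1", "d2"],
}


def build_forward_index(was_derived_from):
    """Invert wasDerivedFrom so each source maps to the datasets derived from it."""
    forward = defaultdict(set)
    for derived, sources in was_derived_from.items():
        for source in sources:
            forward[source].add(derived)
    return forward


def descendants(dataset, forward):
    """Follow the inverted edges to collect all direct and indirect descendants."""
    seen = set()
    stack = list(forward.get(dataset, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(forward.get(current, ()))
    return seen


def reuse_count(dataset, was_derived_from):
    """A crude 'esteem' measure: how many datasets were derived from this one."""
    return len(descendants(dataset, build_forward_index(was_derived_from)))
```

A count like this only reflects reuse that has actually been reported as provenance, so it is a lower bound on real reuse.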

10 SciDataCon2016 – My Paper The repo: DOIs, Provenance & Vocabs

11 SciDataCon2016 – My Paper Scenario 2: Properties of agents involved in data production “…the function assess_min_drep_points() finds target data’s ancestor data and then finds the agents (people) associated with those ancestors. It then determines fitness based on owners’ “dataOwnerRepPoints”, an invented measure of a data owner’s reputation. It passes … when the minimum dataOwnerRepPoints for an ancestor is set to 3 and fails when set to 6 as the ancestors have either 5 or 10 points.”

12 SciDataCon2016 – My Paper Scenario 2: Properties of agents involved in data production
“…the function assess_min_drep_points() finds target data’s ancestor data and then finds the agents (people) associated with those ancestors. It then determines fitness based on owners’ “dataOwnerRepPoints”, an invented measure of a data owner’s reputation. It passes … when the minimum dataOwnerRepPoints for an ancestor is set to 3 and fails when set to 6 as the ancestors have either 5 or 10 points.”
- Appeal to authority (might be fine)
- Potentially many ways to represent value of agents

13 SciDataCon2016 – My Paper Example data in the paper

14 SciDataCon2016 – My Paper Scenario 2: Properties of agents involved in data production [Diagram: Datasets 1 to 5 linked by wasDerivedFrom] Derivation of Dataset 5 from 4 & 3, in turn from 1 & 2

15 SciDataCon2016 – My Paper Scenario 2: Properties of agents involved in data production [Diagram: Datasets 1 to 5 linked by wasAttributedTo to the Agents Nick Car and Santa Claus] Attribution of Datasets to Agents

16 SciDataCon2016 – My Paper Scenario 2: Properties of agents involved in data production [Diagram: Datasets 1 to 5 and their Agents; Nick Car has dataRepPoints 5, Santa Claus has dataRepPoints 10] Attribution of Agents with “dataRepPoints”

17 SciDataCon2016 – My Paper Scenario 2: Properties of agents involved in data production
Reject Dataset 5 if any ancestor was produced by an Agent with fewer than 5 dataRepPoints:
ASK {
    :d5 prov:wasDerivedFrom+ ?ancestor .
    ?ancestor prov:wasAttributedTo ?agent .
    ?agent drep:dataRepPoints ?pnts .
    FILTER (?pnts < 5) .
}
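The same check can be sketched in plain Python without a triple store. This is an illustration only, not the paper's assess_min_drep_points() function: the function name `assess_min_rep_points_sketch`, the exact derivation edges and the dataset-to-agent attributions below are assumptions loosely based on the example slides.

```python
# Hypothetical example data; the exact edges and attributions are assumptions.
WAS_DERIVED_FROM = {
    "d5": ["d4", "d3"],
    "d4": ["d1", "d2"],
}
WAS_ATTRIBUTED_TO = {
    "d1": "nick_car",
    "d2": "santa_claus",
    "d3": "santa_claus",
    "d4": "nick_car",
}
DATA_REP_POINTS = {"nick_car": 5, "santa_claus": 10}


def ancestors(dataset):
    """Collect every dataset reachable via wasDerivedFrom (one or more hops)."""
    seen = set()
    stack = list(WAS_DERIVED_FROM.get(dataset, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(WAS_DERIVED_FROM.get(current, []))
    return seen


def assess_min_rep_points_sketch(dataset, minimum):
    """Pass only if no ancestor is attributed to an agent below `minimum` points."""
    for ancestor in ancestors(dataset):
        agent = WAS_ATTRIBUTED_TO.get(ancestor)
        # Ancestors with no recorded agent are skipped, as in the ASK query.
        if agent is not None and DATA_REP_POINTS.get(agent, 0) < minimum:
            return False
    return True
```

With these assumed values the check passes for a minimum of 3 or 5 and fails for a minimum of 6, which is the behaviour the paper describes for agents holding either 5 or 10 points.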

18 SciDataCon2016 – My Paper Scenario 4: The methods used in derived data production Only the function for ancestor methods is given in the paper’s code. The principle is the same for descendants’ methods: graph following

19 SciDataCon2016 – My Paper Scenario 4: The methods used in derived data production [Diagram: Descendent wasGeneratedBy Activity M; Activity M used Subj.; Activity M used Method X] This is how PROV would allow the association of a method with a dataset’s production

20 SciDataCon2016 – My Paper Scenario 4: The methods used in derived data production [Diagram: Descendent wasGeneratedBy Activity M; Activity M used Subj.; Activity M used Method X] Method X is of class Plan, which indicates instructions, not ‘pure’ data

21 SciDataCon2016 – My Paper Scenario 4: The methods used in derived data production If we had methods catalogued… [Diagram: Descendent wasGeneratedBy Activity M; Activity M used Subj.; Activity M used Method X (class: Plan), shown with a Collection]

22 SciDataCon2016 – My Paper Scenario 4: The methods used in derived data production Only the function for ancestor methods is given in my repo. The principle is the same for descendants’ methods: graph following

23 SciDataCon2016 – My Paper Scenario 4: The methods used in derived data production
Only the function for ancestor methods is given in my repo. The principle is the same for descendants’ methods: graph following
Using PROV-O, this could be a SPARQL query: Was data generated using the subject dataset and Method X?
ASK {
    ?activity_m prov:used ?dataset_subject ;
                prov:used <Method_X_URI> .
}
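As a rough illustration of the same ‘graph following’ idea outside SPARQL, the sketch below stores provenance as plain (subject, predicate, object) tuples and asks whether any activity used both a given dataset and a given method. The triples, identifiers and the `used_with_method` helper are all hypothetical and are not taken from the paper's repo.

```python
# Hypothetical provenance triples: (subject, predicate, object).
TRIPLES = {
    ("activity_m", "prov:used", "dataset_subject"),
    ("activity_m", "prov:used", "method_x"),
    ("descendant", "prov:wasGeneratedBy", "activity_m"),
    ("activity_n", "prov:used", "dataset_subject"),
}


def used_with_method(triples, dataset, method):
    """Return True if at least one activity used both `dataset` and `method`."""
    users_of_dataset = {s for s, p, o in triples if p == "prov:used" and o == dataset}
    users_of_method = {s for s, p, o in triples if p == "prov:used" and o == method}
    return bool(users_of_dataset & users_of_method)
```

Like the ASK query, this only tests for a shared activity; finding which descendant datasets that activity produced would need one more hop along prov:wasGeneratedBy.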

24 SciDataCon2016 – My Paper Scenario 4: The methods used in derived data production
General Pattern: [Diagram: Descendent wasGeneratedBy Activity M; Activity M used Subj.; Activity M used Method X]
Within GA’s Catalogue: [Diagram: Descendent wasDerivedFrom Subj.; Descendent wasInfluencedBy Method X]
This is how we at GA can actually store the relevant associations

25 SciDataCon2016 – My Paper Scenario 4: The methods used in derived data production Within GA’s Catalogue [Diagram: Descendent wasDerivedFrom Subj., both held in eCat; Descendent wasInfluencedBy Method X, held in Code Repos or a Doc Repo] GitHub, Bitbucket & document repos + catalogue are used to store the relevant artefacts

26 SciDataCon2016 – the reason for my paper
The session: Data Fitness “Data fitness is multifaceted and covers various aspects related to a dataset such as its level of annotation, curation, peer review, and citability.”

27 SciDataCon2016 – the reason for my paper
The session: Data Fitness “Data fitness is multifaceted and covers various aspects related to a dataset such as its level of annotation, curation, peer review, and citability.” I was worried about a discussion of the properties of the dataset only I was concerned about Yet Another Metadata Model… We really assess dataset’s reuse potential in context I think of context as a graph

28 SciDataCon2016 – the reason for my paper
The session: Data Fitness Other papers in that session: Ensuring and Improving Information Quality for Earth Science Data and Products – Role of the ESIP Information Quality Cluster All of you! (Rama, Ge Peng, David Moroni & Shie Chung-Lin) Discussion about the IQC, scientific, product, stewardship & service quality, how ESIP agencies are expressing quality, steps now being taken to improve.

29 SciDataCon2016 – the reason for my paper
The session: Data Fitness Other papers in that session: DSA and FAIR: The best guarantee for the open and long term accessibility of research data. Ingrid Dillo, Rob Hooft Description of DSA & FAIR principles, applicability & a call for wider use

30 SciDataCon2016 – the reason for my paper
The session: Data Fitness Other papers in that session: DSA and FAIR: The best guarantee for the open and long term accessibility of research data. Ingrid Dillo, Rob Hooft Description of DSA & FAIR principles, applicability & a call for wider use Sounds fine. I’ve just applied for the DSA for our repo

31 SciDataCon2016 – the reason for my paper
The session: Data Fitness Other papers in that session: How do we define data usability? Michael Diepenbroek, Markus Stocker Inherent (measurable) & non-inherent (subjective) properties of data, application-specific quality. Different sorts of quality, as per IQC ideas (access, management etc.), how to describe management quality.

32 SciDataCon2016 – the reason for my paper
The session: Data Fitness Other papers in that session: How do we define data usability? Michael Diepenbroek, Markus Stocker Inherent (measurable) & non-inherent (subjective) properties of data, application-specific quality. Different sorts of quality, as per IQC ideas (access, management etc.), how to describe management quality. I see Michael was your last speaker, and I saw Markus in Australia 2 weeks ago. I don’t dispute their goals, but I do want to make specific background modelling (and, by implication, system) choices in order to provide the best flexibility for particular assessment regimes in the future – use a graph!

33 SciDataCon2016 – the reason for my paper
The session: Data Fitness Other papers in that session: Developing a badge system for data fitness Helena Cousijn Further promoting approaches taken by the DSA to rate data repos.

34 SciDataCon2016 – the reason for my paper
The session: Data Fitness Other papers in that session: Developing a badge system for data fitness Helena Cousijn Further promoting approaches taken by the DSA to rate data repos. Sounds fine

35 SciDataCon2016 – the reason for my paper
The session: Data Fitness Other papers in that session: Data Selection: Survival of the Fittest Stéphane Pesant Presentation of datasets in PANGAEA. Key message: that DFRA can vary a lot, so quality should be assessed at a very granular level and aggregated to higher levels in the form of stats.

36 SciDataCon2016 – the reason for my paper
The session: Data Fitness Other papers in that session: Data Selection: Survival of the Fittest Stéphane Pesant Presentation of datasets in PANGAEA. Key message: that DFRA can vary a lot, so quality should be assessed at a very granular level and aggregated to higher levels in the form of stats. A graph layer approach may help with this, as graph info can be aggregated/abstracted from

37 SciDataCon2016 – the reason for my paper
The session: Data Fitness Reflection: I think the paper, as presented in 2016, was valuable in making the case for:
- Storing more than just properties of datasets
- Also storing associations between datasets, and between datasets and other things
- That upper data models (PROV) can be used even when there are domain-specific requirements, like multiple possible DFRA objects & queries

38 Recent thoughts on DFRA
There are many initiatives that can assist with DFRA:
- Creating citable objects (RDA, FORCE11)
- Creating attribution graphs (Scholix, RDA/TDWG)
- Generation of specific metadata standards
  - Perhaps based on ISO 19157:2013 IQC

39 Recent thoughts on DFRA
There are many initiatives that can assist with DFRA I will be focussing on the ‘pipeline’ to help people expose all the information that could be useful for DFRA, in standardised ways

40 Recent thoughts on DFRA
There are many initiatives that can assist with DFRA I will be focussing on the ‘pipeline’ to help people expose all the information that could be useful for DFRA, in standardised ways The more standardised the mechanisms, the more attention can be placed on specific methods of DFRA

41 Recent thoughts on DFRA
There are many initiatives that can assist with DFRA.
I will be focussing on the ‘pipeline’ to help people expose all the information that could be useful for DFRA, in standardised ways.
The more standardised the mechanisms, the more attention can be placed on specific methods of DFRA.
Say hello to the Provenance Patterns WG!

42 Recent thoughts on DFRA
Provenance Patterns WG Creating patterns for DFRA is a high priority as provenance is all about trust Activities: Common provenance Use Cases Provenance design patterns Sharing provenance Strategies for enterprise provenance management Tools for provenance Provenance data collections

43 Recent thoughts on DFRA
Provenance Patterns WG
Creating patterns for DFRA is a high priority as provenance is all about trust
Activities:
- Common provenance Use Cases
- Provenance design patterns
- Sharing provenance
- Strategies for enterprise provenance management
- Tools for provenance
- Provenance data collections
Many of these are relevant to DFRA, and that’s no accident…

44 Recent thoughts on DFRA
Provenance Patterns WG Creating patterns for DFRA is a high priority as provenance is all about trust Patterns output: Scenario 4 presented as a pattern:

45 Recent thoughts on DFRA

46 Recent thoughts on DFRA
Provenance Patterns WG Creating patterns for DFRA is a high priority as provenance is all about trust Patterns output: Scenario 4 presented as a pattern: I will be able to formulate some reuse questions as provenance Use Cases… with your input!

47 Recent thoughts on DFRA
There are many initiatives that can assist with DFRA I will be focussing on the ‘pipeline’ to help people expose all the information that could be useful for DFRA, in standardised ways I am seeking input from provenance users – you – in order to provide the best Patterns for you and others

48 Recent thoughts on DFRA
There are many initiatives that can assist with DFRA.
I will be focussing on the ‘pipeline’ to help people expose all the information that could be useful for DFRA, in standardised ways.
I am seeking input from provenance users – you – in order to provide the best Patterns for you and others.
Please contribute to the Provenance Patterns WG via Use Case submission.

49 Data Reuse Fitness Assessment Using Provenance
Nicholas Car, Data Architect, Geoscience Australia
Prov Patterns WG:
Prov Patterns DB:

