Paul J. Morris; Museum of Comparative Zoology

Presentation transcript:

Developing Standards for Data Quality Tests and Assertions using a Fitness for Use Framework
Paul J. Morris; Museum of Comparative Zoology
Lee Belbin; (Convener, BDQ TG2) The Atlas of Living Australia
Abby Benson; US Geological Survey; OBIS
Arthur Chapman; (Convener BDQ IG) Australian Biodiversity Inf. Services
Shelley James; iDigBio; RBG Sydney
Allan Koch Veiga; (Convener BDQ TG1) University of São Paulo
Miles Nicholls; (Convener BDQ TG3) CSIRO
Pieter Provoost; UNESCO
Antonio Mauro Saraiva; (Convener BDQ IG) University of São Paulo
Dmitry Schigel; GBIF
Alexander M. Thompson; iDigBio
Dave Watson; OBIS
John Wieczorek; Museum of Comparative Zoology; VertNet
Paula Zermoglio; University of Buenos Aires

Conveners: Arthur Chapman, Antonio Mauro Saraiva
Task Group 1 – BDQ Framework. Convener: Allan Koch Veiga
Task Group 2 – Tests and Assertions: Standard Set of Tests. Convener: Lee Belbin
Task Group 3 – User Stories/Use Cases. Convener: Miles Nicholls
Proposed Task Group 4 – Vocabularies

User Stories/Use Cases
TG1: Framework
TG2: Standard Tests
TG3: User Stories/Use Cases
TG4 (proposed): Vocabularies
DQ Management: Collections, Databases of Record, Data Custodians
Implementations: GBIF, iDigBio, ALA, VertNet, OBIS, Kurator, etc.

Data Quality & Fitness for Use
Use: What is the purpose or use for which data must have quality?
Data: What kind of data are relevant and must have quality in the context of the Use?
Fitness: What constitutes “fitness” for the relevant Data in the context of the Use?

Amendments 12/15/75 → 1975-12-15
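
The amendment on this slide can be sketched as a small function that proposes an ISO 8601 value for a verbatim date. This is only an illustration, not any aggregator's actual implementation: the function name `amend_short_date`, the rule of amending only when the day/month order is unambiguous, and the default century of 1900 are all assumptions made for the example.

```python
from datetime import date
from typing import Optional


def amend_short_date(value: str, century: int = 1900) -> Optional[str]:
    """Propose an ISO 8601 date for an N/N/YY string, or None if no safe amendment exists.

    Assumption: a two-digit year belongs to `century` (1900 by default).
    """
    parts = value.strip().split("/")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return None
    if len(parts[2]) != 2:
        return None  # only two-digit years are handled in this sketch
    first, second, yy = (int(p) for p in parts)

    # Amend only when exactly one of the first two numbers can be a month,
    # so the day/month order is unambiguous.
    if first <= 12 < second:
        month, day = first, second
    elif second <= 12 < first:
        month, day = second, first
    else:
        return None

    try:
        return date(century + yy, month, day).isoformat()
    except ValueError:
        return None  # e.g. month 0 or day 31 in a 30-day month


print(amend_short_date("12/15/75"))  # 1975-12-15
print(amend_short_date("01/02/75"))  # None: could be 2 January or 1 February
```

Returning None for ambiguous input reflects the idea that an amendment is a proposed change, so the sketch declines to guess when the verbatim value supports more than one reading.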

Fitness: Data and Use
occurrenceId: urn:uuid:a3……..
eventDate: 1970
Use: Phenology. eventDate must have a resolution of a day or better. Not fit for this use.
Use: Change in range over 100 year timescale. eventDate must have a resolution of a year or better. Fit for this use.
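
The same record being fit for one use and unfit for another can be shown with a short sketch. The names `event_date_resolution` and `is_fit_for_use` are hypothetical, and the sketch assumes eventDate is a plain YYYY, YYYY-MM, or YYYY-MM-DD string; date ranges and times are not handled.

```python
import re
from typing import Optional

# Finer resolutions get smaller ranks, so "day or better" means rank <= rank of "day".
RESOLUTION_RANK = {"day": 0, "month": 1, "year": 2}

# Checked in order from finest to coarsest. Only the shape is checked here,
# not whether the month or day numbers are valid.
_PATTERNS = [
    ("day", re.compile(r"\d{4}-\d{2}-\d{2}")),
    ("month", re.compile(r"\d{4}-\d{2}")),
    ("year", re.compile(r"\d{4}")),
]


def event_date_resolution(event_date: str) -> Optional[str]:
    """Return 'day', 'month', or 'year' for a simple ISO date string, else None."""
    text = event_date.strip()
    for name, pattern in _PATTERNS:
        if pattern.fullmatch(text):
            return name
    return None


def is_fit_for_use(event_date: str, required_resolution: str) -> bool:
    """True if eventDate has the required resolution or better."""
    found = event_date_resolution(event_date)
    if found is None:
        return False
    return RESOLUTION_RANK[found] <= RESOLUTION_RANK[required_resolution]


print(is_fit_for_use("1970", "day"))   # False: not fit for phenology
print(is_fit_for_use("1970", "year"))  # True: fit for range change over 100 years
```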

Framework/TG Process
Diagram labels: Framework (TG1); DQ Assurance (filtering); GBIF; TG3: User Story, Use Case; DQ Profile; DQ Solutions; DQ Report; DQ Control (improvement); Tests (TG2); GBIF, OBIS, iDigBio, ALA, VertNet, Kurator; Controlled Vocabularies (TG4)

Reporting on Data Quality
Data Quality Validation Report
Needs:
  Criterion: Date collected must be in ISO format.
  Information element: dwc:eventDate
  Dimension: Conformance
  Resource Type: Single Record
Solutions:
  Specification: dwc:eventDate parses as an ISO Date
  Mechanism: event_date_qc v1.0.3
Report:
  Result: Compliant
  Status: Asserted
  Detail: “1884-10” is a valid ISO date.
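
One way to picture the three parts of this report (Needs, Solutions, Report) is as a small data structure. The class and field names below are assumptions chosen to mirror the labels on the slide; they are not taken from the framework itself or from any implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Needs:
    criterion: str
    information_element: str
    dimension: str
    resource_type: str


@dataclass(frozen=True)
class Solution:
    specification: str
    mechanism: str


@dataclass(frozen=True)
class Assertion:
    result: str
    status: str
    detail: str


@dataclass(frozen=True)
class ValidationReport:
    needs: Needs
    solution: Solution
    assertion: Assertion


def summarize(report: ValidationReport) -> str:
    """One-line summary: which element, which dimension, and the outcome."""
    return (
        f"{report.needs.information_element} [{report.needs.dimension}]: "
        f"{report.assertion.result} ({report.assertion.status})"
    )


# The values are those shown on the slide.
report = ValidationReport(
    needs=Needs(
        criterion="Date collected must be in ISO format.",
        information_element="dwc:eventDate",
        dimension="Conformance",
        resource_type="Single Record",
    ),
    solution=Solution(
        specification="dwc:eventDate parses as an ISO Date",
        mechanism="event_date_qc v1.0.3",
    ),
    assertion=Assertion(
        result="Compliant",
        status="Asserted",
        detail='"1884-10" is a valid ISO date.',
    ),
)

print(summarize(report))  # dwc:eventDate [Conformance]: Compliant (Asserted)
```

Keeping the need, the solution, and the result as separate parts means the same criterion can be reported on by different mechanisms without changing what is being asked of the data.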


Why Controlled Vocabularies?
What we find in the data:
reproductiveCondition: 40,838 distinct values
lifeStage: 33,402 distinct values
country: 80,408 distinct values (expected = 250)
GBIF Distinct values: https://tinyurl.com/zhnnyy4
Courtesy of Tim Robertson

Controlled Vocabularies
Who would benefit from having Controlled Vocabularies?
Data producers (e.g., collectors) could capture data using pick lists and could impart valuable information more efficiently.
Data custodians (e.g., museum collections) could manage, provide and use data more efficiently.
Data aggregators: data quality assessment infrastructure for data filtering.
Data users: more effective filtering and data discovery.
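
A rough sketch of how a controlled vocabulary collapses many verbatim spellings into a few terms. The vocabulary entries and sample values here are invented for illustration; they are not an actual Darwin Core or GBIF vocabulary, and `to_controlled` is a hypothetical helper.

```python
from typing import Dict, Optional

# Hypothetical lifeStage vocabulary: normalized verbatim value -> controlled term.
LIFE_STAGE_VOCAB: Dict[str, str] = {
    "adult": "adult",
    "adults": "adult",
    "ad.": "adult",
    "juvenile": "juvenile",
    "juv": "juvenile",
    "juv.": "juvenile",
}


def to_controlled(value: str, vocab: Dict[str, str]) -> Optional[str]:
    """Map a verbatim value to its controlled term, or None if it is not recognized."""
    key = " ".join(value.strip().lower().split())  # trim, lowercase, collapse spaces
    return vocab.get(key)


# Invented sample of verbatim values as they might appear across records.
verbatim = ["Adult", "adult ", "ADULT", "juv.", "Juvenile"]

distinct_verbatim = set(verbatim)
distinct_controlled = {to_controlled(v, LIFE_STAGE_VOCAB) for v in verbatim}

print(len(distinct_verbatim))    # 5
print(len(distinct_controlled))  # 2
```

Values that are not in the vocabulary map to None rather than being guessed at, so they can be reported for review instead of being silently changed.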

(Tools come and go) Interoperability problem.

(Amendments, Validations, Measures)

A Test: is dwc:day in range 1-31?
#: 12a
GUID: 48aa7d66-36d1-4662-a503-df170f11b03f
IDs:
Variable: DAY_INVALID
Description (warning/error): The value given for event day is less than 1 or greater than 31
Description (test - PASS): The value given for event day is between 1 and 31
Specification (Technical): day is < 1 or > 31
Record Resolution: SingleRecord
Term Resolution: SingleTerm
Data Dependency: Internal
Output Type: Validation
Example: day=32
Darwin Core Class: Event
Darwin Core Terms: day
DQ Dimension: Conformance
Severity: Warning
Source: ALA
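
The test described above could be implemented along the following lines. This is a minimal sketch and not the ALA or Kurator code: the function name `validate_day_in_range` is hypothetical, and treating an empty value as DATA_PREREQUISITES_NOT_MET is an assumption of this example.

```python
from typing import Optional, Tuple


def validate_day_in_range(day: Optional[str]) -> Tuple[str, str]:
    """Validate a dwc:day value; returns (result, detail).

    Flags the value when it is not an integer or when day < 1 or day > 31.
    """
    if day is None or str(day).strip() == "":
        # Assumption: an empty value cannot be tested at all.
        return ("DATA_PREREQUISITES_NOT_MET", "No value provided for day.")

    text = str(day).strip()
    try:
        value = int(text)
    except ValueError:
        value = None

    if value is not None and 1 <= value <= 31:
        return (
            "COMPLIANT",
            f"Provided value for day '{text}' is an integer in the range 1 to 31.",
        )
    return (
        "NOT_COMPLIANT",
        f"Provided value for day '{text}' is not an integer in the range 1 to 31.",
    )


print(validate_day_in_range("32")[0])  # NOT_COMPLIANT
print(validate_day_in_range("15")[0])  # COMPLIANT
```

The test reads a single term from a single record and needs no outside data, which matches the SingleRecord, SingleTerm, and Internal entries in the description.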

An example data record
occurrenceID: urn:uuid:205f3f79-512a-44a8-be35-76eef7b89c5d
verbatimEventDate: 0/10/1973
day: 0
month: 10
year: 1973
eventDate: 1973-10

Some tests
Record: day: 0, month: 10, year: 1973, eventDate: 1973-10
Day In Range = NOT_COMPLIANT
Month In Range = COMPLIANT
EventDate Correctly Formatted = COMPLIANT
EventDate precision calendar year or better = COMPLIANT
Day Consistent With Month/Year = DATA_PREREQUISITES_NOT_MET
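
The DATA_PREREQUISITES_NOT_MET result deserves a closer look: the consistency test cannot run when its inputs are themselves out of range. A minimal sketch of that behavior follows, with a hypothetical function name and without any claim to match the Kurator implementation.

```python
import calendar


def day_consistent_with_month_year(day: str, month: str, year: str) -> str:
    """Check that day exists in the given month and year.

    Returns DATA_PREREQUISITES_NOT_MET when any input is missing, is not an
    integer, or is outside its basic range, since the test cannot run then.
    """
    try:
        d, m, y = int(day), int(month), int(year)
    except (TypeError, ValueError):
        return "DATA_PREREQUISITES_NOT_MET"

    # calendar.monthrange supports years 1 to 9999 only.
    if not (1 <= d <= 31 and 1 <= m <= 12 and 1 <= y <= 9999):
        return "DATA_PREREQUISITES_NOT_MET"

    days_in_month = calendar.monthrange(y, m)[1]
    return "COMPLIANT" if d <= days_in_month else "NOT_COMPLIANT"


# The example record: day 0 fails Day In Range, so consistency cannot be tested.
print(day_consistent_with_month_year("0", "10", "1973"))  # DATA_PREREQUISITES_NOT_MET
print(day_consistent_with_month_year("30", "2", "1973"))  # NOT_COMPLIANT
```

Separating "could not be tested" from "tested and failed" keeps one defect (day = 0) from being reported twice under different tests.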

More Details
Criterion in Context: Day In Range, Single Record
Information Element: dwc:day
DataResource: Single Record: urn:uuid:205f3f79-512a-44a8-be35-76eef7b89c5d
Result: NOT_COMPLIANT
Details: Provided value for day '0' is not an integer in the range 1 to 31.
Mechanism: Kurator: DwCEventDQ v1.0.3

Kurator Visualization
Month In Range: COMPLIANT. Provided value for month '10' is an integer in the range 1 to 12.
EventDate precision Julian year or better. Provided value for eventDate '1973-10-01' has a duration less than or equal to one Julian year of 365.25 days.
EventDate precision calendar year or better. Provided value for eventDate '1973-10-01' does not contain a leap day and has a duration less than or equal to one calendar year of 365 days.
Day Consistent With Month/Year: DATA_PREREQUISITES_NOT_MET. Provided value for day 0 is outside the range 1-31, unable to test.
Day In Range: NOT_COMPLIANT. Provided value for day '0' is not an integer in the range 1 to 31.

FP-Akka
Placopecten magellanicus Gmelin, 1791 → FP-Akka → Placopecten magellanicus (Gmelin, 1791)
WAS: Gmelin, 1791; CHANGED TO: (Gmelin, 1791)
Found accepted name Placopecten magellanicus. Source: Catalog of Life. Authorship: Differs only in Parentheses. Authorship Similarity: 0.833
Data? Defect Source? [TG4] Process? DB Software? Goals? [TG3] DQ Software? [TG2]
