Sebastian Neumaier Advisor: Univ.Prof. Dr. Axel Polleres Co-Advisor:

Presentation transcript:

Open Data Quality
Assessment and Evolution of (Meta-)Data Quality in the Open Data Landscape
Sebastian Neumaier, seb.neumaier@gmail.com
Advisor: Univ.Prof. Dr. Axel Polleres
Co-Advisor: Dr. Jürgen Umbrich

Contents
Preliminaries: Open Data Landscape and Portals
Problem Statement and Motivation
Quality Metrics
Automated Quality Assessment Framework
Findings
Conclusion and Future Work

What is Open Data?
Freely available data, published in an open and machine-readable format, which allows everybody to do everything without restrictions at any time.
Freely available: open access, preferably on the WWW
Open and machine-readable format: e.g., CSV, JSON, RDF
Everybody: private, non-commercial and commercial
Everything: open license which allows use, reuse, modification, redistribution
At any time: 24/7
Open Knowledge Foundation: founded in 2004 in GB. Goals: also Open Science
See more at: http://opendefinition.org/okd/

The Open Data Landscape
Cities, international organizations, national and European portals
GB is leading in Europe, but there are also multiple Austrian data portals
Portals (TODO: group by): CKAN, Socrata, other data management systems

Open Data Portals
Single point of access
Metadata: licenses, provenance, formats, …
[Figure: an Open Data Portal containing datasets; each dataset carries metadata (title, license, ...) and resources (e.g., CSV, JSON, XML)]
Typical software
CKAN: open-source portal by OKF
Socrata: a company offering portal software and hosting of data, mainly in the US
OpenDataSoft: small France-based company, very similar to Socrata

E.g.: data.gv.at
Open Data Portal by the Austrian Government

CKAN Metadata (JSON)
Key types: core keys, resource keys, extra keys
TODO
d: {
  "license_title": "Creative Commons Namensnennung",
  "maintainer": "Stadtvermessung Graz",
  "author": "",
  "author_email": "stadtvermessung@stadt.graz.at",
  "resources": [
    {
      "size": "6698",
      "format": "CSV",
      "mimetype": "",
      "url": "http://data.graz.gv.at/.../Bibliothek.csv"
    }
  ],
  "tags": [ "bibliothek", "geodaten", "graz", "kultur", "poi" ],
  "license_id": "CC-BY-3.0",
  "organization": null,
  "name": "bibliotheken",
  "notes": "Standorte der städtischen Bibliotheken...",
  "extras": {
    "Sprache des Metadatensatzes": "ger/deu Deutsch"
  },
  "license_url": "http://creativecommons.org/.../by/3.0/at/",
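The three key types in the record above can be told apart mechanically. The following is a minimal sketch, assuming the record shape shown on this slide (top-level keys are core keys, keys inside "resources" entries are resource keys, keys inside "extras" are extra keys). The function name `classify_keys` and the example.org URL are illustrative assumptions, not part of the original framework.

```python
def classify_keys(dataset):
    """Split the keys of a CKAN-style dataset dict into core, resource and extra keys."""
    core, resource, extra = set(), set(), set()
    for key, value in dataset.items():
        if key == "resources":
            # Each resource is a dict; collect the keys used by any resource.
            for res in value or []:
                resource.update(res.keys())
        elif key == "extras":
            # Assumes "extras" is a plain dict, as on the slide.
            extra.update((value or {}).keys())
        else:
            core.add(key)
    return core, resource, extra


# Shortened version of the slide's record; the URL is a placeholder.
dataset = {
    "license_id": "CC-BY-3.0",
    "name": "bibliotheken",
    "author": "",
    "resources": [
        {"size": "6698", "format": "CSV", "mimetype": "",
         "url": "http://example.org/Bibliothek.csv"}
    ],
    "extras": {"Sprache des Metadatensatzes": "ger/deu Deutsch"},
}

core_keys, resource_keys, extra_keys = classify_keys(dataset)
print(sorted(core_keys), sorted(resource_keys), sorted(extra_keys))
```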

What is the Problem?
There are concerns about quality issues on data portals [1]:
Metadata
  Missing values
  Incorrect values
  No contact info
  Wrong/missing file format description
Resources
  Changing URLs
  Formats (e.g., CSV not RFC 4180 compliant -> delimiters [,;\t#])
  Encoding (e.g., mixed)
[1] http://www.business2community.com/big-data/open-data-risk-poor-data-quality-01010535
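To illustrate the CSV format issue, here is a small sketch that guesses which of the delimiters named above (`,`, `;`, tab, `#`) a file uses. It is a deliberately naive heuristic written for this example (it ignores quoted fields) and is not the method used by the framework; `guess_delimiter` is a hypothetical helper name.

```python
CANDIDATES = [",", ";", "\t", "#"]


def guess_delimiter(sample, candidates=CANDIDATES):
    """Return the candidate that occurs equally often (and at least once) on every line."""
    lines = [ln for ln in sample.splitlines() if ln.strip()]
    best, best_count = None, 0
    for delim in candidates:
        counts = {ln.count(delim) for ln in lines}
        if len(counts) == 1:  # same number of occurrences on every line
            (count,) = counts
            if count > best_count:
                best, best_count = delim, count
    return best


sample = "id;name;city\n1;Anna;Graz\n2;Ben;Wien\n"
print(repr(guess_delimiter(sample)))
```

A file for which this returns anything other than a comma is readable, but it deviates from the comma-separated layout described in RFC 4180.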

Hypothesis
Objective quality metrics discover, point out and measure quality and heterogeneity issues in data portals.
An automated quality assessment framework monitors and assesses the evolution of quality metrics over time.
Approach:
  Find a set of suitable quality metrics
  Monitor the quality of data portals over time
Outcome:
  Current quality of open data
  Development and growth rate of open data
  Impact of improvement initiatives

Quality Metrics

Metrics
Dimension: Description
Retrievability: The extent to which metadata and resources can be retrieved.
Usage: The extent to which available metadata keys are used to describe a dataset.
Completeness: The extent to which the used metadata keys are non-empty.
Accuracy: The extent to which certain metadata values accurately describe the resources.
Openness: The extent to which licenses and file formats conform to the open definition.
Contactability: The extent to which the data publisher provides contact information.
Objective measures which can be automatically computed in a scalable way

Concrete Metrics (1/2)
Retrievability: HTTP GET lookup for datasets (API) and resources
Usage: Ratio of used keys to all identified keys (on a data portal)
Completeness: Ratio of non-empty keys in a dataset
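The two ratios above can be sketched in a few lines. This is one plausible reading of the slide, not the formal definition from the talk: a key counts as "used" when it appears in the dataset, and as "empty" when its value is `None`, an empty string, an empty list or an empty dict. The dataset `d2` is made up for the example.

```python
def is_empty(value):
    """Treat None, '', [] and {} as empty values (an assumption of this sketch)."""
    return value is None or value == "" or value == [] or value == {}


def usage(dataset, portal_keys):
    """Ratio of keys used by the dataset to all keys identified on the portal."""
    portal_keys = set(portal_keys)
    if not portal_keys:
        return 0.0
    return len(set(dataset) & portal_keys) / len(portal_keys)


def completeness(dataset):
    """Ratio of non-empty keys among the keys the dataset uses."""
    if not dataset:
        return 0.0
    non_empty = sum(1 for value in dataset.values() if not is_empty(value))
    return non_empty / len(dataset)


d1 = {"name": "bibliotheken", "author": "", "license_id": "CC-BY-3.0", "organization": None}
d2 = {"name": "parks", "maintainer": "City"}  # hypothetical second dataset

# All keys identified on the (two-dataset) portal.
portal_keys = set(d1) | set(d2)

print(usage(d1, portal_keys), completeness(d1))
print(usage(d2, portal_keys), completeness(d2))
```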

Concrete Metrics (2/2)
Openness:
  Licenses: map to the list by opendefinition.org
  Formats: pre-defined set of file formats, e.g., CSV, XML, …
Contactability: Availability of contact information: (i) text, (ii) URL, (iii) email
Accuracy: Formats, file size, MIME type; currently based on the respective HTTP response header fields
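A rough sketch of the openness and contactability checks follows. The sets `OPEN_LICENSE_IDS` and `OPEN_FORMATS` are small hypothetical stand-ins for the opendefinition.org license list and the pre-defined format set, and the regular expressions are simple approximations chosen for this example.

```python
import re

# Hypothetical stand-ins; the real lists are longer and maintained elsewhere.
OPEN_LICENSE_IDS = {"CC-BY-4.0", "CC0-1.0"}
OPEN_FORMATS = {"csv", "json", "xml", "rdf"}

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
URL_RE = re.compile(r"^https?://\S+$", re.IGNORECASE)


def is_open_license(license_id):
    """True if the license id appears in the (stand-in) list of open licenses."""
    return (license_id or "").strip() in OPEN_LICENSE_IDS


def is_open_format(fmt):
    """True if the format label appears in the (stand-in) set of open formats."""
    return (fmt or "").strip().lower() in OPEN_FORMATS


def contact_kind(value):
    """Classify a contact value as 'email', 'url', 'text', or None if empty."""
    value = (value or "").strip()
    if not value:
        return None
    if EMAIL_RE.match(value):
        return "email"
    if URL_RE.match(value):
        return "url"
    return "text"


print(contact_kind("stadtvermessung@stadt.graz.at"))
print(contact_kind("Stadtvermessung Graz"))
print(is_open_format("CSV"))
```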

Automated QA Framework

Architecture
[Figure: architecture diagram with the components: CKAN, Socrata and OpenDataSoft portals; metadata harvester; resource harvester (HTTP HEAD); MongoDB; quality assessment; dashboard (nodejs); reporting; dumps (json)]
The metadata harvester stores raw metadata in MongoDB
The resource harvester performs HTTP header lookups on resource URLs
The QA component calculates quality metrics
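The resource-harvester step and the header-based accuracy check can be sketched as follows. This is an illustration under assumptions, not the framework's code: `head_headers` issues one HTTP HEAD request with the standard library, and `accuracy_checks` compares a resource's "size" and "mimetype" metadata with the Content-Length and Content-Type headers. The header values in the example are invented.

```python
import urllib.request


def head_headers(url, timeout=10):
    """Perform an HTTP HEAD request and return the response headers with lowercased names."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return {name.lower(): value for name, value in response.headers.items()}


def accuracy_checks(resource, headers):
    """Compare resource metadata with HTTP headers; skip fields missing on either side."""
    checks = {}
    size = resource.get("size")
    length = headers.get("content-length")
    if size and length:
        checks["size"] = str(size).strip() == length.strip()
    mimetype = resource.get("mimetype")
    content_type = headers.get("content-type")
    if mimetype and content_type:
        # Drop parameters such as "; charset=utf-8" before comparing.
        media_type = content_type.split(";")[0].strip().lower()
        checks["mimetype"] = mimetype.strip().lower() == media_type
    return checks


resource = {"size": "6698", "format": "CSV", "mimetype": ""}
# Invented header values; in practice they would come from head_headers(resource["url"]).
headers = {"content-length": "6698", "content-type": "text/csv; charset=utf-8"}
print(accuracy_checks(resource, headers))
```

Because the example resource has an empty "mimetype", only the size is compared; an empty metadata value lowers completeness rather than accuracy in this sketch.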

Open Data Portal Watch
Scalable quality assessment & monitoring framework for Open Data Portals
http://data.wu.ac.at/portalwatch/

Findings

Portals Overview
Based on 126 CKAN data portals
Top 5 (wrt. datasets): [table]
3.12M URL values, 1.92M distinct, 1.91M are syntactically valid URLs
1.1M Content-Length HTTP header fields, resulting in 12.297 TB

Portal Overlap
13% (260K) of the unique resources appear in more than one dataset
12% (227K) of the resources appear in more than one portal
The biggest portals act as parent/harvester portals (e.g., data.gov, publicdata.eu)

Retrievability

Openness
Top 10 licenses and formats over all portals [figure; legend: confirmed open]
Future work: distinction between open and machine-readable formats

Contactability
Contact information in the form of URLs, email addresses, or any value
Very few URLs
35% of the portals have very good contactability
25% have hardly any contact values

Conclusion
Main findings (126 CKAN portals):
High metadata heterogeneity for portal-specific keys/tags
Low confirmed openness (wrt. licenses and formats)
About 80% resource retrievability
Only 35% of the portals have high contactability

Impact
Peer-reviewed publications
  Jürgen Umbrich, Sebastian Neumaier, and Axel Polleres. Quality assessment & evolution of open data portals. In IEEE International Conference on Open and Big Data, Rome, Italy, August 2015.
  Jürgen Umbrich, Sebastian Neumaier, and Axel Polleres. Towards assessing the quality evolution of open data portals. In ODQ2015: Open Data Quality: from Theory to Practice Workshop, Munich, Germany, March 2015.
Follow-up project: “ADEQUATe” [1]
  Develop and evaluate mechanisms to measure, monitor and improve data quality in Open Data
  In cooperation with WU, Danube University Krems and Semantic Web Company
[1] http://www.adequate.at/

Current and Future Work
ADEQUATe: FFG project to develop and evaluate mechanisms to measure, monitor and improve data quality in Open Data
Paper in progress: generalized QA framework

Towards a General QA Framework
More Open Data portals: harvest data from other portal frameworks, e.g., Socrata, OpenDataSoft, …
Metadata homogenization: map metadata keys from different frameworks to the RDF-based DCAT [1]
DCAT-specific quality dimensions: e.g., existence and conformance of access, license or file format information
[1] http://www.w3.org/TR/vocab-dcat/
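The homogenization idea can be illustrated with a small key mapping. The DCAT and Dublin Core property names below (`dct:description`, `dcat:keyword`, `dcat:accessURL`, and so on) exist in the vocabulary, but which CKAN key maps to which property is an assumption of this sketch, as are the names `DATASET_MAP`, `RESOURCE_MAP` and `to_dcat`, and the example.org URL.

```python
# Illustrative mapping only; a real mapping would cover more keys and portal frameworks.
DATASET_MAP = {
    "title": "dct:title",
    "notes": "dct:description",
    "tags": "dcat:keyword",
    "maintainer": "dcat:contactPoint",
}
RESOURCE_MAP = {
    "url": "dcat:accessURL",
    "format": "dct:format",
    "mimetype": "dcat:mediaType",
    "size": "dcat:byteSize",
}


def to_dcat(dataset):
    """Map a CKAN-style dataset dict to a flat DCAT-like dict, dropping empty values."""
    mapped = {prop: dataset[key] for key, prop in DATASET_MAP.items() if dataset.get(key)}
    mapped["dcat:distribution"] = [
        {prop: res[key] for key, prop in RESOURCE_MAP.items() if res.get(key)}
        for res in dataset.get("resources") or []
    ]
    return mapped


def has_access_info(mapped):
    """Existence check: every distribution carries an access URL."""
    distributions = mapped.get("dcat:distribution", [])
    return bool(distributions) and all("dcat:accessURL" in d for d in distributions)


ckan_dataset = {
    "notes": "Standorte der städtischen Bibliotheken...",
    "maintainer": "Stadtvermessung Graz",
    "tags": ["bibliothek", "graz"],
    "resources": [
        {"size": "6698", "format": "CSV", "mimetype": "",
         "url": "http://example.org/Bibliothek.csv"}  # placeholder URL
    ],
}

dcat = to_dcat(ckan_dataset)
print(dcat)
print(has_access_info(dcat))
```

Once records from different frameworks share one vocabulary, an existence check such as `has_access_info` can be written once instead of once per portal software.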

Thank you for your attention.

Backup Slides

Usage & Completeness
Avg. usage and completeness for different keys per portal
Usage: core and resource keys are well established; extra keys can be grouped
Completeness: core keys are "quite" complete; some portals have "unused" extra keys

Accuracy
Datasets with metadata: 27K size, 252K mime type, 625K format
HTTP HEAD: 1.64M
  response header: 1.55M (94.5%)
  content-type: 1.4M (85.4%)
  content-length: 1.1M (67%)

Formal Metrics (1/4) Retrievability: Usage:

Formal Metrics (2/4) Completeness:

Formal Metrics (3/4) Accuracy: Openness:

Formal Metrics (4/4) Contactability:

Portals Detail

Austrian Data Portals
Evolution of datasets and quality metrics
data.gv.at as harvesting portal