Quality Assessment Methodologies for Linked Open Data Student: Walter Travassos Sarinho Ontologies and Semantic Web Professors: Fred Freitas and Bernadette Farias Lóscio

Summary Introduction What is Data Quality? The Data Quality Problem Quality Dimensions Relation between dimensions Conclusions

Introduction An overview of quality criteria for Linked Data sources. The Web of Data comprises close to 50 billion facts represented as RDF triples [Zaveri et al. 2012], encouraging further experimentation and the development of new approaches focused on data quality.

What is Data Quality? Fitness for use [Wang et al. 1996] [Juran 1974] [Knight and Burn 2005]. The key challenge is to determine data quality with respect to a particular use case.

The Data Quality Problem Datasets published on the Web of Data already cover a diverse set of domains; however, the data in this environment reveals a large variation in quality. For example, data extracted from semi-structured or even unstructured sources, such as DBpedia, often contains inconsistencies as well as misrepresented and incomplete information [Zaveri et al. 2012].

The Data Quality Problem There are already many methodologies and frameworks available for assessing data quality, all addressing different aspects of this task by proposing appropriate methodologies, measures and tools. [...] However, data quality on the Web of Data also includes a number of novel aspects, such as coherence via links to external datasets, data representation quality, or consistency with regard to implicit information [Zaveri et al. 2012].

The Data Quality Problem Although quality is an essential concept in LOD, few efforts are currently in place to standardize how quality tracking and assurance should be implemented. Moreover, there is no consensus on how data quality dimensions and metrics should be defined [Zaveri et al. 2012].

The Data Quality Problem [Bizer et al. 2009] Web-based information systems that integrate information from different providers. [Mendes et al. 2012] The problem is related to values being in conflict between different data sources as a consequence of the diversity of the data. [Flemming 2010] Does not provide a definition but implicitly explains the problem in terms of data diversity. [Hogan et al. 2012] The authors discuss errors, noise, and difficulties in accessing the information.

Data Quality Dimensions and Metrics Dimensions are rather abstract concepts. The assessment metrics rely on quality indicators that allow the quality of a data source to be assessed with respect to the criteria [Flemming 2010].

Data Quality Assessment Method A data quality assessment methodology is defined as the process of evaluating whether a piece of data meets the information consumer's needs in a specific use case [Bizer and Cyganiak 2009]. The process involves measuring the quality dimensions that are relevant to the user and comparing the assessment results with the user's quality requirements.
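To make that process concrete, the following is a minimal sketch (not part of the surveyed methodologies): each relevant dimension receives a score normalized to [0, 1], the consumer supplies weights reflecting their priorities, and the weighted score is compared against a quality threshold. All names and the weighting scheme are illustrative assumptions.

```python
def assess_quality(scores: dict[str, float],
                   weights: dict[str, float],
                   threshold: float) -> bool:
    """Return True if the weighted quality score meets the user's requirement."""
    total_weight = sum(weights.values())
    weighted = sum(scores.get(dim, 0.0) * w for dim, w in weights.items())
    return (weighted / total_weight) >= threshold

# Example: a consumer who cares mostly about completeness and accuracy.
scores = {"completeness": 0.8, "accuracy": 0.9, "timeliness": 0.4}
weights = {"completeness": 0.5, "accuracy": 0.4, "timeliness": 0.1}
print(assess_quality(scores, weights, threshold=0.7))  # True
```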

Based on this... A systematic review retrieved 118 relevant articles published between 2002 and 2012 that focus on assessing the quality of Linked Open Data. After analyzing these 118 retrieved articles, a total of 21 papers were found relevant for the survey.

The results were data quality dimensions (criteria) that can be applied to assess the quality of Linked Data sources. The dimensions were grouped following a classification introduced in [Wang and Strong 1996], which is further modified and extended as:

Contextual Dimensions Trust Dimensions Intrinsic Dimensions Accessibility Dimensions Representational Dimensions Dataset Dynamicity

Contextual Dimensions Are those that highly depend on the context of the task at hand as well as on the subjective preferences of the data consumer. Completeness Amount-of-data Relevancy

Completeness The degree to which information is not missing [Bizer 2007]. ▫Schema Completeness ▫Data Completeness
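A minimal sketch of how data completeness could be measured for an RDF dataset, assuming rdflib and a local copy of the data; the class (foaf:Person) and the list of expected properties are illustrative assumptions, not prescribed by the survey.

```python
from rdflib import Graph, RDF, URIRef

g = Graph()
g.parse("dataset.ttl", format="turtle")  # assumed local copy of the dataset

person = URIRef("http://xmlns.com/foaf/0.1/Person")
expected = [URIRef("http://xmlns.com/foaf/0.1/name"),
            URIRef("http://xmlns.com/foaf/0.1/mbox")]

instances = set(g.subjects(RDF.type, person))
for prop in expected:
    # Count instances that have at least one value for the expected property.
    covered = sum(1 for s in instances if (s, prop, None) in g)
    ratio = covered / len(instances) if instances else 0.0
    print(f"{prop}: {ratio:.2%} of instances have a value")
```

Schema completeness could be approximated analogously, by checking which classes and properties of a reference schema appear in the dataset at all.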

Amount-of-data Refers to the quantity and volume of data that is appropriate for a particular task.

Relevance Refers to the provision of information which is in accordance with the task at hand and important to the user's query. The extent to which information is applicable and helpful for the task at hand [Bizer 2007].

Relations between dimensions... For instance, if a dataset is incomplete for a particular purpose, the amount of data is often insufficient. However, if the amount of data is too large, irrelevant data may be provided, which affects the relevance dimension.

Trust Dimensions Focus on the trustworthiness of the dataset. Provenance Verifiability Reputation Believability Licensing

Provenance Refers to the contextual metadata that focuses on how to represent, manage and use information about the origin of the source. Provenance helps to describe entities to enable trust, assess authenticity and allow reproducibility.

Verifiability Refers to the degree by which a data consumer can assess the correctness of a dataset and, as a consequence, its trustworthiness.

Reputation Is a judgement made by a user to determine the integrity of a source. It is mainly associated with a data publisher, a person, an organisation, a group of people or a community of practice, rather than being a characteristic of a dataset. The data publisher should be identifiable for a certain (part of a) dataset. It is usually a score between 0 (low) and 1 (high).

Believability Is defined as the degree to which the information is accepted to be correct, true, real and credible. It can be subjectively measured by analysing the provenance information of the dataset. Can be seen as expected accuracy.

Licensing Is defined as the granting of permission for a consumer to re-use a dataset under defined conditions.

Relations between dimensions... Verifiability is related to the believability dimension but differs from it: even though verification can determine whether information is correct or incorrect, belief is the degree to which a user thinks the information is correct.

Relations between dimensions... The provenance information associated with a dataset assists a user in verifying the believability of a dataset. Therefore, believability is affiliated with the provenance of a dataset. Licensing is also part of the provenance of a dataset and contributes towards its believability.

Relations between dimensions... These relations are highly dependent on the context of the task at hand.

Intrinsic Dimensions Are those that are independent of the user's context. These dimensions focus on whether information correctly represents the real world and whether information is logically consistent in itself. Accuracy Objectivity Validity-of-documents Interlinking Consistency Conciseness

Accuracy Can be defined as the extent to which data is correct, that is, the degree to which it correctly represents the real world facts and is also free of error. In particular, we associate accuracy mainly to semantic accuracy which relates to the correctness of a value to the actual real world value, that is, accuracy of the meaning.

Objectivity Is defined as the degree to which the interpretation and usage of data is unbiased, unprejudiced and impartial. This dimension highly depends on the type of information and is therefore classified as a subjective dimension.

Validity-of-documents Refers to the valid usage of the underlying vocabularies and the valid syntax of the documents (syntactic accuracy).
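One way to approximate this dimension is a syntax check: try to parse each document and count failures. The sketch below assumes rdflib and local example files; checking valid vocabulary usage (e.g. undefined properties) would need an additional step.

```python
from rdflib import Graph

# Assumed local documents and their serialization formats (illustrative).
documents = [("doc1.ttl", "turtle"), ("doc2.rdf", "xml")]

invalid = []
for path, fmt in documents:
    try:
        Graph().parse(path, format=fmt)   # parse failure => syntactically invalid
    except Exception as err:
        invalid.append((path, str(err)))

validity_ratio = 1 - len(invalid) / len(documents)
print(f"Syntactically valid documents: {validity_ratio:.0%}")
```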

Interlinking Refers to the degree to which entities that represent the same concept are linked to each other.
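A possible indicator, sketched below under the assumption that interlinking is expressed with owl:sameAs and that "external" means a different host than the dataset's own: the fraction of subjects carrying at least one such link. The file name and host are illustrative.

```python
from urllib.parse import urlparse
from rdflib import Graph, OWL, URIRef

g = Graph()
g.parse("dataset.ttl", format="turtle")
local_host = "example.org"  # assumed host of the dataset's own URIs

subjects = {s for s in g.subjects() if isinstance(s, URIRef)}
linked = {s for s, o in g.subject_objects(OWL.sameAs)
          if isinstance(o, URIRef) and urlparse(str(o)).netloc != local_host}

if subjects:
    ratio = len(linked & subjects) / len(subjects)
    print(f"Subjects with external owl:sameAs links: {ratio:.2%}")
```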

Consistency Means that a knowledge base is free of (logical / formal) contradictions with respect to a particular knowledge representation and inference mechanism. A more generic definition is that "a dataset is consistent if it is free of conflicting information" [Mendes 2012].
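Full consistency checking requires a reasoner, but a narrow check can already be sketched: flag subjects that carry more than one value for a property intended to be functional. The property (dbo:birthDate) and the functionality assumption are illustrative.

```python
from collections import defaultdict
from rdflib import Graph, URIRef

g = Graph()
g.parse("dataset.ttl", format="turtle")

birth_date = URIRef("http://dbpedia.org/ontology/birthDate")  # assumed functional
values = defaultdict(set)
for s, o in g.subject_objects(birth_date):
    values[s].add(o)

conflicting = [s for s, objs in values.items() if len(objs) > 1]
print(f"Subjects with conflicting birth dates: {len(conflicting)}")
```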

Conciseness Refers to the redundancy of entities, be it at the schema or the data level. Thus, conciseness can be classified into (i) intensional conciseness (schema level), which refers to redundant attributes (e.g. two equivalent attributes with different names), and (ii) extensional conciseness (data level), which refers to redundant objects (e.g. two equivalent objects with different identifiers).
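A minimal sketch of an extensional-conciseness indicator, assuming that two instances sharing the same value for an identifying property (here foaf:name, an illustrative choice) are treated as potential duplicates; a ratio below 1.0 suggests redundant objects.

```python
from rdflib import Graph, URIRef

g = Graph()
g.parse("dataset.ttl", format="turtle")

name = URIRef("http://xmlns.com/foaf/0.1/name")  # assumed identifying property
ident = {}
for s, o in g.subject_objects(name):
    ident.setdefault(s, o)   # keep one identifying value per subject

if ident:
    ratio = len(set(ident.values())) / len(ident)
    print(f"Extensional conciseness (unique / total objects): {ratio:.2f}")
```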

Relations between dimensions... Objectivity overlaps with the concept of accuracy but differs from it because the concept of accuracy does not heavily depend on the consumers’ preference.

Relations between dimensions... Objectivity is also related to the verifiability dimension, that is, the more verifiable a source is, the more objective it will be. Accuracy, in particular the syntactic accuracy of the documents, is related to the validity-of-documents dimension.

Accessibility Dimensions Dimensions belonging to this category involve aspects related to the way data can be accessed and retrieved. Availability Performance Security Response-time

Availability Availability of a dataset is the extent to which information is present, obtainable and ready for use.
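A minimal sketch of an availability probe, assuming the dataset exposes a SPARQL endpoint or dump URL that can be checked with an HTTP request; in practice availability would be averaged over many probes over time. The URL is only an example.

```python
import requests

def is_available(url: str, timeout: float = 10.0) -> bool:
    """Return True if the endpoint answers an HTTP HEAD request successfully."""
    try:
        return requests.head(url, timeout=timeout, allow_redirects=True).ok
    except requests.RequestException:
        return False

print(is_available("https://dbpedia.org/sparql"))
```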

Performance Refers to the efficiency of a system that binds to a large dataset, that is, the more performant a data source, the more efficiently a system can process its data.

Security Refers to "the possibility to restrict access to the data and to guarantee the confidentiality of the communication between a source and its consumers" [Flemming 2010].

Response-time Measures the delay, usually in seconds, between submission of a query by the user and reception of the complete response from the dataset.
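A minimal sketch of a response-time measurement, assuming a public SPARQL endpoint (DBpedia is used only as an example) and a trivial query; a real assessment would repeat the measurement and report an average.

```python
import time
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("https://dbpedia.org/sparql")
endpoint.setQuery("SELECT * WHERE { ?s ?p ?o } LIMIT 10")
endpoint.setReturnFormat(JSON)

start = time.monotonic()
endpoint.query().convert()          # submit the query and read the full response
elapsed = time.monotonic() - start
print(f"Response time: {elapsed:.2f} s")
```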

Relations between dimensions... Performance of a system is related to the availability and response-time dimensions: a dataset can perform well only if it is available and has a low response time. Security is also related to the availability of a dataset because the methods used for restricting users are tied to the way a user can access a dataset.

Representational Dimensions Capture aspects related to the design of the data. Representational-conciseness Representational-consistency Understandability Interpretability Versatility

Representational-conciseness Refers to a representation of the data that is compact and well formatted on the one hand, but also clear and complete on the other. "The extent to which information is compactly represented" [Bizer 2007].

Representational-consistency The extent to which information is represented in the same format [Bizer 2007]. Can be assessed by detecting whether the dataset re-uses existing vocabularies or terms from existing established vocabularies to represent its entities.
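A possible sketch of that assessment, assuming a fixed list of established vocabularies (the namespaces below are illustrative): the share of predicates in the dataset that come from those vocabularies.

```python
from rdflib import Graph

g = Graph()
g.parse("dataset.ttl", format="turtle")

established = ("http://xmlns.com/foaf/0.1/",
               "http://purl.org/dc/terms/",
               "http://www.w3.org/2004/02/skos/core#")  # assumed reference list

predicates = {str(p) for p in g.predicates()}
reused = {p for p in predicates if p.startswith(established)}
print(f"Predicates from established vocabularies: {len(reused)}/{len(predicates)}")
```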

Understandability Refers to the ease with which data can be comprehended, without ambiguity, and used by a human consumer. Thus, this dimension can also be referred to as the comprehensibility of the information where the data should be of sufficient clarity in order to be used.

Interpretability Refers to technical aspects of data, that is whether information is represented using an appropriate notation and whether it conforms to the technical ability of the consumer. Is defined as the "extent to which information is in appropriate languages, symbols, and units, and the definitions are clear" [Bizer 2007].

Versatility Mainly refers to the alternative representations of data and its subsequent handling. Additionally, versatility also corresponds to the provision of alternative access methods for a dataset. Is defined as "the alternative representations of the data and its handling" [Flemming 2010].

Relations between dimensions... Understandability is related to the interpretability dimension as it refers to the subjective capability of the information consumer to comprehend information. Interpretability mainly refers to the technical aspects of the data, that is, whether it is correctly represented. Versatility is also related to the interpretability of a dataset: the more versatile the forms a dataset is represented in (e.g. in different languages), the more interpretable the dataset is.

Relations between dimensions... Representational-consistency is related to the validity-of-documents, an intrinsic dimension, because the invalid usage of vocabularies may lead to inconsistency in the documents.

Dataset Dynamicity An important aspect of data is its update over time. The main dimensions related to the dynamicity of a dataset are: Currency Volatility Timeliness

Currency Refers to the speed with which the information (state) is updated after the real-world information changes.

Volatility Can be defined as the length of time during which the data remains valid.

Timeliness Refers to the time point at which the data is actually used. This can be interpreted as whether the information is available in time to be useful.

Relations between dimensions... Timeliness depends on both the currency and volatility dimensions. Currency is related to volatility since currency of data depends on how volatile it is. Thus, data that is highly volatile must be current, while currency is less important for data with low volatility.
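One way to operationalize this relation, following a common formulation from the data-quality literature rather than a definition given in these slides: data is timely to the degree that its age (currency) is small relative to how long it stays valid (volatility).

```python
def timeliness(currency_age_days: float, volatility_days: float,
               sensitivity: float = 1.0) -> float:
    """Score in [0, 1]; 1 means perfectly timely, 0 means the data has expired."""
    return max(0.0, 1.0 - currency_age_days / volatility_days) ** sensitivity

print(timeliness(currency_age_days=2, volatility_days=30))   # fresh data -> ~0.93
print(timeliness(currency_age_days=45, volatility_days=30))  # already expired -> 0.0
```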

Conclusion There are many ways to assess the quality of a dataset. This work is an initial study of the state of the art of quality criteria for Linked Data sources. Evaluating the quality of a data source should always take into consideration the quality requirements of the user.

References
[Bizer 2007] BIZER, C. Quality-Driven Information Filtering in the Context of Web-Based Information Systems. PhD thesis, Freie Universität Berlin, March 2007.
[Bizer and Cyganiak 2009] BIZER, C., AND CYGANIAK, R. Quality-driven information filtering using the WIQA policy framework. Web Semantics 7, 1 (Jan 2009), 1–10.
[Flemming 2010] FLEMMING, A. Quality characteristics of linked data publishing datasources. Master's thesis, Humboldt-Universität zu Berlin, 2010.
[Hogan 2010] HOGAN, A., HARTH, A., PASSANT, A., DECKER, S., AND POLLERES, A. Weaving the pedantic web. In LDOW (2010).
[Juran 1974] JURAN, J. The Quality Control Handbook. McGraw-Hill, New York, 1974.
[Knight and Burn 2005] KNIGHT, S., AND BURN, J. Developing a framework for assessing information quality on the world wide web. Information Science 8 (2005), 159–172.
[Mendes 2012] MENDES, P., MÜHLEISEN, H., AND BIZER, C. Sieve: Linked data quality assessment and fusion. In LWDM (March 2012).
[Wang and Strong 1996] WANG, R. Y., AND STRONG, D. M. Beyond accuracy: what data quality means to data consumers. Journal of Management Information Systems 12, 4 (1996), 5–33.
[Wang 1996] WAND, Y., AND WANG, R. Y. Anchoring data quality dimensions in ontological foundations. Communications of the ACM 39, 11 (1996), 86–95.
[Zaveri et al. 2012] ZAVERI, A., RULA, A., MAURINO, A., PIETROBON, R., et al. Quality Assessment Methodologies for Linked Open Data. Semantic Web Journal (2012).

Thank you!!