Big Data Quality: the next semantic challenge

Presentation transcript:

Big Data Quality: the next semantic challenge
Maria Teresa PAZIENZA, academic year 2017-18

(BIG) DATA IS ONLY AS USEFUL AS ITS QUALITY

Introduction
Since Big Data is big and messy, its challenges can be classified into engineering tasks (managing data at an unimaginable scale) and semantic tasks (finding and meaningfully combining the information that is relevant to your needs).

Challenges for Big Data
Identify relevant pieces of information in messy data:
- Named entity resolution (e.g., event extraction from tweets and other short texts)
- Coreference resolution (deciding whether two mentions refer to the same entity; indexing billions of RDF triples; easy-to-use data formats such as RDF/RDFS and OWL)
- Information extraction (difficult to scale)
- Paraphrase resolution (identifying the entry in a given knowledge base to which an entity mention in a document refers)
- Ontology population and entity consolidation (organizing the extracted tuples into a queryable form such as instances of an ontology, tuples of a database schema, or sets of quads: subject, predicate, object, context; see the sketch below)
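Since several of these tasks produce RDF triples or quads, here is a minimal sketch (assuming Python with the rdflib library; the namespace, graph name and facts are invented for illustration) of how extracted facts can be stored as quads, i.e. subject, predicate, object plus a context recording their provenance:

    # Hypothetical example: facts extracted from tweets, stored as quads with rdflib.
    from rdflib import Dataset, Literal, Namespace, URIRef

    EX = Namespace("http://example.org/")                      # invented namespace
    ds = Dataset()                                             # store for (s, p, o, context) quads
    tweets = ds.graph(URIRef("http://example.org/graph/tweets"))  # context = provenance

    # Facts extracted from a tweet, consolidated onto one entity URI
    tweets.add((EX.Rome, EX.hostsEvent, EX.Concert42))
    tweets.add((EX.Concert42, EX.date, Literal("2017-10-01")))

    # Iterate over quads: subject, predicate, object, context
    for s, p, o, ctx in ds.quads((None, None, None, None)):
        print(s, p, o, ctx)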

Basic assumptions
- Datasets published on the web of data cover a diverse set of domains.
- Data on the web shows a large variation in data quality.
- Data extracted from semi-structured sources (DBpedia, etc.) often contains inconsistencies as well as misrepresented and incomplete information.
- Even datasets with quality problems can be useful for certain applications, as long as the quality is within the range required by the application context.

Variety as a Big Data issue
What makes variety distinct as a Big Data issue is that established small-scale methods are insufficient for handling it. The Big Data notion of variety is a generalization of semantic heterogeneity, as studied for many years in the fields of databases, artificial intelligence, the semantic web and cognitive science.

Quality on the Web: specific aspects
- Coherence via links to external datasets
- Data representation quality
- Consistency with regard to implicit information (inference mechanisms for knowledge representation formalisms on the web, such as OWL, usually follow an open world assumption, whereas databases usually adopt closed world semantics; see the sketch below)
- Ontology quality
- No consensus on how data quality dimensions and metrics should be defined
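To make the open world versus closed world contrast concrete, here is a minimal sketch (plain Python, with a hypothetical fact base) of how the two semantics treat a fact that is simply not stated:

    # Hypothetical fact base: only one fact is explicitly asserted.
    known_facts = {("Rome", "capitalOf", "Italy")}

    def holds_closed_world(fact):
        # Database-style semantics: anything not stated is assumed false.
        return fact in known_facts

    def holds_open_world(fact):
        # Web/OWL-style semantics: anything not stated is simply unknown.
        return True if fact in known_facts else "unknown"

    query = ("Berlin", "capitalOf", "Italy")
    print(holds_closed_world(query))  # False: absence of the fact counts as negation
    print(holds_open_world(query))    # "unknown": absence of the fact proves nothing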

Quality on the Web: specific aspects
The challenges are related to the openness of the web of data, the diversity of the information, and the unbounded, dynamic set of autonomous data sources and publishers.

Dimensions of data quality
Organized into two categories: contextual, referring to attributes that depend on the context in which the data are observed or used, and intrinsic, referring to attributes that are objective and native to the data.

Contextual dimensions of data quality
Include at least relevancy, value-added, quantity, believability, accessibility, understandability, availability, verifiability and reputation of the data. Contextual dimensions lend themselves more towards information than data, because they are formed by placing data within a situation- or problem-specific context.

Intrinsic dimensions of data quality
Intrinsic data quality has four dimensions:
- Accuracy: degree to which data are equivalent to their corresponding «real» values
- Timeliness: degree to which data are up-to-date; measured via currency (length of time since the record's last update) and volatility (the frequency of updates)
- Consistency: degree to which related data records match in terms of format and structure
- Completeness: degree to which data are full and complete in content, with no missing data (e.g., an address)

Intrinsic dimensions of data quality
Table 1. Dimensions of data quality, with supply chain examples.
- Accuracy: Are the data free of errors? Example: the customer shipping address in a customer relationship management system matches the address on the most recent customer order.
- Timeliness: Are the data up-to-date? Example: the inventory management system reflects real-time inventory levels at each retail location.
- Consistency: Are the data presented in the same format? Example: all requested delivery dates are entered in a DD/MM/YY format.
- Completeness: Are necessary data missing? Example: the customer shipping address includes all data points necessary to complete a shipment (i.e. name, street address, city, state, and zip code).
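As a rough illustration, a minimal sketch (plain Python, with hypothetical field names) of how the completeness and consistency dimensions from Table 1 could be checked on a single shipping record:

    import re

    REQUIRED_FIELDS = ["name", "street", "city", "state", "zip_code"]   # hypothetical schema
    DATE_FORMAT = re.compile(r"^\d{2}/\d{2}/\d{2}$")                    # DD/MM/YY, as in Table 1

    def completeness(record):
        # Fraction of required fields that are present and non-empty.
        filled = sum(1 for f in REQUIRED_FIELDS if record.get(f))
        return filled / len(REQUIRED_FIELDS)

    def consistent_date(record):
        # True if the requested delivery date matches the agreed format.
        return bool(DATE_FORMAT.match(record.get("delivery_date", "")))

    order = {"name": "ACME", "street": "Via Roma 1", "city": "Rome",
             "state": "RM", "zip_code": "", "delivery_date": "07/05/14"}
    print(completeness(order))     # 0.8 -> one required field (zip_code) is missing
    print(consistent_date(order))  # True -> date respects DD/MM/YY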

The question from knowledge management experts
Can Big Data leverage semantics? Yes.
Commonly used data in the Big Data context:
- Data generated by humans (mainly disseminated through web tools such as social networks, cookies, emails, …)
- Data generated by connected objects
The Internet of human beings and the Internet of Things become a mix of big data that must be targeted in order to understand, plan and act in a predictive way.

Bidirectionality
The relation between Big Data and semantics is bidirectional: just as Big Data leverages semantics, some semantic tasks are optimized by using tools designed for processing large data sets.

Challenges for Big Data
a) Meaningful data integration challenges (a pipeline sketch follows below):
- Define the problem to solve
- Identify relevant pieces of data in Big Data
- ETL them into appropriate formats and store them for processing
- Disambiguate them
- Solve the problem
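A minimal sketch of the pipeline above (plain Python; the record layout, knowledge base and URI are invented for illustration):

    # Hypothetical in-memory "source" and knowledge base, just to make the sketch runnable.
    raw_records = [{"entity": " Rome ", "attribute": "population", "value": 2873000}]
    knowledge_base = {"rome": "http://example.org/Rome"}

    def transform(record):
        # Normalize a raw record into a common (subject, predicate, object) shape.
        return (record["entity"].strip().lower(), record["attribute"], record["value"])

    def disambiguate(triple, kb):
        # Map the surface form of the entity onto a canonical identifier, when known.
        subject, predicate, obj = triple
        return (kb.get(subject, subject), predicate, obj)

    store = []  # load step: persist cleaned triples for downstream processing
    for record in raw_records:                               # extract
        triple = transform(record)                           # transform
        store.append(disambiguate(triple, knowledge_base))   # disambiguate + load

    print(store)  # [('http://example.org/Rome', 'population', 2873000)]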

Challenges for Big Data
b) The Billion Triple Challenge, which aims to process a large-scale target vocabulary and to link each entity to the corresponding sources
c) The Linked Open Data ripper, for providing good use cases for LOD and for linking them efficiently with non-LOD data
d) The value of the use of semantics in data integration and in the design of future DBMSs

Challenges for Big Data
Semantics could be considered a magic word to bridge the gap created by the heterogeneity of data. Semantics can be used in a decidable system, which makes it possible to: detect inconsistencies in the data, generate new knowledge using an inference engine, or simply link more accurately specific data that is not relevant for machine-learning-based techniques.
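As an illustration of generating new knowledge with an inference engine, a minimal sketch (plain Python, hypothetical facts) of a single RDFS-style rule that derives a type statement from a subclass hierarchy:

    # Explicitly asserted facts (subject, predicate, object); invented example data.
    facts = {("Dog", "subClassOf", "Animal"),
             ("Rex", "type", "Dog")}

    def infer_types(kb):
        # One RDFS-style rule: if X type C and C subClassOf D, then X type D.
        derived = {(x, "type", d)
                   for (x, p1, c) in kb if p1 == "type"
                   for (c2, p2, d) in kb if p2 == "subClassOf" and c2 == c}
        return derived - kb

    print(infer_types(facts))  # {('Rex', 'type', 'Animal')} -> knowledge not explicitly stated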

Challenges for Big Data
Determine the quality of datasets published on the web and make this quality information explicit. Assuring data quality is particularly a challenge in LOD, as it involves a set of autonomously evolving data sources.
Information quality criteria for:
- Web documents: page trustworthiness versus PageRank
- Structured information: correctness of facts, adequacy of the semantic representation, degree of coverage

Trustworthiness of web sources
The trustworthiness or accuracy of a web source is the probability that it contains the correct value for a fact, given that it mentions some value for that fact. Trustworthiness is orthogonal to PageRank.
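A worked toy example of that conditional probability, with invented counts:

    # Toy estimate of source trustworthiness (hypothetical counts):
    # among the facts the source mentions, how many values are correct?
    facts_mentioned = 40          # facts for which the source provides some value
    facts_correct = 34            # of those, values that match the true value
    trustworthiness = facts_correct / facts_mentioned
    print(round(trustworthiness, 2))   # 0.85 -> P(correct value | a value is mentioned)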

Data quality assessment methodology
A data quality assessment methodology is defined as the process of evaluating whether a piece of data meets the information consumer's needs in a specific case. The process involves measuring the quality dimensions that are relevant to the user and comparing the assessment results with the user's quality requirements.
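A minimal sketch (plain Python, with hypothetical scores and thresholds) of that comparison step, checking measured dimension values against a user's quality requirements:

    # Measured quality dimensions for a dataset (invented values in [0, 1]).
    measured = {"accuracy": 0.92, "completeness": 0.78, "timeliness": 0.60}

    # The user's quality requirements for a specific use case (invented thresholds).
    required = {"accuracy": 0.90, "completeness": 0.80}

    def assess(measured, required):
        # A dimension passes if its measured value reaches the required threshold;
        # dimensions the user did not constrain are ignored.
        return {dim: measured.get(dim, 0.0) >= threshold
                for dim, threshold in required.items()}

    print(assess(measured, required))  # {'accuracy': True, 'completeness': False}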