Big Data Quality: the next semantic challenge
Maria Teresa PAZIENZA, academic year 2017-18
(BIG) DATA IS ONLY AS USEFUL AS ITS QUALITY
Introduction
Since Big Data is big and messy, its challenges can be classified into engineering tasks (managing data at an unimaginable scale) and semantic tasks (finding and meaningfully combining the information that is relevant to your needs).
Challenges for Big Data
Identify relevant pieces of information in messy data:
- Named entity resolution (event extraction in tweets and other short texts)
- Coreference resolution (deciding whether two mentions refer to each other; indexing billions of RDF triples; easy-to-use data formats such as RDF/RDFS and OWL)
- Information extraction (difficult to scale)
- Paraphrase resolution (it aims at identifying the entry in a given knowledge base to which an entity mention in a document refers)
- Ontology population, entity consolidation (organizing extracted tuples in a queryable form, such as instances of ontologies, tuples of a database schema, or sets of quads: subject, predicate, object, context)
Basic assumptions
- Datasets published on the web of data cover a diverse set of domains.
- Data on the web reveal a large variation in data quality.
- Data extracted from semi-structured sources (DBpedia etc.) often contain inconsistencies as well as misrepresented and incomplete information.
- Even datasets with quality problems might be useful for certain applications, as long as the quality is in the range required by the application context.
Variety as a Big Data issue
Variety as a Big Data issue is distinct in that established small-scale methods are insufficient. The Big Data notion of variety is a generalization of semantic heterogeneity, which has been studied for many years in the fields of databases, artificial intelligence, the semantic web and cognitive science.
Quality on the Web: specific aspects
- Coherence via links to external datasets
- Data representation quality
- Consistency with regard to implicit information (inference mechanisms for knowledge representation formalisms on the web, such as OWL, usually follow an open world assumption, whereas databases usually adopt closed world semantics)
- Ontology quality
- No consensus on how data quality dimensions and metrics should be defined
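The open world versus closed world contrast can be illustrated with a minimal sketch. The toy triple store and the two function names are hypothetical, and the open world version is deliberately simplified: it only distinguishes "stated" from "unknown" and performs no inference.

```python
# Hypothetical toy knowledge base of (subject, predicate, object) triples.
KNOWN_TRIPLES = {
    ("Rome", "capitalOf", "Italy"),
    ("Paris", "capitalOf", "France"),
}

def ask_closed_world(triple):
    """Closed world (database style): anything not stored is false."""
    return triple in KNOWN_TRIPLES

def ask_open_world(triple):
    """Open world (OWL style): anything not stored is unknown, not false."""
    return True if triple in KNOWN_TRIPLES else None

query = ("Berlin", "capitalOf", "Germany")
print(ask_closed_world(query))  # False: the fact is missing, so it is denied
print(ask_open_world(query))    # None: the fact is missing, so it is unknown
```

A consistency check written for a database can therefore report a violation where an OWL reasoner would only report missing information.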
Quality on the Web: specific aspects
The challenges are related to the openness of the web of data, the diversity of the information, and the unbounded, dynamic set of autonomous data sources and publishers.
Dimensions of data quality
Organized into two categories: contextual, referring to attributes that depend on the context in which the data are observed or used, and intrinsic, referring to attributes that are objective and native to the data.
Contextual dimensions of data quality
These include at least relevancy, value added, quantity, believability, accessibility, understandability, availability, verifiability and reputation of the data. Contextual dimensions of data quality lend themselves more to information than to data, because they are formed by placing data within a situation or problem-specific context.
Intrinsic dimensions of data quality
Intrinsic data quality has 4 dimensions:
- Accuracy (degree to which data are equivalent to their corresponding "real" values)
- Timeliness (degree to which data are up-to-date: currency, the length of time since the record's last update, and volatility, which describes the frequency of updates)
- Consistency (degree to which related data records match in terms of format and structure)
- Completeness (degree to which data are full and complete in content, with no missing data)
Example: an address
Intrinsic dimensions of data quality

| Data quality dimension | Description | Supply chain example |
| --- | --- | --- |
| Accuracy | Are the data free of errors? | Customer shipping address in a customer relationship management system matches the address on the most recent customer order |
| Timeliness | Are the data up-to-date? | Inventory management system reflects real-time inventory levels at each retail location |
| Consistency | Are the data presented in the same format? | All requested delivery dates are entered in a DD/MM/YY format |
| Completeness | Are necessary data missing? | Customer shipping address includes all data points necessary to complete a shipment (i.e. name, street address, city, state, and zip code) |

Table 1. Dimensions of data quality.
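As a rough illustration of how three of these dimensions might be measured, the sketch below scores hypothetical customer records. The field names, the sample records and the scoring choices are assumptions made for the example. Accuracy is left out because it needs a reference "real" value to compare against.

```python
from datetime import date, datetime

# Hypothetical customer records; field names are illustrative only.
REQUIRED_FIELDS = ("name", "street", "city", "state", "zip")

records = [
    {"name": "Ann", "street": "1 Main St", "city": "Rome", "state": "RM",
     "zip": "00100", "delivery": "05/03/18", "updated": date(2018, 3, 1)},
    {"name": "Bob", "street": "", "city": "Milan", "state": "MI",
     "zip": "20100", "delivery": "2018-03-07", "updated": date(2017, 1, 15)},
]

def completeness(record):
    """Fraction of required address fields that are present and non-empty."""
    filled = sum(1 for field in REQUIRED_FIELDS if record.get(field))
    return filled / len(REQUIRED_FIELDS)

def is_consistent(record):
    """True if the requested delivery date follows the DD/MM/YY format."""
    try:
        datetime.strptime(record["delivery"], "%d/%m/%y")
        return True
    except ValueError:
        return False

def currency_days(record, today):
    """Currency: number of days elapsed since the record's last update."""
    return (today - record["updated"]).days

today = date(2018, 3, 31)
for rec in records:
    print(rec["name"], completeness(rec), is_consistent(rec), currency_days(rec, today))
```

With these sample records, the second one is less complete (empty street), inconsistent (ISO date instead of DD/MM/YY) and less current than the first.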
The question from knowledge management experts
Can Big Data leverage semantics? Yes.
Commonly used data in the Big Data (BD) context:
- Data generated by humans (mainly disseminated through web tools such as social networks, cookies, emails, ...)
- Data generated by connected objects
The Internet of human beings and the Internet of Things together become a mix of big data that must be targeted in order to understand, plan and act in a predictive way.
Bidirectionality
The relation between Big Data and semantics is bidirectional: just as BD leverages semantics, some semantic tasks are optimized by using tools designed for processing large data sets.
Challenges for Big Data
a) Meaningful data integration challenges:
- Define the problem to solve
- Identify relevant pieces of data in Big Data
- ETL the data into appropriate formats and store them for processing
- Disambiguate the data
- Solve the problem
Challenges for Big Data
b) The Billion Triple Challenge, which aims to process a large-scale target vocabulary and to link that entity to the corresponding sources
c) The Linked Open Data ripper, for providing good use cases for LOD and for being able to link them with non-LOD data efficiently
d) The value of the use of semantics in data integration and in the design of future DBMS
Challenges for Big Data
Semantics could be considered a magic word for bridging the gap created by the heterogeneity of data. Semantics can be used in a decidable system, which makes it possible to:
- detect inconsistency in data,
- generate new knowledge using an inference engine,
- or simply link more accurately specific data that are not relevant for machine-learning-based techniques.
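The first two capabilities can be sketched on a handful of triples. This is an illustration under stated assumptions, not a real reasoner: the triples are invented, `bornIn` is assumed to be functional (at most one value per subject, with the two values denoting different entities), and `locatedIn` is assumed to be transitive.

```python
# Hypothetical triples; the predicates and their properties are assumptions.
triples = [
    ("Dante", "bornIn", "Florence"),
    ("Dante", "bornIn", "Ravenna"),      # conflicts with the triple above
    ("Florence", "locatedIn", "Tuscany"),
    ("Tuscany", "locatedIn", "Italy"),
]
FUNCTIONAL = {"bornIn"}      # at most one object per subject
TRANSITIVE = {"locatedIn"}   # (a, p, b) and (b, p, c) imply (a, p, c)

def find_inconsistencies(triples):
    """Return (subject, predicate, values) where a functional predicate has several values."""
    seen = {}
    for s, p, o in triples:
        if p in FUNCTIONAL:
            seen.setdefault((s, p), set()).add(o)
    return [(s, p, sorted(vals)) for (s, p), vals in seen.items() if len(vals) > 1]

def infer_transitive(triples):
    """Return only the new triples implied by transitive predicates (fixed-point closure)."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        for s1, p1, o1 in list(facts):
            for s2, p2, o2 in list(facts):
                if p1 == p2 and p1 in TRANSITIVE and o1 == s2:
                    new = (s1, p1, o2)
                    if new not in facts:
                        facts.add(new)
                        changed = True
    return facts - set(triples)

print(find_inconsistencies(triples))  # the two birthplaces recorded for Dante
print(infer_transitive(triples))      # Florence locatedIn Italy
```

The nested loop is quadratic in the number of facts, which is fine for a toy example but hints at why inference at Big Data scale is an engineering challenge as well as a semantic one.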
Challenges for Big Data
Determine the quality of datasets published on the web and make this quality information explicit. Assuring data quality is a particular challenge in LOD, as it involves a set of autonomously evolving data sources.
Information quality criteria for:
- Web documents: page trustworthiness versus page rank
- Structured information: correctness of facts, adequacy of semantic representation, degree of coverage
Trustworthiness of web sources
The trustworthiness (or accuracy) of a web source is defined as the probability that it contains the correct value for a fact, assuming that it mentions any value for that fact. Trustworthiness is orthogonal to PageRank.
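Under the strong assumption that the correct values are already known, this conditional probability can be estimated as a simple ratio, as in the sketch below. The `TRUTH` dictionary, the fact identifiers and the sample source are hypothetical. In practice the true values are not given and have to be estimated together with the trustworthiness of the sources.

```python
# Hypothetical ground truth, keyed by fact identifier.
TRUTH = {
    "capital_of_italy": "Rome",
    "capital_of_france": "Paris",
    "capital_of_spain": "Madrid",
}

def trustworthiness(claims, truth=TRUTH):
    """Estimate P(value is correct | the source mentions a value) as a ratio."""
    judged = [fact for fact in claims if fact in truth]
    if not judged:
        return None  # the source mentions no fact that can be checked
    correct = sum(1 for fact in judged if claims[fact] == truth[fact])
    return correct / len(judged)

# The source says nothing about Spain, so that fact does not count against it.
source_a = {"capital_of_italy": "Rome", "capital_of_france": "Lyon"}
print(trustworthiness(source_a))  # 0.5
```

Note that the score depends only on the values the source states, not on how many pages link to it, which is why it is independent of PageRank.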
Data quality assessment methodology
A data quality assessment methodology is defined as the process of evaluating whether a piece of data meets the information consumer's needs in a specific case. The process involves measuring the quality dimensions that are relevant to the user and comparing the assessment results with the user's quality requirements.
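The measure-then-compare step can be sketched as follows. The scores, the thresholds and the `assess` helper are hypothetical; the sketch assumes every dimension has already been measured on a 0 to 1 scale and that each user states a minimum acceptable score per relevant dimension.

```python
def assess(measured, requirements):
    """Compare measured scores (0..1) against a user's minimum thresholds.

    Only the dimensions listed in `requirements` are checked.
    Returns (fit_for_use, failures), where failures maps
    dimension -> (measured score, required minimum).
    """
    failures = {}
    for dimension, minimum in requirements.items():
        score = measured.get(dimension, 0.0)  # an unmeasured dimension counts as 0
        if score < minimum:
            failures[dimension] = (score, minimum)
    return (not failures, failures)

# Hypothetical measurements for one dataset and two application contexts.
measured = {"accuracy": 0.92, "completeness": 0.80, "timeliness": 0.60}
analytics_needs = {"accuracy": 0.90, "completeness": 0.75}
shipping_needs = {"accuracy": 0.90, "completeness": 0.95}

print(assess(measured, analytics_needs))  # (True, {})
print(assess(measured, shipping_needs))   # (False, {'completeness': (0.8, 0.95)})
```

The same dataset passes for one consumer and fails for another, which matches the earlier point that data with quality problems can still be useful when the quality is in the required range.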