
1 Quality and Repair
Pablo N. Mendes (Freie Universität Berlin), Giorgos Flouris (FORTH)
1st year review, Luxembourg, December 2011

2 Work Plan View: WP2 (timeline: months 0 to 36)
Tasks:
Task 2.1 Data quality assessment and repair
Task 2.2 Temporal, spatial and social aspects of data
Task 2.3 Recommendations for enhancing best practices for data publishing
Deliverables:
D2.1 Conceptual model and best practices for high-quality data publishing
D2.2 Methods for quality repair
D2.3 Modelling and processing contextual aspects of data
D2.4 Update of D2.1
D2.5 Proof-of-concept evaluation for modelling space and time
D2.6 Methods for assessing the quality of sensor data
D2.7 Recommendations for contextual data publishing

3 Upcoming deliverables
Quality Assessment: D2.1 - Conceptual model and best practices for high-quality metadata publishing
Quality Enhancement: D2.2 - Methods for quality repair

4 Outline
Overview of Quality
Data Quality Framework
Quality Assessment
Quality Enhancement (Repair)

5 Quality
“Fitness for use.” Joseph Juran. The Quality Control Handbook. McGraw-Hill, New York, 3rd edition, 1974.

6 Data Quality (example task: weather forecast)
Multifaceted: accurate = high quality? availability? timeliness?
Subjective: weekly updates are OK for me, for vacation planning.
Task-dependent: the data is not good if it is not available for online query; vacation planning or aviation?

7 Data Quality Dimensions (presentation order)
Intrinsic Dimensions: Accuracy, Consistency, Objectivity, Timeliness
Contextual Dimensions: Validity, Believability, Completeness, Understandability, Relevancy, Reputation, Verifiability, Amount of Data
Representational Dimensions: Interpretability, Rep. Conciseness, Rep. Consistency
Accessibility Dimensions: Availability, Response Time, Security

8 Data Quality Framework: Quality Assessment and Quality Enhancement

9 Dereferenceability (ACCESSIBILITY)
Indicator: Dereferenceable URIs. “Resources identified by URIs that respond with RDF to HTTP requests?”
Metrics, for a dataset d and resources r:
deref(d) = count(r | deref(r))
ratio-deref(d) = deref(d) / count(r in d)
Recommendation: Your URIs should be dereferenceable. Prefer reusing URIs that are dereferenceable.
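
A minimal sketch of how these two metrics could be computed; the resource URIs are placeholders, and treating "responds with RDF" as "HTTP 200 with an RDF media type" is an assumption of the sketch, not the project's exact test:

```python
import requests

RDF_TYPES = {"application/rdf+xml", "text/turtle", "application/n-triples"}

def is_dereferenceable(uri, timeout=10):
    """GET the URI asking for RDF; count a 200 with an RDF media type as success."""
    try:
        resp = requests.get(uri,
                            headers={"Accept": "application/rdf+xml, text/turtle"},
                            timeout=timeout, allow_redirects=True)
    except requests.RequestException:
        return False
    media_type = resp.headers.get("Content-Type", "").split(";")[0].strip()
    return resp.status_code == 200 and media_type in RDF_TYPES

def deref(dataset_uris):
    """deref(d): number of resources in d that dereference to RDF."""
    return sum(1 for uri in dataset_uris if is_dereferenceable(uri))

def ratio_deref(dataset_uris):
    """ratio-deref(d): fraction of dereferenceable resources in d."""
    return deref(dataset_uris) / len(dataset_uris) if dataset_uris else 0.0

# Hypothetical resource URIs; in practice, a sample drawn from the dataset.
uris = ["http://example.org/resource/1", "http://example.org/resource/2"]
print(deref(uris), ratio_deref(uris))
```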

10 Access methods (ACCESSIBILITY)
Indicator: Access methods. “Data is accessible in varied and recommended ways.”
Metrics:
sample(d): {0,1} “example resource available for d”
endpoint(d): {0,1} “SPARQL endpoint available for d”
dump(d): {0,1} “RDF dumps available for d”
Recommendation: Provide as many access methods as possible.
A sample resource provides a quick view into the type of data you serve.
SPARQL endpoints let clients obtain part of the data.
Dumps are cheaper than the alternatives when bulk access is needed.
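
A sketch of the three binary metrics, assuming the access points are known from the dataset's catalog entry; all URLs below are placeholders:

```python
import requests

def http_ok(url, timeout=10):
    """1 if the URL answers an HTTP GET with status 200, else 0."""
    try:
        return int(requests.get(url, timeout=timeout).status_code == 200)
    except requests.RequestException:
        return 0

# Hypothetical access points for a dataset d; in practice these would be
# read from its catalog entry (e.g., on thedatahub.org).
dataset = {
    "sample": "http://example.org/resource/1",
    "endpoint": "http://example.org/sparql?query=ASK%20%7B%3Fs%20%3Fp%20%3Fo%7D",
    "dump": "http://example.org/dumps/data.nt.gz",
}

scores = {name: http_ok(url) for name, url in dataset.items()}
print(scores)  # e.g. {'sample': 1, 'endpoint': 1, 'dump': 0}
```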

11 Availability (ACCESSIBILITY)
Indicator: Availability. “Average availability in a time interval.”
Metric:
avail(d) = Σ_{hour=1..24} deref(sample(d), hour) / 24
Alternatively, use httphead() instead of deref().
Recommendation: the higher the better.
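
A sketch of the aggregation, assuming a scheduler (e.g., a cron job) performs the hourly checks and persists the 24 results; the check values below are made up:

```python
import requests

def http_head_ok(url, timeout=10):
    """httphead(): 1 if an HTTP HEAD request succeeds with status 200."""
    try:
        resp = requests.head(url, timeout=timeout, allow_redirects=True)
        return int(resp.status_code == 200)
    except requests.RequestException:
        return 0

def avail(hourly_checks):
    """avail(d): mean of the 24 hourly dereference checks for one day."""
    return sum(hourly_checks) / len(hourly_checks)

# In practice, one http_head_ok(sample_uri) result per hour; made up here.
checks = [1] * 22 + [0, 1]
print(avail(checks))  # 0.9583...
```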

12 Accessibility Dimensions (examples)
Dereferenceability: HTTP GET / HEAD
Availability: hourly derefs
Access methods: URI, Bulk, SPARQL
Response time: timed deref
Robustness: requests per minute
Reachability: LOD cloud inlinks

13 Representational: Interpretability (REPRESENTATIONAL)
Indicator: Human/Machine interpretability. “URI is dereferenceable to human- and machine-readable formats.”
Metric:
format(deref(r)) ∈ F_h ∪ F_m : {0,1}
F_h = {HTML, XHTML+RDFa, ...} (human-readable)
F_m = {NT, RDF/XML, ...} (machine-readable)
Recommendation: Resources should dereference at least to human-readable HTML and one widely adopted RDF serialization.
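
A sketch of this check, assuming concrete media types stand in for F_h and F_m and following the recommendation that both a human-readable and a machine-readable format should be served; the URI is a placeholder:

```python
import requests

F_H = {"text/html", "application/xhtml+xml"}                      # human-readable
F_M = {"text/turtle", "application/rdf+xml", "application/n-triples"}  # machine-readable

def formats_served(uri, timeout=10):
    """Content-negotiate each known format; collect the ones actually served."""
    served = set()
    for media_type in F_H | F_M:
        try:
            resp = requests.get(uri, headers={"Accept": media_type}, timeout=timeout)
        except requests.RequestException:
            continue
        returned = resp.headers.get("Content-Type", "").split(";")[0].strip()
        if resp.status_code == 200 and returned == media_type:
            served.add(returned)
    return served

served = formats_served("http://example.org/resource/1")
interpretable = int(bool(served & F_H) and bool(served & F_M))
print(served, interpretable)
```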

14 Vocabulary understandability (REPRESENTATIONAL)
Indicator: Schema understandability. “Schema terms are familiar to existing agents.”
Metric:
vocab-underst(d) = triples(v,d) * triples(v,D) / triples(D)
Alternative: PageRank (the probability that a random surfer has found v).
Recommendation: Reuse widely deployed vocabularies.
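
A sketch of the score under one reading of the formula: each vocabulary v used in dataset d is weighted by its global usage share triples(v,D)/triples(D) over a corpus D such as the LOD cloud. All counts below are invented:

```python
# Hypothetical triple counts per vocabulary in the global corpus D.
global_triples = {"foaf": 4_000_000, "dcterms": 3_000_000, "myvocab": 50}
total_global = sum(global_triples.values())

# Hypothetical triple counts per vocabulary in the dataset d.
dataset_triples = {"foaf": 800, "myvocab": 200}

def vocab_underst(dataset_triples):
    """Sum each vocabulary's local usage weighted by its global usage share."""
    return sum(n * global_triples.get(v, 0) / total_global
               for v, n in dataset_triples.items())

print(vocab_underst(dataset_triples))  # dominated by the widely deployed foaf
```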

15 Representational Dimensions (examples)
Human/Machine Interpretability: HTML, RDF
Vocabulary Understandability: vocabulary usage stats
Representational Conciseness: triples / byte

16 Contextual Dimensions
Completeness: full set of objects and attributes with respect to a task
Conciseness: amount of duplicate entries, redundant attributes
Coherence: how well instance data conforms to the schema

17 Coherence (CONTEXTUAL DIMENSIONS)
“The level of structuredness of a dataset D with respect to a type T is determined by how well the instance data in D conform to type T.” (IBM paper at SIGMOD'11)
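
A deliberately simplified stand-in for that idea, not the paper's exact formula: for each instance of a type, take the fraction of the type's properties it actually sets, and average over the instances. Types, properties, and instances below are made up:

```python
# Properties the schema declares for each type (hypothetical).
TYPE_PROPS = {"WeatherStation": {"geo:lat", "geo:long", "rdfs:label"}}

# Hypothetical instance data: each instance sets some of the type's properties.
instances = [
    {"type": "WeatherStation", "props": {"geo:lat", "geo:long", "rdfs:label"}},
    {"type": "WeatherStation", "props": {"rdfs:label"}},
]

def coherence(type_name):
    """Average, over instances of the type, the fraction of declared properties set."""
    expected = TYPE_PROPS[type_name]
    members = [i for i in instances if i["type"] == type_name]
    return sum(len(i["props"] & expected) / len(expected) for i in members) / len(members)

print(coherence("WeatherStation"))  # (3/3 + 1/3) / 2 = 0.67
```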

18 Conciseness
“An increase in conciseness is achieved by removing redundant data, by fusing duplicate entries and merging common attributes into one.” (Data Fusion, Bleiholder et al., 2009)
Example: Given a set of instance mappings (sameAs), are all properties of the set assigned to only one URI?
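
A sketch of that sameAs check on tiny made-up mappings and property tables: group URIs into sameAs clusters, then test whether each cluster's properties sit on a single URI:

```python
same_as = [("ex:a", "ex:b")]    # hypothetical instance mappings
props = {"ex:a": {"rdfs:label"}, "ex:b": {"geo:lat"}, "ex:c": {"rdfs:label"}}

# Naive in-place merge into clusters; sufficient for this toy data,
# a real implementation would use proper union-find.
cluster_of = {}
for x, y in same_as:
    cluster = cluster_of.get(x) or cluster_of.get(y) or set()
    cluster.update({x, y})
    for uri in cluster:
        cluster_of[uri] = cluster

# A cluster is concise if at most one of its URIs carries properties.
for cluster in {id(c): c for c in cluster_of.values()}.values():
    carriers = [u for u in cluster if props.get(u)]
    verdict = "concise" if len(carriers) <= 1 else "properties split across URIs"
    print(sorted(cluster), verdict)
```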

19 Contextual Dimensions
Verifiability: how easy is it to check the data? Can use provenance information.
Validity: encodes context- or application-specific requirements.

20 Verifiability
Provides provenance:
Data is published by the producer
Creation date
Data generation process (manual, IE, sensor?)
Data transformations undergone

21 Intrinsic Dimensions
Accuracy: usually estimated; may be available for sensors
Timeliness: can use last update
Consistency: two or more values do not conflict with each other
Objectivity: can be traced via provenance

22 Other: Flemming & Hartig (Content, Representation, Usage, System)

23 Example: AEMET
Metadata entry: http://thedatahub.org/dataset/aemet
Example item: http://aemet.linkeddata.es/page/resource/WeatherStation/id08001?output=ttl
Access methods: Example URI, SPARQL, Bulk
Availability: Example URI: available; SPARQL endpoint: 100%
Format interpretability: TTL = OK, RDF/XML = OK
Verifiability: published by third party
Validator: http://www4.wiwiss.fu-berlin.de/lodcloud/ckan/validator/validate.php?package=aemet

24 Indicators for AEMET

25 Data Quality Framework: Quality Assessment and Quality Enhancement

26 Validity as a Quality Indicator
Validity is an important quality indicator:
Encodes context- or application-specific requirements
Applications may be useless over invalid data
Binary concept (valid/invalid)
Two steps to guarantee validity (the repair process):
1. Identifying invalid ontologies (diagnosis): detecting invalidities in an automated manner; a subtask of Quality Assessment
2. Removing invalidities (repair): repairing invalidities in an automated manner; a subtask of Quality Enhancement

27 Diagnosis
Express validity using validity rules over an adequate relational schema.
Examples:
Properties must have a unique domain:
∀p Prop(p) → ∃a Dom(p,a)
∀p,a,b Dom(p,a) ∧ Dom(p,b) → (a = b)
Correct classification in property instances:
∀x,y,p,a P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a)
∀x,y,p,a P_Inst(x,y,p) ∧ Rng(p,a) → C_Inst(y,a)
Diagnosis is thus reduced to relational queries (see the sketch below).
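
One hedged way to realize "diagnosis reduced to relational queries": load the facts into SQLite and find violations of the domain rule with a single SQL query. The table layout is an assumption of the sketch; the data matches the O0 example on the next slide:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Dom(p TEXT, a TEXT);
    CREATE TABLE P_Inst(x TEXT, y TEXT, p TEXT);
    CREATE TABLE C_Inst(x TEXT, a TEXT);
    INSERT INTO Dom VALUES ('geo:location', 'Sensor');
    INSERT INTO P_Inst VALUES ('Item1', 'ST1', 'geo:location');
    INSERT INTO C_Inst VALUES ('Item1', 'Observation'), ('ST1', 'SpatialThing');
""")

-- = None  # (placeholder removed; comments below are in SQL)
violations = db.execute("""
    -- P_Inst(x,y,p) AND Dom(p,a) without a matching C_Inst(x,a)
    SELECT pi.x, pi.p, d.a
    FROM P_Inst pi JOIN Dom d ON pi.p = d.p
    WHERE NOT EXISTS (SELECT 1 FROM C_Inst c WHERE c.x = pi.x AND c.a = d.a)
""").fetchall()

print(violations)  # [('Item1', 'geo:location', 'Sensor')]
```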

28 Example
Ontology O0:
Schema: Class(Sensor), Class(SpatialThing), Class(Observation); Prop(geo:location); Dom(geo:location,Sensor); Rng(geo:location,SpatialThing)
Data: Inst(Item1), Inst(ST1); P_Inst(Item1,ST1,geo:location); C_Inst(Item1,Observation), C_Inst(ST1,SpatialThing)
Violated rule (correct classification in property instances): ∀x,y,p,a P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a)
Sensor is the domain of geo:location, but Item1 is not a Sensor: P_Inst(Item1,ST1,geo:location) ∈ O0 and Dom(geo:location,Sensor) ∈ O0, but C_Inst(Item1,Sensor) ∉ O0.
Possible resolutions:
Remove P_Inst(Item1,ST1,geo:location)
Add C_Inst(Item1,Sensor)
Remove Dom(geo:location,Sensor)
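
A sketch of enumerating these minimal resolutions programmatically, mirroring the three options above; the tagged-tuple fact representation is invented for illustration:

```python
def resolutions(violation):
    """The three minimal resolutions for one domain-rule violation."""
    x, y, p, a = violation
    return [
        [("remove", ("P_Inst", x, y, p))],  # drop the property instance
        [("add",    ("C_Inst", x, a))],     # classify x under the domain class
        [("remove", ("Dom", p, a))],        # drop the domain axiom (schema change)
    ]

for delta in resolutions(("Item1", "ST1", "geo:location", "Sensor")):
    print(delta)
```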

29 Preferences for Repair
Which repairing option is best? The ontology engineer determines that via preferences:
Specified by the ontology engineer beforehand
High-level “specifications” for the ideal repair
Serve as “instructions” to determine the preferred solution

30 Preferences (On Ontologies)
Diagram: the candidate repair results O1, O2, O3 of ontology O0 are scored directly (scores 3, 4 and 6 in the example); the highest-scoring result is preferred.

31 Preferences (On Deltas)
Diagram: the deltas leading from O0 to O1, O2, O3 are scored instead (scores 2, 4 and 5 in the example); the deltas are -P_Inst(Item1,ST1,geo:location), +C_Inst(Item1,Sensor) and -Dom(geo:location,Sensor).

32 Preferences
Preferences on ontologies are result-oriented:
Consider the quality of the repair result
Ignore the impact of the repair
Popular options: prefer newest information, prefer trustable information
Preferences on deltas are more impact-oriented:
Consider the impact of the repair
Ignore the quality of the repair result
Popular options: minimize schema changes, minimize addition/deletion of information, minimize delta size
Two sides of the same coin (equivalent options).
Quality metrics can be used for stating preferences; metadata on the data may be needed.
Preferences can be qualitative or quantitative.

33 Generalizing the Approach
For one violated constraint:
1. Diagnose the invalidity
2. Determine minimal ways to resolve it
3. Determine and return the preferred resolution
For many violated constraints, the problem becomes more complicated: more than one resolution step is required. Issues:
1. Resolution order
2. When and how to filter non-preferred solutions?
3. Constraint (and resolution) interdependencies

34 Constraint Interdependencies
A given resolution may:
Cause other violations (bad)
Resolve other violations (good)
The best resolution cannot be pre-determined: it is difficult to predict the ramifications of each one, so an exhaustive search is required, namely a recursive, tree-based search (the resolution tree).
There are two ways to create the resolution tree, globally-preferred (GP) and locally-preferred (LP), differing in when and how non-preferred solutions are filtered.

35 Resolution Tree Creation (GP)
Globally-preferred (GP): find all minimal resolutions for all the violated constraints, then find the preferred ones.
Find all minimal resolutions for one violation
Explore them all
Repeat recursively until consistent
Return the preferred leaves (the preferred repairs)

36 Resolution Tree Creation (LP)
Locally-preferred (LP): find the minimal and preferred resolutions for one violated constraint, then repeat for the next.
Find all minimal resolutions for one violation
Explore the preferred one(s)
Repeat recursively until consistent
Return all remaining leaves (the preferred repair)
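
To make the GP/LP distinction concrete, here is a self-contained sketch under simplifying assumptions: an ontology is a frozenset of tagged fact tuples, the only constraint is the domain rule from the example, and the preference score is made up ("avoid schema changes, then minimize delta size"). It illustrates where each strategy filters; it is not the project's actual implementation:

```python
def violations(onto):
    """Subjects related via p that are not classified under p's domain class."""
    doms = [f for f in onto if f[0] == "Dom"]
    pins = [f for f in onto if f[0] == "P_Inst"]
    cins = {(f[1], f[2]) for f in onto if f[0] == "C_Inst"}
    return [(x, y, p, a) for (_, x, y, p) in pins for (_, dp, a) in doms
            if p == dp and (x, a) not in cins]

def resolutions(v):
    """The three minimal resolutions for one domain violation."""
    x, y, p, a = v
    return [(("remove", ("P_Inst", x, y, p)),),
            (("add", ("C_Inst", x, a)),),
            (("remove", ("Dom", p, a)),)]

def apply_delta(onto, delta):
    onto = set(onto)
    for op, fact in delta:
        onto.add(fact) if op == "add" else onto.discard(fact)
    return frozenset(onto)

def score(delta):
    """Stand-in preference: avoid schema changes, then prefer smaller deltas."""
    return (-sum(1 for _, f in delta if f[0] == "Dom"), -len(delta))

def all_leaves(onto, done=()):
    """Expand every minimal resolution recursively until no violation remains."""
    vs = violations(onto)
    if not vs:
        return [(onto, done)]
    leaves = []
    for res in resolutions(vs[0]):
        leaves += all_leaves(apply_delta(onto, res), done + res)
    return leaves

def gp(onto):
    """Globally preferred: build the whole tree, filter the leaves at the end."""
    leaves = all_leaves(onto)
    best = max(score(d) for _, d in leaves)
    return [(o, d) for o, d in leaves if score(d) == best]

def lp(onto, done=()):
    """Locally preferred: keep only the best-scoring resolution(s) per step."""
    vs = violations(onto)
    if not vs:
        return [(onto, done)]
    options = resolutions(vs[0])
    best = max(score(r) for r in options)
    leaves = []
    for res in (r for r in options if score(r) == best):
        leaves += lp(apply_delta(onto, res), done + res)
    return leaves

O0 = frozenset({("Dom", "geo:location", "Sensor"),
                ("P_Inst", "Item1", "ST1", "geo:location"),
                ("C_Inst", "Item1", "Observation"),
                ("C_Inst", "ST1", "SpatialThing")})
print([d for _, d in gp(O0)])  # the two repairs that keep the schema intact
print([d for _, d in lp(O0)])  # same here; LP can diverge on harder inputs
```

On this single-violation example GP and LP agree; the slides' point is that with many interdependent violations LP prunes early (small trees, possibly suboptimal repairs) while GP filters only at the leaves (large trees, always optimal).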

37 Comparison (GP versus LP)
Characteristics of GP:
Exhaustive
Less efficient: large resolution trees
Always returns the most preferred repairs
Insensitive to constraint syntax
Does not depend on resolution order
Characteristics of LP:
Greedy
More efficient: small resolution trees
Does not always return the most preferred repairs
Sensitive to constraint syntax
Depends on resolution order

38 Algorithm and Complexity
Detailed complexity analysis for GP/LP and various types of constraints and preferences.
This is an inherently difficult problem: exponential complexity in general. Main exception: LP is polynomial in special cases.
Theoretical complexity is misleading as to the actual performance of the algorithms.

39 Performance in Practice
Linear with respect to ontology size
Linear with respect to tree size:
Types of violated constraints (tree width)
Number of violations (tree height): causes the exponential blowup
Constraint interdependencies (tree height)
Preference (for LP): affects pruning (tree width)
Further performance improvements: use optimizations; use LP with a restrictive preference

40 Evaluation Parameters
Evaluation:
1. Effect of ontology size (for GP/LP)
2. Effect of tree size (for GP/LP)
3. Effect of violations (for GP/LP)
4. Effect of preference (relevant for LP only)
5. Quality of LP repairs
Preliminary results support our claims: linear with respect to ontology size and with respect to tree size.

41 Publications
Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Declarative Repairing Policies for Curated KBs. In Proceedings of the 10th Hellenic Data Management Symposium (HDMS-11), 2011.
Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Preference-Based Repairing of RDF/S DBs. Tentative title; to be submitted to PVLDB, January 2012.

42 Outlook
Continue refining the model based on experience with the data sets catalog
Derive “best practices checks” from the metrics
Add quality assessment results to the next release of the catalog
Collaborate with the EU-funded LOD2 project (FP7) towards Data Fusion based on the PlanetData Quality Framework
Finalize experiments for Data Repair

