Slide 1 Data Repairing
Giorgos Flouris, FORTH
December 11-12, 2012, Luxembourg
Slide 2 Structure
Part I: problem statement and proposed solution (D2.2)
◦ Sketch (also presented in the previous review)
Part II: complexity analysis and performance evaluation (D2.2)
◦ Shows scalability and performance properties
◦ Improved compared to D2.2
Part III: application of repairing in a real setting (D4.4)
◦ Result of collaboration between partners/WPs
◦ Shows applicability; experimentation with real-world data in a real setting
Slide 3 PART I: Problem Statement and Proposed Solution (D2.2)
Slide 4 Validity as a Quality Indicator
Validity is an important quality indicator
◦ Encodes context- or application-specific requirements
◦ Applications may be useless over invalid data
◦ Binary concept (valid/invalid)
Two steps to guarantee validity:
1. Identifying invalid ontologies (diagnosis)
   Detecting invalidities in an automated manner
   Subtask of Quality Assessment
2. Removing invalidities (repair)
   Repairing invalidities in an automated manner
   Subtask of Quality Enhancement
Slide 5 Main Idea
Expressing validity using validity rules over an adequate relational schema, e.g.:
◦ Properties must have a unique domain
  ∀p Prop(p) → ∃a Dom(p,a)
  ∀p,a,b Dom(p,a) ∧ Dom(p,b) → (a=b)
◦ Correct classification in property instances
  ∀x,y,p,a P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a)
  ∀x,y,p,a P_Inst(x,y,p) ∧ Rng(p,a) → C_Inst(y,a)
Syntactical manipulations on the rules allow:
◦ Diagnosis (reduced to relational queries; see the sketch below)
◦ Repair (identify repairing options per violation)
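To make the reduction of diagnosis to relational queries concrete, here is a minimal sketch of the two rules above evaluated over a relational view of an ontology. The relation names (Prop, Dom, Rng, P_Inst, C_Inst) follow the slide; the data layout and function names are illustrative assumptions, not the D2.2 implementation.

# Minimal diagnosis sketch: each validity rule becomes a query over tuple sets.
ontology = {
    "Prop":   {("geo:location",)},
    "Dom":    {("geo:location", "Sensor")},
    "Rng":    {("geo:location", "SpatialThing")},
    "P_Inst": {("Item1", "ST1", "geo:location")},
    "C_Inst": {("Item1", "Observation"), ("ST1", "SpatialThing")},
}

def violations_unique_domain(o):
    """forall p,a,b: Dom(p,a) and Dom(p,b) -> a = b"""
    return [(p, a, b) for (p, a) in o["Dom"] for (q, b) in o["Dom"]
            if p == q and a != b]

def violations_domain_classification(o):
    """forall x,y,p,a: P_Inst(x,y,p) and Dom(p,a) -> C_Inst(x,a)"""
    return [(x, y, p, a) for (x, y, p) in o["P_Inst"] for (q, a) in o["Dom"]
            if p == q and (x, a) not in o["C_Inst"]]

print(violations_domain_classification(ontology))
# -> [('Item1', 'ST1', 'geo:location', 'Sensor')]: Item1 is not classified as a Sensor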
Slide 6 Preferences for Repair
Which repairing option is best?
◦ The ontology engineer determines that via preferences
Preferences
◦ Specified by the ontology engineer beforehand
◦ High-level “specifications” for the ideal repair
◦ Serve as “instructions” to determine the preferred (optimal) solution
Slide 7 Preferences (On Ontologies)
[Figure: the original ontology O0 and three candidate repaired ontologies O1, O2, O3, scored directly (Score: 3, 4, 6)]
Slide 8 Preferences (On Deltas)
[Figure: the original ontology O0 and three candidate repaired ontologies O1, O2, O3, scored through their deltas (Score: 2, 4, 5), e.g. -P_Inst(Item1,ST1,geo:location), +C_Inst(Item1,Sensor), -Dom(geo:location,Sensor)]
Slide 9 Preferences
Preferences on ontologies are result-oriented
◦ Consider the quality of the repair result
◦ Ignore the impact of repair
◦ Popular options: prefer newest/trustable information, prefer a specific ontological structure
Preferences on deltas are impact-oriented
◦ Consider the impact of repair
◦ Ignore the quality of the repair result
◦ Popular options: minimize schema changes, minimize addition/deletion of information, minimize delta size (see the sketch below)
Properties of preferences
◦ Preferences on ontologies/deltas are equivalent
◦ Quality metrics can be used for stating preferences
◦ Metadata on the data can be used (e.g., provenance)
◦ Can be qualitative or quantitative
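As an illustration of a quantitative, impact-oriented preference on deltas, the sketch below penalises schema-level changes more than data-level ones. The weights, the predicate classification and the tie-breaking are assumptions made for the example, not the system's built-in preferences.

# Illustrative delta-scoring preference: schema changes cost more than data changes.
SCHEMA_PREDICATES = {"Dom", "Rng", "C_Sub", "P_Sub"}

def delta_cost(delta, schema_weight=3, data_weight=1):
    """delta: iterable of (op, predicate, args) with op in {'+', '-'}."""
    cost = 0
    for op, predicate, args in delta:
        cost += schema_weight if predicate in SCHEMA_PREDICATES else data_weight
    return cost

# The three candidate resolutions from the geo:location example:
candidates = [
    [("-", "P_Inst", ("Item1", "ST1", "geo:location"))],
    [("+", "C_Inst", ("Item1", "Sensor"))],
    [("-", "Dom", ("geo:location", "Sensor"))],
]
best = min(candidates, key=delta_cost)   # preferred resolution (ties broken arbitrarily here)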
Slide 10 Generalizing the Approach
For one violated rule:
1. Diagnose the invalidity
2. Determine the minimal ways to resolve it
3. Determine and return the preferred (optimal) resolution
For many violated rules:
◦ The problem becomes more complicated
◦ More than one resolution step is required
Issues:
1. Resolution order
2. When and how to filter non-optimal solutions?
3. Rule (and resolution) interdependencies
Slide 11 Rule Interdependencies
A given resolution may:
◦ Cause other violations (bad)
◦ Resolve other violations (good)
Optimal resolution unknown a priori
◦ Cannot predict a resolution’s ramifications
◦ Exhaustive, recursive search required (resolution tree)
Two ways to create the resolution tree
◦ Globally-optimal (GO) / locally-optimal (LO)
◦ Differ in when and how non-optimal solutions are filtered
Slide 12 Resolution Tree Creation (GO)
Find all minimal resolutions for all the violated rules, then find the optimal ones
Globally-optimal (GO):
◦ Find all minimal resolutions for one violation
◦ Explore them all
◦ Repeat recursively until valid
◦ Return the optimal leaves
[Figure: resolution tree; the optimal leaves are the repairs returned]
Slide 13 Resolution Tree Creation (LO)
Find the minimal and optimal resolutions for one violated rule, then repeat for the next
Locally-optimal (LO):
◦ Find all minimal resolutions for one violation
◦ Explore only the optimal one(s)
◦ Repeat recursively until valid
◦ Return all remaining leaves (see the sketch below)
[Figure: pruned resolution tree; the optimal repair is returned]
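A schematic sketch of the two tree-construction strategies follows. The helpers find_violation, minimal_resolutions, apply and cost stand in for the machinery of the previous slides and are assumptions of this sketch, not a concrete implementation.

# GO explores every minimal resolution; LO prunes to the locally optimal option(s).
def repair(ontology, find_violation, minimal_resolutions, apply, cost, locally_optimal):
    leaves = []

    def expand(node):
        violation = find_violation(node)
        if violation is None:            # node is valid: a candidate repair (leaf)
            leaves.append(node)
            return
        options = minimal_resolutions(node, violation)
        if locally_optimal:              # LO: keep only the preferred option(s) at this node
            best = min(cost(apply(node, o)) for o in options)
            options = [o for o in options if cost(apply(node, o)) == best]
        for option in options:           # GO: explore all of them
            expand(apply(node, option))

    expand(ontology)
    if locally_optimal:
        return leaves                    # LO: return all remaining leaves
    best = min(cost(leaf) for leaf in leaves)
    return [leaf for leaf in leaves if cost(leaf) == best]   # GO: return the optimal leaves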
Slide 14 Comparison (GO versus LO)
Characteristics of GO
◦ Exhaustive
◦ Less efficient: large resolution trees
◦ Always returns optimal repairs
◦ Insensitive to rule syntax
◦ Does not depend on resolution order
Characteristics of LO
◦ Greedy
◦ More efficient: small resolution trees
◦ Does not always return optimal repairs
◦ Sensitive to rule syntax
◦ Depends on resolution order
Slide 15 PART II: Complexity Analysis and Performance Evaluation (D2.2)
Slide 16 Complexity Analysis
Detailed complexity analysis for GO/LO and various types of rules and preferences
An inherently difficult problem
◦ Exponential complexity (in general)
◦ Exception: LO is polynomial (in special cases)
Theoretical complexity is misleading as to the actual performance of the algorithms in practice
Slide 17 Performance in Practice
Performance in practice
◦ Linear with respect to ontology size
◦ Linear with respect to tree size
Factors determining the tree size:
  Types of violated rules (tree width)
  Number of violations (tree height) – causes the exponential blowup
  Rule interdependencies (tree height)
  Preference (for LO): affects pruning (tree width)
Further performance improvement
◦ Use optimizations
◦ Use LO with a restrictive preference
Slide 18 Effect of Ontology Size (logscale)
[Chart: effect of ontology size, log scale]
Slide 19 Effect of Tree Size (GO)
[Chart: tree size measured in nodes (×10^6)]
Slide 20 Effect of Tree Size (LO)
Slide 21 Effect of Violations (GO)
Slide 22 Effect of Violations (LO)
Slide 23 Effect of Preference (LO) (logscale)
Slide 24 Quality of LO Repairs
[Chart: quality of LO repairs; series labelled CCD, Max, Min]
Slide 25 PART III: Application of Repairing in a Real Setting (D4.4)
Slide 26 Objectives and Main Idea
Repair real datasets using preferences based on metadata
Purpose:
◦ WP2: evaluate repairing in a real LOD setting
◦ WP3: evaluate the usefulness of provenance, recency, etc. as preferences for repair
◦ WP4: validate the utility of WP4 resources for a data quality benchmark
Slide 27 Motivating Scenario
User seeks information on Brazilian cities
◦ Fuses Wikipedia dumps from various languages
Guarantees maximal coverage, but may lead to conflicts
◦ E.g., cities with two different population counts
Use repair to eliminate such conflicts
◦ Using our repairing method
◦ Using adequate preferences based on metadata
[Figure: the fused language editions (EN, PT, ES, FR, GE)]
Slide 28 Experimental Setting
Input
◦ Fused 5 Wikipedias: EN, PT, SP, GE, FR
◦ Distilled information about three properties of Brazilian cities: populationTotal, areaTotal, foundingDate
Repair parameters
◦ Validity rules: all properties must be functional (see the sketch below)
◦ Preferences: 5 preferences based on metadata
Evaluation
◦ Quality of the result along 5 dimensions: consistency, validity, conciseness, completeness, accuracy
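To illustrate the "all properties must be functional" validity rule over the fused data, the sketch below flags every (subject, property) pair that carries more than one value as a conflict to be repaired. The triple layout and the meta field are assumptions for the example.

# Conflict detection for functional properties in the fused dataset (illustrative).
from collections import defaultdict

def functional_violations(triples):
    """triples: iterable of (subject, property, value, meta); meta carries source/recency."""
    values = defaultdict(set)
    for s, p, o, meta in triples:
        values[(s, p)].add(o)
    return {sp: vals for sp, vals in values.items() if len(vals) > 1}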
Slide 29 Preferences (1/2)
1. PREFER_PT: resolve conflicts based on source (PT > EN > SP > GE > FR)
2. PREFER_RECENT: resolve conflicts based on recency (the most recent data is preferred)
3. PLAUSIBLE_PT: drop “irrational” data (population < 500, area < 300 km², founding date < 1500 AD); resolve remaining conflicts based on source
Slide 30 Preferences (2/2)
4. WEIGHTED_RECENT: resolve conflicts based on recency, but if the conflicting records are almost equally recent (less than 3 months apart), resolve based on source (see the sketch below)
5. CONDITIONAL_PT: resolve conflicts based on source, but change the source order depending on the data (prefer PT for small cities with population < 500,000, prefer EN for the rest)
Slide 31 Consistency, Validity
Consistency
◦ Lack of conflicting triples
◦ Guaranteed to be perfect (by the repairing algorithm), regardless of preference
Validity
◦ Lack of rule violations
◦ Coincides with consistency for this example
◦ Guaranteed to be perfect (by the repairing algorithm), regardless of preference
Slide 32 Conciseness, Completeness
Conciseness
◦ No duplicates in the final result
◦ Guaranteed to be perfect (by the fusion process), regardless of preference
Completeness
◦ Coverage of information
◦ Improved by fusion
◦ Unaffected by the repairing algorithm
◦ Input completeness = output completeness, regardless of preference
◦ Measured to be 77.02%
Slide 33 Accuracy
Most important metric for this experiment
Accuracy
◦ Closeness to the “actual state of affairs”
◦ Affected by the repairing choices
Compared the repair with the Gold Standard
◦ Taken from an official and independent data source (IBGE)
Slide 34 Accuracy Examples
City of Aracati
◦ Population: 69159 / 69616 (conflicting)
◦ Record in Gold Standard: 69159
◦ Good choice: 69159
◦ Bad choice: 69616
City of Oiapoque
◦ Population: 20226 / 20426 (conflicting)
◦ Record in Gold Standard: 20509
◦ Optimal approximation choice: 20426
◦ Sub-optimal approximation choice: 20226
Slide 35 Accuracy Results
Slide 36 Accuracy of Input and Output
Slide 37 Publications
Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Declarative Repairing Policies for Curated KBs. In Proceedings of the 10th Hellenic Data Management Symposium (HDMS-11), 2011.
Giorgos Flouris, Yannis Roussakis, Maria Poveda-Villalon, Pablo N. Mendes, Irini Fundulaki. Using Provenance for Quality Assessment and Repair in Linked Open Data. In Proceedings of the Joint Workshop on Knowledge Evolution and Ontology Dynamics (EvoDyn-12), 2012.
Yannis Roussakis, Giorgos Flouris, Vassilis Christophides. Preference-Based Repairing of RDF(S) DBs. Under review in the TODS journal.
Slide 38 BACKUP SLIDES
Slide 39 Repair
Removing invalidities by changing the ontology in an adequate manner
General concerns:
1. Return a valid ontology – a strict requirement
2. Minimize the impact of repair upon the data – make minor, targeted modifications that repair the ontology without changing it too much
3. Return a “good” repair – emulate the changes that the ontology engineer would make to repair the ontology
Slide 40 Inference
Inference is expressed using validity rules
Example:
◦ Transitivity of class subsumption (a naive sketch follows below)
  ∀a,b,c C_Sub(a,b) ∧ C_Sub(b,c) → C_Sub(a,c)
In practice we use labeling algorithms
◦ Avoid explicitly storing the inferred knowledge
◦ Improve the efficiency of reasoning
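The fixed-point loop below applies the class-subsumption transitivity rule until saturation. It is only meant to illustrate the rule itself; as the slide notes, the actual system avoids materialising inferred knowledge by using labeling algorithms instead.

# Naive saturation of C_Sub under the transitivity rule (illustrative only).
def transitive_closure(c_sub):
    closure = set(c_sub)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))          # C_Sub(a,b) and C_Sub(b,d) imply C_Sub(a,d)
                    changed = True
    return closure

print(transitive_closure({("A", "B"), ("B", "C")}))   # -> includes ('A', 'C')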
Slide 41 Example (Diagnosis/Repair)
Ontology O0:
◦ Schema: Class(Sensor), Class(SpatialThing), Class(Observation), Prop(geo:location), Dom(geo:location,Sensor), Rng(geo:location,SpatialThing)
◦ Data: Inst(Item1), Inst(ST1), P_Inst(Item1,ST1,geo:location), C_Inst(Item1,Observation), C_Inst(ST1,SpatialThing)
Violated rule (correct classification in property instances):
  ∀x,y,p,a P_Inst(x,y,p) ∧ Dom(p,a) → C_Inst(x,a)
Diagnosis: Sensor is the domain of geo:location, but Item1 is not a Sensor
◦ P_Inst(Item1,ST1,geo:location) ∈ O0, Dom(geo:location,Sensor) ∈ O0, C_Inst(Item1,Sensor) ∉ O0
Repair options:
◦ Remove P_Inst(Item1,ST1,geo:location)
◦ Add C_Inst(Item1,Sensor)
◦ Remove Dom(geo:location,Sensor)
[Figure: the schema (Sensor, SpatialThing, Observation, geo:location) and the data (Item1 geo:location ST1)]
Slide 42 Quality Assessment
Quality = “fitness for use”
◦ Multi-dimensional, multi-faceted, context-dependent
Methodology for quality assessment (see the sketch below):
◦ Dimensions: aspects of quality (accuracy, completeness, timeliness, …)
◦ Indicators: metadata values for measuring dimensions (e.g., last modification date, related to timeliness)
◦ Scoring Functions: functions that quantify quality indicators (e.g., days since the last modification date)
◦ Metrics: measures of dimensions (the result of a scoring function); can be combined
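A sketch of the indicator / scoring-function / metric pipeline for the timeliness dimension, using the slide's example of days since the last modification date. The reference date and the one-year normalisation horizon are assumptions for the example.

# Timeliness: indicator = last modification date, scoring function = days since it.
from datetime import date

def days_since_modification(last_modified, today=date(2012, 12, 11)):
    """Scoring function over the 'last modification date' indicator."""
    return (today - last_modified).days

def timeliness_metric(last_modified, horizon_days=365):
    """Map the score into [0, 1]: 1 = modified today, 0 = a year old or more."""
    return max(0.0, 1.0 - days_since_modification(last_modified) / horizon_days)

print(timeliness_metric(date(2012, 10, 1)))   # -> roughly 0.81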
Slide 43 Accuracy Evaluation
[Figure: en.dbpedia, pt.dbpedia, fr.dbpedia, … are fused and repaired into the integrated data (dbpedia:areaTotal, dbpedia:populationTotal, dbpedia:foundingDate), which is compared on the same properties against the Gold Standard from the Instituto Brasileiro de Geografia e Estatística (IBGE) to measure accuracy]