1 2014-May-07

2 Outline
▫ What is the problem?
▫ What have others done?
▫ What is our solution?
▫ Does it work?

3 What is the problem?
Linked Open Data (LOD):
▫ Realizing the Semantic Web by interlinking existing but dispersed data
Main components of LOD:
▫ URIs to identify things
▫ RDF to describe data
▫ HTTP to access data

4 What is the problem?
▫ Datasets: 295
▫ Triples: over 30,000,000,000 (30 B)
▫ Links: over 500,000,000 (500 M)

5 What is the problem?
Inclusion criteria for publishing and interlinking datasets into the LOD cloud:
▫ Resolvable http/https URIs
▫ Presented in one of the standard formats of the Semantic Web (RDF, RDFa, RDF/XML, Turtle, N-Triples)
▫ Contains at least 1000 triples
▫ Connected via at least 50 RDF links to the existing datasets of LOD
▫ Accessible via RDF crawling, RDF dump, or SPARQL endpoint
Is the dataset ready to publish?
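The two numeric criteria on this slide (at least 1000 triples, at least 50 RDF links) can be checked mechanically. The sketch below is only an illustration under stated assumptions: triples are plain (subject, predicate, object) string tuples, an "RDF link" is approximated as a triple whose subject is in the dataset's own namespace and whose object is an http/https URI outside it, and the names `is_http_uri`, `count_external_links`, `meets_size_and_link_criteria` and `local_prefix` are hypothetical, not part of the presented tool. It checks URI syntax only and does not test whether URIs actually resolve.

```python
from urllib.parse import urlparse

MIN_TRIPLES = 1000  # threshold stated on the slide
MIN_LINKS = 50      # threshold stated on the slide


def is_http_uri(term: str) -> bool:
    """True if the term looks like an http/https URI (syntax check only, no network call)."""
    parsed = urlparse(term)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


def count_external_links(triples, local_prefix: str) -> int:
    """Count triples with a local subject and an http(s) object outside the local namespace.

    Assumption: this is a simplified stand-in for "RDF links to existing datasets".
    """
    return sum(
        1
        for s, _p, o in triples
        if s.startswith(local_prefix)
        and is_http_uri(o)
        and not o.startswith(local_prefix)
    )


def meets_size_and_link_criteria(triples, local_prefix: str) -> bool:
    """Check only the two numeric inclusion criteria from the slide."""
    return (
        len(triples) >= MIN_TRIPLES
        and count_external_links(triples, local_prefix) >= MIN_LINKS
    )
```

The remaining criteria (resolvable URIs, serialization format, access method) need network access or parser support and are left out of this sketch.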

6 What is the problem?
Idea of the LOD: publishing first, improving later
Results in: quality problems in the published datasets
Missing link:

7 What have others done?
Data quality in the context of LOD:
▫ General validators
▫ Parsing and syntax
▫ Accessibility / dereferenceability validators
▫ Quality assessment of published data
▫ Classifying quality problems of LOD
▫ Using metadata for quality assessment
▫ Filtering poor-quality data (WIQA)
▫ Semantic annotation using ontologies

8 What have others done?
Limitations of related work:
▫ Syntax validation, not quality evaluation
▫ Not scalable
▫ Not fully automated
▫ Evaluation after publishing

9 What is our solution?
Proposing a set of metrics for inherent quality assessment of datasets before interlinking them to the LOD cloud

10 What is our solution?

11 1. Selecting Inherent Quality Dimensions
▫ Studying data quality models
▫ Defining inherent quality of LOD
▫ Selecting the basic model (ISO-25012)
▫ Mapping quality dimensions of ISO to LOD

12 1. Selecting Inherent Quality Dimensions
Inherent quality of LOD:
▫ Interlinking
▫ Completeness
▫ Semantic Accuracy
▫ Syntax Accuracy
▫ Uniqueness
▫ Consistency

13 2. Proposing Metrics
▫ Defining metrics using GQM (Goal-Question-Metric)
▫ Implementing an automated tool
▫ Formal definition
Example:
▫ Goal: Assessment of the consistency of a dataset in the context of LOD
▫ Question: What is the degree of conflict in the context of data value?
▫ Metric: The number of functional properties with inconsistent values
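The example metric on this slide (the number of functional properties with inconsistent values) could be computed roughly as follows. This is a minimal sketch, not the authors' tool: it assumes triples are (subject, predicate, object) string tuples, that the set of functional properties is already known, and that any two distinct values for the same subject count as a conflict. The function name and the `ex:` identifiers are hypothetical.

```python
from collections import defaultdict


def inconsistent_functional_properties(triples, functional_properties):
    """Return the functional properties that give some subject more than one distinct value."""
    values = defaultdict(set)  # (subject, property) -> distinct objects seen
    for s, p, o in triples:
        if p in functional_properties:
            values[(s, p)].add(o)
    return {p for (_s, p), objs in values.items() if len(objs) > 1}


# Hypothetical data: ex:birthDate is assumed to be functional, ex:knows is not.
triples = [
    ("ex:alice", "ex:birthDate", "1980-01-01"),
    ("ex:alice", "ex:birthDate", "1981-05-05"),  # conflicts with the line above
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:knows", "ex:carol"),        # fine: ex:knows is not functional
    ("ex:bob", "ex:birthDate", "1975-03-03"),
]
functional = {"ex:birthDate"}
metric = len(inconsistent_functional_properties(triples, functional))
print(metric)  # 1
```

Treating distinct URI values as a conflict is a simplification, since under OWL semantics two URIs may denote the same resource.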

14 3. Developing LODQM
LODQM: Linked Open Data Quality Model
▫ 6 quality dimensions
▫ 32 metrics

15 4. Theoretical Validation
▫ Using a theoretical measurement framework
▫ Identifying properties of desirable metrics
▫ Validating metrics
Metric Type | Number of metrics | Null-Value | Non-Negativity | Symmetry | Monotonicity | Disjoint Module Additivity | Merging | Cohesive Modules
Complexity | 29 | √ | √ | √ | √ | n/a | _ | _
Cohesion | 2 | √ | √ | _ | √ | _ | _ | √
Coupling | 1 | √ | √ | _ | √ | n/a | √ | _

16 5. Empirical Evaluation
5.1 Selecting several real datasets from LOD
5.2 Calculation of the metrics values for datasets
5.3 Metrics interdependency study
5.4 Manipulating the quality of the datasets
5.5 Comparing the trends of metrics over two observations
5.6 Collecting experts’ subjective perception on quality dimensions
5.7 Correlation study between metrics and quality dimensions

17 5. Empirical Evaluation
√ Selecting several real datasets from LOD
▫ Calculation of the metrics values for datasets
▫ Metrics interdependency study
▫ Manipulating the quality of the datasets
▫ Comparing the trends of metrics over two observations
▫ Collecting experts’ subjective perception on quality dimensions
▫ Correlation study between metrics and quality dimensions
Datasets (columns: No. of triples, No. of instances, No. of classes, No. of properties):
FAO Water Areas 10,7305863119
Water Economic Zones 29,1931,074113127
Large Marine Ecosystems 12,0127162131
Geopolitical Entities 22,72531288101
ISSCAAP Species Classification 398,16625,2535293
Species Taxonomic Classification 319,49011,7413326
Commodities 56,4202,7881019
Vessels 4,236240622

18 5. Empirical Evaluation
√ Selecting several real datasets from LOD
√ Calculation of the metrics values for datasets
▫ Metrics interdependency study
▫ Manipulating the quality of the datasets
▫ Comparing the trends of metrics over two observations
▫ Collecting experts’ subjective perception on quality dimensions
▫ Correlation study between metrics and quality dimensions

19 5. Empirical Evaluation
√ Selecting several real datasets from LOD
√ Calculation of the metrics values for datasets
√ Metrics interdependency study
▫ Manipulating the quality of the datasets using heuristics
▫ Comparing the trends of metrics over two observations
▫ Collecting experts’ subjective perception on quality dimensions
▫ Correlation study between metrics and quality dimensions
Result: Three pairs of metrics are correlated:
▫ {IFP, Im_DT}
▫ {Im_DT, Sml_Cls}
▫ {Inc_Prp_Vlu, IF}
The others are independent.
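An interdependency study of this kind can be sketched as a pairwise correlation over the metric values measured on each dataset. The slide does not say which coefficient or cut-off was used, so the Pearson coefficient and the 0.8 threshold below are assumptions, and the metric names `m1` to `m3` and their values are made up purely for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant metric has no linear relationship to report
    return cov / sqrt(vx * vy)


def correlated_pairs(metric_values, threshold=0.8):
    """Return (metric_a, metric_b, r) for every pair with |r| >= threshold.

    metric_values maps a metric name to its values, one value per dataset.
    The threshold is an assumed cut-off, not taken from the slides.
    """
    pairs = []
    for a, b in combinations(sorted(metric_values), 2):
        r = pearson(metric_values[a], metric_values[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Hypothetical metric values over four datasets.
metric_values = {
    "m1": [1, 2, 3, 4],
    "m2": [2, 4, 6, 8],
    "m3": [5, 1, 4, 2],
}
print(correlated_pairs(metric_values))
```

Metrics that never appear in a returned pair would be treated as independent under this simplified criterion.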

20 5. Empirical Evaluation
√ Selecting several real datasets from LOD
√ Calculation of the metrics values for datasets
√ Metrics interdependency study
√ Manipulating the quality of the datasets using heuristics
▫ Comparing the trends of metrics over two observations
▫ Collecting experts’ subjective perception on quality dimensions
▫ Correlation study between metrics and quality dimensions

21 5. Empirical Evaluation
√ Selecting several real datasets from LOD
√ Calculation of the metrics values for datasets
√ Metrics interdependency study
√ Manipulating the quality of the datasets using heuristics
√ Comparing the trends of metrics over two observations
√ Collecting experts’ subjective perception on quality dimensions
▫ Correlation study between metrics and quality dimensions

22 5. Empirical Evaluation
√ Selecting several real datasets from LOD
√ Calculation of the metrics values for datasets
√ Metrics interdependency study
√ Manipulating the quality of the datasets using heuristics
√ Comparing the trends of metrics over two observations
√ Collecting experts’ subjective perception on quality dimensions
√ Correlation study between metrics and quality dimensions
Result: Only one pair of quality dimensions is correlated:
▫ {Interlinking, Syntactic accuracy}
The others are independent.

23 6. Quality Prediction
▫ Applying the PCA method to select the highly correlated metrics
▫ Developing predictive models
▫ Assessing the quality of new datasets using the models
Result: 20 out of 32 metrics are selected
Using neural network method: MultiLayerPerceptron
Dataset | No. of triples | No. of instances | Domain
Geonames | 6,590 | 699 | Geography
IMDB | 866 | 291 | Movie
Anatomy | 6,449 | 6449 | Anatomy
Citeseer | 948,770 | 173963 | Publication
FAO | 248,731 | 28,098 | Food Science
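The slide does not describe how PCA was used to pick metrics. One common heuristic, shown here only as an assumption, is to rank metrics by the size of their loading on the first principal component of the metric correlation matrix. The sketch uses plain Python and power iteration instead of a numerical library; the function names, metric names and values are hypothetical.

```python
from math import sqrt


def standardize(column):
    """Z-score one metric column; a constant column becomes all zeros."""
    n = len(column)
    mean = sum(column) / n
    std = sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std if std else 0.0 for v in column]


def correlation_matrix(columns):
    """Correlation matrix of the metric columns (one list of values per metric)."""
    z = [standardize(c) for c in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]


def first_component(matrix, iterations=200):
    """Approximate the leading eigenvector (first principal component) by power iteration."""
    k = len(matrix)
    # Non-uniform start vector; power iteration can still fail in the rare case
    # where the start is orthogonal to the leading eigenvector.
    v = [1.0 / (i + 1) for i in range(k)]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v


def rank_metrics(names, columns):
    """Sort metrics by absolute loading on the first principal component, largest first."""
    loadings = first_component(correlation_matrix(columns))
    return sorted(zip(names, (abs(x) for x in loadings)), key=lambda t: t[1], reverse=True)


# Hypothetical values of three metrics over five datasets.
names = ["m1", "m2", "m3"]
columns = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 10], [3, 1, 4, 1, 5]]
for name, loading in rank_metrics(names, columns):
    print(name, round(loading, 3))
```

A selection step would then keep the top-ranked metrics, for example those above a chosen loading cut-off. A fuller analysis would look at several components rather than only the first.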

24 6. Quality Prediction

25 Conclusion on Metrics
Definable
▫ Proposed by GQM (32)
▫ Formally defined (32)
Valid
▫ Theoretically validated (32)
Practical
▫ Implemented (32)
Correlated with quality
▫ Experts (28)
▫ Correlation study (27)
▫ PCA (20)
Predictability
▫ MLP (20)

26 Thank you for your attention and comments

