2014-May-07
Outline
▫What is the problem?
▫What have others done?
▫What is our solution?
▫Does it work?
What is the problem?
Linked Open Data (LOD):
▫Realizing the Semantic Web by interlinking existing but dispersed data
Main components of LOD:
▫URIs to identify things
▫RDF to describe data
▫HTTP to access data
Datasets: 295
Triples: over 30,000,000,000 (30 B)
Links: over 500,000,000 (500 M)
Inclusion criteria for publishing and interlinking datasets into the LOD cloud:
▫Resolvable http/https URIs
▫Presented in one of the standard formats of the Semantic Web (RDF, RDFa, RDF/XML, Turtle, N-Triples)
▫Contains at least 1000 triples
▫Connected via at least 50 RDF links to the existing datasets of LOD
▫Accessible via RDF crawling, RDF dump, or SPARQL endpoint
Is the dataset ready to publish?
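The inclusion criteria above can be read as a simple checklist. The following is a minimal sketch of such a check, not part of the presented work: the `DatasetSummary` type, its field names, and the `unmet_criteria` function are hypothetical, and the summary values are assumed to have been collected beforehand by some other means.

```python
from dataclasses import dataclass

# Formats and access methods exactly as listed in the inclusion criteria.
STANDARD_FORMATS = {"RDF", "RDFa", "RDF/XML", "Turtle", "N-Triples"}
ACCESS_METHODS = {"RDF crawling", "RDF dump", "SPARQL endpoint"}


@dataclass
class DatasetSummary:
    """Hypothetical summary of a candidate dataset (all fields are assumptions)."""
    uris_resolvable: bool      # do the http/https URIs resolve?
    data_format: str           # serialization the dataset is published in
    triple_count: int          # total number of triples
    outgoing_link_count: int   # RDF links to datasets already in the LOD cloud
    access_method: str         # how the data can be retrieved


def unmet_criteria(ds: DatasetSummary) -> list:
    """Return a description of every inclusion criterion the dataset fails."""
    problems = []
    if not ds.uris_resolvable:
        problems.append("URIs are not resolvable http/https URIs")
    if ds.data_format not in STANDARD_FORMATS:
        problems.append("not in a standard Semantic Web format")
    if ds.triple_count < 1000:
        problems.append("fewer than 1000 triples")
    if ds.outgoing_link_count < 50:
        problems.append("fewer than 50 RDF links to existing LOD datasets")
    if ds.access_method not in ACCESS_METHODS:
        problems.append("not accessible via crawling, dump, or SPARQL endpoint")
    return problems


# Made-up example: a small dataset that only misses the triple-count criterion.
candidate = DatasetSummary(True, "Turtle", 900, 60, "RDF dump")
problems = unmet_criteria(candidate)
print(problems)
```

An empty list means the dataset passes the formal criteria. The check says nothing about the quality of the data itself.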
Idea of the LOD: Publishing first, improving later
Results in: quality problems in the published datasets
Missing link:
What have others done?
Data quality in the context of LOD
Validators:
▫General validators
▫Parsing and syntax
▫Accessibility / dereferenceability
Quality assessment of published data:
▫Classifying quality problems of LOD
▫Using metadata for quality assessment
▫Filtering poor-quality data (WIQA)
▫Semantic annotation using ontologies
Limitations of related works:
▫Syntax validation, not quality evaluation
▫Not scalable
▫Not fully automated
▫Evaluation after publishing
What is our solution?
Proposing a set of metrics for inherent quality assessment of datasets before interlinking to the LOD cloud
Selecting Inherent Quality Dimensions
▫Studying data quality models
▫Defining inherent quality of LOD
▫Selecting the basic model (ISO/IEC 25012)
▫Mapping quality dimensions of ISO/IEC 25012 to LOD
Inherent Quality of LOD:
▫Interlinking
▫Completeness
▫Semantic Accuracy
▫Syntactic Accuracy
▫Uniqueness
▫Consistency
Proposing Metrics
▫Defining metrics using GQM (Goal-Question-Metric)
▫Implementing an automated tool
▫Formal definition
Example:
Goal: Assessment of the consistency of a dataset in the context of LOD
Question: What is the degree of conflict among data values?
Metric: The number of functional properties with inconsistent values
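To make the example metric concrete, here is a minimal sketch of one possible reading: a functional property is counted as inconsistent when some subject carries more than one distinct value for it. The slides do not give the formal definition, so this interpretation, the `(subject, property, value)` tuple representation, and the function name are assumptions. The sample triples are made up.

```python
from collections import defaultdict


def inconsistent_functional_properties(triples, functional_properties):
    """Return the functional properties that have conflicting values.

    triples: iterable of (subject, property, value) tuples.
    functional_properties: set of properties declared functional,
    i.e. each subject should have at most one value for them.
    """
    values = defaultdict(set)
    for subject, prop, value in triples:
        if prop in functional_properties:
            values[(subject, prop)].add(value)
    # A property is inconsistent if any subject has two or more distinct values.
    return {prop for (_, prop), objs in values.items() if len(objs) > 1}


# Made-up data: ex:birthDate is functional but ex:alice has two values for it.
triples = [
    ("ex:alice", "ex:birthDate", "1980-01-01"),
    ("ex:alice", "ex:birthDate", "1981-05-05"),
    ("ex:bob", "ex:birthDate", "1975-03-03"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:alice", "ex:knows", "ex:carol"),
]
inconsistent = inconsistent_functional_properties(triples, {"ex:birthDate"})
metric_value = len(inconsistent)
print(metric_value)
```

This sketch compares values literally. Two values that denote the same thing (for example, resources linked by `owl:sameAs`) would still be counted as a conflict.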
3. Developing LODQM
LODQM: Linked Open Data Quality Model
▫6 quality dimensions
▫32 metrics
Theoretical Validation
▫Using a theoretical measurement framework
▫Identifying properties of desirable metrics
▫Validating metrics

| Metric type | Number of metrics | Null-Value | Non-Negativity | Symmetry | Monotonicity | Disjoint Module Additivity | Merging | Cohesive Modules |
|---|---|---|---|---|---|---|---|---|
| Complexity | 29 | √ | √ | √ | √ | n/a | _ | _ |
| Cohesion | 2 | √ | √ | _ | √ | _ | _ | √ |
| Coupling | 1 | √ | √ | _ | √ | n/a | √ | _ |
Empirical Evaluation
▫Selecting several real datasets from LOD
▫Calculating the metric values for the datasets
▫Metrics interdependency study
▫Manipulating the quality of the datasets using heuristics
▫Comparing the trends of the metrics over two observations
▫Collecting experts' subjective perception of the quality dimensions
▫Correlation study between metrics and quality dimensions
Selected datasets (columns: Dataset, No. of triples, No. of instances, No. of classes, No. of properties):
FAO Water Areas 10,
Water Economic Zones 29,1931,
Large Marine Ecosystems 12,
Geopolitical Entities 22,
ISSCAAP Species Classification 398,16625,
Species Taxonomic Classification 319,49011,
Commodities 56,4202,
Vessels 4,
Result of the metrics interdependency study: three pairs of metrics are correlated:
▫{IFP, Im_DT}
▫{Im_DT, Sml_Cls}
▫{Inc_Prp_Vlu, IF}
The others are independent.
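An interdependency study of this kind can be sketched as a pairwise correlation over the metric values measured on several datasets. The slides do not state which correlation coefficient or cut-off was used, so Pearson's r and the 0.8 threshold below are assumptions. The metric names and values are made up and are not results from the study.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long, non-constant lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlated_pairs(metric_values, threshold=0.8):
    """List metric pairs whose absolute correlation reaches the threshold.

    metric_values maps a metric name to its values, one value per dataset.
    The threshold is an assumed cut-off, not one taken from the slides.
    """
    pairs = []
    for a, b in combinations(sorted(metric_values), 2):
        r = pearson(metric_values[a], metric_values[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Hypothetical metric values observed on four datasets.
metric_values = {
    "metric_a": [0.1, 0.2, 0.3, 0.4],
    "metric_b": [0.2, 0.4, 0.6, 0.8],
    "metric_c": [0.5, 0.1, 0.4, 0.2],
}
pairs = correlated_pairs(metric_values)
print([(a, b) for a, b, _ in pairs])
```

Metrics that never appear in a reported pair are treated as independent of the others under the chosen threshold.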
Result: Only one pair of quality dimensions is correlated: {Interlinking, Syntactic accuracy}. The others are independent.
Quality Prediction
▫Applying the PCA method to select the highly correlated metrics
Result: 20 out of 32 metrics are selected
▫Developing predictive models
Using a neural network method: MultiLayerPerceptron
▫Assessing the quality of new datasets using the models
Datasets (columns: Dataset, No. of triples, No. of instances, Domain):
Geonames 6, Geography
IMDB Movie
Anatomy 6, Anatomy
Citeseer 948, Publication
FAO 248,731 28,098 Food Science
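The PCA-based selection step can be illustrated with a small sketch. The slides do not describe how PCA was applied, so the following is only one plausible reading: compute the first principal component of the standardized metrics and keep the metrics whose absolute loading reaches a cut-off. The function names, the 0.5 cut-off, the use of power iteration, and the data are all assumptions.

```python
import math


def standardize(column):
    """Center a column and scale it to unit sample variance (must not be constant)."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / std for x in column]


def correlation_matrix(columns):
    """Correlation matrix of the metrics (covariance of standardized columns)."""
    z = [standardize(c) for c in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]


def first_component(matrix, iterations=500):
    """Dominant eigenvector via power iteration, i.e. first principal component.

    Starts from a uniform vector, which is assumed not to be orthogonal
    to the dominant eigenvector.
    """
    k = len(matrix)
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def select_metrics(metric_values, min_loading=0.5):
    """Keep metrics whose absolute loading on the first component is large enough."""
    names = sorted(metric_values)
    loadings = first_component(correlation_matrix([metric_values[n] for n in names]))
    return [n for n, w in zip(names, loadings) if abs(w) >= min_loading]


# Hypothetical metric values, one value per dataset.
metric_values = {
    "metric_a": [1, 2, 3, 4, 5],
    "metric_b": [2, 4, 6, 8, 10],
    "metric_c": [5, 1, 4, 2, 3],
}
selected = select_metrics(metric_values)
print(selected)
```

In practice a library implementation of PCA and of the multilayer perceptron would be used. The pure-Python version here only keeps the example self-contained.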
Conclusion on Metrics
▫Definable: Proposed by GQM (32), Formally defined (32)
▫Valid: Theoretically validated (32)
▫Practical: Implemented (32)
▫Correlated with quality: Experts (28), Correlation study (27), PCA (20)
▫Predictability: MLP (20)
Thank you for your attention and comments