A Statistical and Schema Independent Approach to Identify Equivalent Properties on Linked Data † Kno.e.sis Center Wright State University Dayton OH, USA.

Slides:



Advertisements
Similar presentations
Leveraging Data and Structure in Ontology Integration Octavian Udrea 1 Lise Getoor 1 Renée J. Miller 2 1 University of Maryland College Park 2 University.
Advertisements

Leveraging Community-built Knowledge For Type Coercion In Question Answering Aditya Kalyanpur, J William Murdock, James Fan and Chris Welty Mehdi AllahyariSpring.
Reducing the Cost of Validating Mapping Compositions by Exploiting Semantic Relationships Eduard C. Dragut Ramon Lawrence Eduard C. Dragut Ramon Lawrence.
Linked Sensor Data Harshal Patni, Cory Henson, Amit P. Sheth Ohio Center of Excellence in Knowledge enabled Computing (Kno.e.sis) Wright State University,
Search Engines and Information Retrieval
NaLIX: A Generic Natural Language Search Environment for XML Data Presented by: Erik Mathisen 02/12/2008.
1/55 EF 507 QUANTITATIVE METHODS FOR ECONOMICS AND FINANCE FALL 2008 Chapter 10 Hypothesis Testing.
Anatomy: Simple and Effective Privacy Preservation Israel Chernyak DB Seminar (winter 2009)
Using Maximal Embedded Subtrees for Textual Entailment Recognition Sophia Katrenko & Pieter Adriaans Adaptive Information Disclosure project Human Computer.
Experimental Evaluation
CBLOCK: An Automatic Blocking Mechanism for Large-Scale Deduplication Tasks Ashwin Machanavajjhala Duke University with Anish Das Sarma, Ankur Jain, Philip.
Standard Error of the Mean
Natural Language Processing Expectation Maximization.
India Research Lab Auto-grouping s for Faster eDiscovery Sachindra Joshi, Danish Contractor, Kenney Ng*, Prasad M Deshpande, and Thomas Hampp* IBM.
OMAP: An Implemented Framework for Automatically Aligning OWL Ontologies SWAP, December, 2005 Raphaël Troncy, Umberto Straccia ISTI-CNR
Copyright © 2010 Accenture All Rights Reserved. Accenture, its logo, and High Performance Delivered are trademarks of Accenture. Multiple Ontologies in.
Semantic Interoperability Jérôme Euzenat INRIA & LIG France Natasha Noy Stanford University USA.
Ontology Matching Basics Ontology Matching by Jerome Euzenat and Pavel Shvaiko Parts I and II 11/6/2012Ontology Matching Basics - PL, CS 6521.
Basic Statistics Michael Hylin. Scientific Method Start w/ a question Gather information and resources (observe) Form hypothesis Perform experiment and.
«Tag-based Social Interest Discovery» Proceedings of the 17th International World Wide Web Conference (WWW2008) Xin Li, Lei Guo, Yihong Zhao Yahoo! Inc.,
Discovering Interesting Subsets Using Statistical Analysis Maitreya Natu and Girish K. Palshikar Tata Research Development and Design Centre (TRDDC) Pune,
Querying RDF Data with Text Annotated Graphs Lushan Han, Tim Finin, Anupam Joshi and Doreen Cheng SSDBM’15 
Web Usage Mining with Semantic Analysis Date: 2013/12/18 Author: Laura Hollink, Peter Mika, Roi Blanco Source: WWW’13 Advisor: Jia-Ling Koh Speaker: Pei-Hao.
Ontology Alignment/Matching Prafulla Palwe. Agenda ► Introduction  Being serious about the semantic web  Living with heterogeneity  Heterogeneity problem.
 Copyright 2006 Digital Enterprise Research Institute. All rights reserved. Collaborative Building of Controlled Vocabularies Crosswalks Mateusz.
Krishnaprasad Thirunarayan, Pramod Anantharam, Cory A. Henson, and Amit P. Sheth Kno.e.sis Center, Ohio Center of Excellence on Knowledge-enabled Computing,
More About Significance Tests
Information Systems: Modelling Complexity with Categories Four lectures given by Nick Rossiter at Universidad de Las Palmas de Gran Canaria, 15th-19th.
Ahsanul Haque *, Swarup Chandra *, Latifur Khan * and Charu Aggarwal + * Department of Computer Science, University of Texas at Dallas + IBM T. J. Watson.
Ontology Alignment for Linked Open Data – ISWC2010 research track Prateek Jain Pascal Hitzler Amit Sheth Kno.e.sis Center Wright State University, Dayton,
Mining and Analysis of Control Structure Variant Clones Guo Qiao.
Querying Structured Text in an XML Database By Xuemei Luo.
Semantic Enrichment of Ontology Mappings: A Linguistic-based Approach Patrick Arnold, Erhard Rahm University of Leipzig, Germany 17th East-European Conference.
21/11/2002 The Integration of Lexical Knowledge and External Resources for QA Hui YANG, Tat-Seng Chua Pris, School of Computing.
Metadata. Generally speaking, metadata are data and information that describe and model data and information For example, a database schema is the metadata.
Efficiently Computed Lexical Chains As an Intermediate Representation for Automatic Text Summarization H.G. Silber and K.F. McCoy University of Delaware.
Stefan Mutter, Mark Hall, Eibe Frank University of Freiburg, Germany University of Waikato, New Zealand The 17th Australian Joint Conference on Artificial.
Wikipedia as Sense Inventory to Improve Diversity in Web Search Results Celina SantamariaJulio GonzaloJavier Artiles nlp.uned.es UNED,c/Juan del Rosal,
Lecture 4: Statistics Review II Date: 9/5/02  Hypothesis tests: power  Estimation: likelihood, moment estimation, least square  Statistical properties.
Shridhar Bhalerao CMSC 601 Finding Implicit Relations in the Semantic Web.
1 Masters Thesis Presentation By Debotosh Dey AUTOMATIC CONSTRUCTION OF HASHTAGS HIERARCHIES UNIVERSITAT ROVIRA I VIRGILI Tarragona, June 2015 Supervised.
1 Generating Comparative Summaries of Contradictory Opinions in Text (CIKM09’)Hyun Duk Kim, ChengXiang Zhai 2010/05/24 Yu-wen,Hsu.
Adjudicator Agreement and System Rankings for Person Name Search Mark Arehart, Chris Wolf, Keith Miller The MITRE Corporation {marehart, cwolf,
Acquisition of Categorized Named Entities for Web Search Marius Pasca Google Inc. from Conference on Information and Knowledge Management (CIKM) ’04.
Modeling Security-Relevant Data Semantics Xue Ying Chen Department of Computer Science.
Estimation by Intervals Confidence Interval. Suppose we wanted to estimate the proportion of blue candies in a VERY large bowl. We could take a sample.
Jump to first page Inferring Sample Findings to the Population and Testing for Differences.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
A Study in Hadoop Streaming with Matlab for NMR data processing Kalpa Gunaratna1, Paul Anderson2, Ajith Ranabahu1 and Amit Sheth1 1Ohio Center of Excellence.
The inference and accuracy We learned how to estimate the probability that the percentage of some subjects in the sample would be in a given interval by.
GoRelations: an Intuitive Query System for DBPedia Lushan Han and Tim Finin 15 November 2011
Question Answering Passage Retrieval Using Dependency Relations (SIGIR 2005) (National University of Singapore) Hang Cui, Renxu Sun, Keya Li, Min-Yen Kan,
Random Sampling Algorithms with Applications Kyomin Jung KAIST Aug ERC Workshop.
Gleaning Types for Literals in RDF Triples with Application to Entity Summarization 1 Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis),
Genetic Algorithm. Outline Motivation Genetic algorithms An illustrative example Hypothesis space search.
EmojiNet: A Machine Readable Emoji Sense Inventory
SIMILARITY SEARCH The Metric Space Approach
Query Rewriting Framework for Spatial Data
Hypothesis Testing and Confidence Intervals (Part 1): Using the Standard Normal Lecture 8 Justin Kern October 10 and 12, 2017.
Learning Software Behavior for Automated Diagnosis
Collective Network Linkage across Heterogeneous Social Platforms
A Schema and Instance Based RDF Dataset Summarization Tool
Extracting Semantic Concept Relations
Rui Yang, Wei Hu and Yuzhong Qu
Enriching Structured Knowledge with Open Information
[jws13] Evaluation of instance matching tools: The experience of OAEI
DATABASE HISTOGRAMS E0 261 Jayant Haritsa
Filtering Properties of Entities By Class
Template-based Question Answering over RDF Data
Simulation Berlin Chen
Presentation transcript:

A Statistical and Schema Independent Approach to Identify Equivalent Properties on Linked Data † Kno.e.sis Center Wright State University Dayton OH, USA ‡ IBM T J Watson Research Center Yorktown Heights New York NY, USA Kalpa Gunaratna †, Krishnaprasad Thirunarayan †, Prateek Jain ‡, Amit Sheth †, and Sanjaya Wijeratne † iSemantics 2013, Graz, Austria

09/06/20132 Motivation Why we need property alignment and it is so important? iSemantics 2013

09/06/20133 Many datasets. We can query! Same information in different names. Therefore, data integration for better presentation is required. iSemantics 2013

 Background  Statistical Equivalence of properties  Evaluation  Discussion, interesting facts, and future directions  Conclusion 09/06/20134 Roadmap iSemantics 2013

 Existing techniques for property alignment fall into three categories. I.Syntactic/dictionary based – Uses string manipulation techniques, external dictionaries and lexical databases like WordNet. II.Schema dependent – Uses schema information such as, domain and range, definitions. III.Schema independent – Uses instance level information for the alignment.  Our approach falls under schema independent. 09/06/20135 Background iSemantics 2013

 Properties capture meaning of triples and hence they are complex in nature.  Syntactic or dictionary based approaches analyze property names for equivalence. But in LOD, name heterogeneities exist.  Therefore, syntactic or dictionary based approaches have limited coverage in property alignment.  Schema dependent approaches including processing domain and range, class level tags do not capture semantics of properties well. 09/06/20136iSemantics 2013

 Background  Statistical Equivalence of properties  Evaluation  Discussion, interesting facts, and future directions  Conclusion 09/06/20137 Roadmap iSemantics 2013

 Statistical Equivalence is based on analyzing owl:equivalentProperty.  owl:equivalentProperty - properties that have same property extensions. Example 1: Property P is defined by the triples, { a P b, c P d, e P f } Property Q is defined by the triples, { a Q b, c Q d, e Q f } P and Q are owl:equivalentProperty, because they have the same extension, { {a,b}, {c,d}, {e,f} } Example 2: Property P is defined by the triples, { a P b, c P d, e P f } Property Q is defined by the triples, { a Q b, c Q d, e Q h } Then, P and Q are not owl:equivalentProperty, because their extensions are not the same. But they provide statistical evidence in support of equivalence. 09/06/20138 Statistical Equivalence of properties iSemantics 2013

Intuition  Higher rate of subject-object matches in extensions leads to equivalent properties. In practice, it is hard to have exact same extensions for matching properties. Because, – Datasets are incomplete. – Same instance may be modelled differently in different datasets.  Therefore, we analyze the property extensions to identify equivalent properties between datasets.  We define the following notions. Let the statement below be true for all the definitions. S 1 P 1 O 1 and S 2 P 2 O 2 be two triples in Dataset D 1 and D 2 respectively. 09/06/2013iSemantics 20139

09/06/2013iSemantics

09/06/2013iSemantics

09/06/2013iSemantics I1I1 I2I2 I2I2 matching resources owl:sameAs P 1 =d1:doctoralStudent P 2 =d2:education. academic.advisees Dataset 2Dataset 1 property P 1 and property P 2 are a candidate match d2:theodore_harold_maiman I 1 =d1:Willis_Lamb I 2 =d2:willis_lamb I1I1 I1I1 d1:Theodore_Maiman triple 1 triple 2 triple 3 triple 4 triple 5 Step 1 Step 2 Step 3 Candidate Matching Algorithm Process

09/06/2013iSemantics Complexity: If the average number of properties for an entity is x and for each property, average number of objects is j. For n subjects, it requires n*j 2 *x 2 +2n comparisons. Since n > j, n > x, and x and j are independent of n, O(n).

Example: 09/06/2013iSemantics

09/06/2013iSemantics

 Background  Statistical Equivalence of properties  Evaluation  Discussion, interesting facts, and future directions  Conclusion 09/06/ Roadmap iSemantics 2013

 Objectives of the evaluation – Show the effectiveness of the approach in linked datasets – Compare with existing aligning techniques  We selected 5000 instance samples from DBpedia, Freebase, LinkedMDB, DBLP L3S, and DBLP RKB Explorer datasets.  These datasets have, – Complete data for instances in different viewpoints – Many inter-links – Complex properties 09/06/2013iSemantics Evaluation

 Experiment details – α = 0.5 for all experiments (works for LOD) except DBpedia and Freebase movie alignment where it was 0.7. – k was set as 14, 6, 2, 2, and 2 respectively for Person, Film and Software between DBpedia and Freebase, Film between LinkedMDB and DBpedia, and article between DBLP datasets. – k can be estimated using the data as follows, – Set α = 0.5 and k = 2 (lowest positive values). – Get exact matching property (property names) pairs not identified by the algorithm and their μ – Get the average of those μ values – 0.92 for string similarity algorithms. – 0.8 for WordNet similarity. 09/06/2013iSemantics

Measure type DBpedia – Freebase (Person) DBpedia – Freebase (Film) DBpedia – Freebase (Software) DBpedia – LinkedMDB (Film) DBLP_RKB – DBLP_L3S (Article) Average Extension Based Algorithm Precision Recall0.8089* F measure0.8410* WordNet Similarity Precision Recall0.4140* F measure0.4609* Dice Similarity Precision Recall0.4777* F measure0.6000* Jaro Similarity Precision Recall0.5350* F measure0.5978* /06/2013iSemantics Alignment results * Marks estimated values for experiment 1 because of very large comparisons to check manually. Boldface marks highest result for each experiment.

 Example identifications 09/06/2013iSemantics Property pair types Dataset 1 (DBpedia)Dataset 2 (Freebase) Simple string similarity matches db:nationalityfb:nationality db:religionfb:religion Synonymous matches db:occupationfb:profession db:battlesfb:participated_in_conflicts Complex matchesdb:screenplayfb:written_by db:doctoralStudentfb:advisees WordNet similarity failed to identify any of these

 Background  Statistical Equivalence of properties  Evaluation  Discussion, interesting facts, and future directions  Conclusion 09/06/ Roadmap iSemantics 2013

 Our experiment covered multi-domain to multi-domain, multi- domain to specific domain and specific-domain to specific-domain dataset property alignment.  In every experiment, the extension based algorithm outperformed others (F measure). F measure gain is in the range of 57% to 78%.  Some properties that are identified are intentionally different, e.g., db:distributor vs fb:production_companies. – This is because many companies produce and also distribute their films.  Some identified pairs are incorrect due to errors in data modeling. – For example, db:issue and fb:children.  owl:sameAs linking issues in LOD (not linking exact same thing), e.g., linking London and Greater London. – We believe few misused links wont affect the algorithm as it decides on a match after analyzing many matches for a pair. 09/06/2013iSemantics Discussion, interesting facts, and future directions

 Less number of interlinks. – Evolve over time. – Look for possible other types of ECR links (i.e., rdf:seeAlso).  Properties do not have uniform distribution in a dataset. – Hence, some properties do not have enough matches or appearances. – This is due to rare classes and domains they belong to. – We can run the algorithm on instances that these less frequent properties appear iteratively.  Current limitations, – Requires ECR links – Requires overlapping datasets – Object-type properties – Inability to identify property – sub property relationships 09/06/2013iSemantics

 Background  Statistical Equivalence of properties  Evaluation  Discussion, interesting facts, and future directions  Conclusion 09/06/ Roadmap iSemantics 2013

 We approximate owl:equivalentProperty using Statistical Equivalence of properties by analyzing property extensions, which is schema independent.  This novel extension based approach works well with interlinked datasets.  The extension based approach outperforms syntax or dictionary based approaches. F measure gain in the range of 57% - 78%.  It requires many comparisons, but can be easily parallelized evidenced by our Map-Reduce implementation. 09/06/2013iSemantics Conclusion

26 Thank You Kno.e.sis – Ohio Center of Excellence in Knowledge-enabled Computing Wright State University, Dayton, Ohio, USA Questions ? 09/06/2013iSemantics 2013