Text Based Similarity Metrics and Delta for Semantic Web Graphs. Krishnamurthy Koduvayur Viswanathan. Monday, June 28, 2010.

Presentation transcript:

Text Based Similarity Metrics and Delta for Semantic Web Graphs. Krishnamurthy Koduvayur Viswanathan. Monday, June 28, 2010.

Contributions: Define text-based similarity metrics that characterize the relationship between semantic web graphs. Evaluate the similarity metrics for three specific cases of similarity that we defined. Generate a delta between pairs of SW graphs that may be two versions of the same graph. Prototype the techniques in a new system called Similis. Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Motivation: Near-duplicate detection for the SW?

Goals: Explore the different ways in which two SW graphs may be similar to each other. In particular, evaluate the specific use case of versioning relations between SW graphs. Additionally, develop techniques to generate a delta between versions.

Comparison with near-duplicate text document detection: In a text document, the order of the content is important, and the meaning of the text is not part of the problem, only the textual encoding of that meaning. For a SWD, the order is not deterministic, i.e. equivalent SWDs may have different statement orderings, and blank node identifiers are also non-deterministic.

Semantic Web Document (SWD): An RDF representation of a Semantic Web graph, such as: a document-based serialization of a SW graph on the web (ontology or data file); a document-based serialization of the result of a SPARQL query on a triple store; a document-based serialization of structured metadata extracted from an HTML page using RDFa.

Semantic Web Graph Similarity: The archive or the Swoogle search engine (Ding et al. 2004) shows several examples of how ontologies and RDF documents evolve over time. Kinds of similarity between two SW graphs: same classes and properties used, differing only in literal content; different only in the base-URIs of the entities used; different versions of the same semantic web graph.

Similarity in Classes and Properties: Two semantic web graphs that differ only in the literal content.

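Different in Literal Content (example): two graphs with the same structure whose literals differ, “Eric Miller” and “Dr” in one versus “John Doe” and “Mr” in the other. [The URIs of the example triples were lost in extraction.]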

Different only in base-URI (example): [n-triples listings lost in extraction; only the blank node identifiers _:g103, _:g104, and _:g105 survive.]

Versioning Relationship: Two semantic web documents have a versioning relationship if they are variants of the same semantic web graph. Variants arise from the dynamic nature of the web, i.e. content keeps getting modified. Minor changes: spelling corrections, punctuation, etc. Major changes: changes that affect the semantic content.

Problem Definition, Problem 1: Given a collection of semantic web graphs in the form of RDF documents, characterize the similarity between pairs into one or more of three cases: the pair uses the same classes and properties and differs only in the literal content; the pair differs only in the base-URI used; the pair consists of different versions of the same graph, i.e. has a versioning relationship.

Problem Definition, Problem 2: Generate a delta between pairs that have been identified as having a versioning relationship.

Approach (pipeline): Input: corpus of SWDs → convert to n-triples format → convert to canonical form → generate reduced forms → compute text-based similarity metrics → build feature vectors for each pair → characterize similarity between pairs → identify versions → generate delta between versions.

Convert to n-triples

Convert to Canonical Form: Comparison methods may be affected by blank node identifiers and statement ordering. Canonicalization assigns consistent IDs to blank nodes and orders the statements lexicographically, so that two semantically equivalent graphs are transformed into the same canonical representation. Based on: Carroll, J. J. Signing RDF graphs. In 2nd ISWC, volume 2870 of LNCS, 5–15. Springer.

Convert to Canonical Form (example): [Worked example partly lost in extraction. The surviving fragments show blank nodes replaced by “~” with the original identifier kept in a trailing comment (e.g. # _:x), the literals ”USA” and ”cheese”, and the relabeled output.] BNode table: old blank node identifier _:y becomes _:g1; old blank node identifier _:x becomes _:g2.
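The canonicalization step can be sketched roughly as follows. This is a simplified illustration in the spirit of the one-pass approach described above, not the Similis implementation: the function name canonicalize, the blank node regular expression, and the _:gN naming scheme are assumptions, and the sketch presumes that no literal contains text resembling a blank node label and that the graph has a unique canonical form.

```python
import re

# Assumed pattern for blank node labels in n-triples text.
BNODE = re.compile(r"_:[A-Za-z0-9]+")

def canonicalize(ntriples_lines):
    """Return (canonical statements, blank node table) for a list of n-triples lines."""
    # Mask every blank node label so the ordering ignores arbitrary identifiers.
    masked = [(BNODE.sub("~", line.strip()), line.strip())
              for line in ntriples_lines if line.strip()]
    # Order statements lexicographically by their masked text.
    masked.sort()
    # Assign new identifiers in order of first appearance in the sorted statements.
    table = {}

    def rename(match):
        old = match.group(0)
        if old not in table:
            table[old] = "_:g%d" % (len(table) + 1)
        return table[old]

    relabeled = [BNODE.sub(rename, original) for _, original in masked]
    # Sort once more so the output order reflects the new identifiers.
    return sorted(relabeled), table
```

Two serializations of the same graph that differ only in statement order and blank node labels should yield the same list of statements.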

Limitation of the Algorithm: Non-Distinctive Triples. The algorithm can only handle graphs that contain no non-distinctive triples. Non-distinctive triples are the triples in the graph that cannot be uniquely identified when all the blank nodes are treated as equal.

Graphs with Non-Distinctive Triples: For a group of n non-distinctive triples, there are n! ways of renaming the blank nodes. For graphs with non-distinctive triples, a single unique canonical form does not exist. To compare two such graphs, each possible canonical form of one graph must be compared with each possible canonical form of the other. Number of comparisons: O(m!n!). Similis throws an exception when it finds a graph with multiple canonical forms.
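One way to detect this situation is to mask the blank nodes and look for statements that become identical. The sketch below illustrates that check under assumptions: the names find_non_distinctive, require_unique_canonical_form, and NonDistinctiveTriplesError are hypothetical, and how Similis itself detects the condition is not stated in the slides.

```python
import re
from collections import Counter

# Assumed pattern for blank node labels in n-triples text.
BNODE = re.compile(r"_:[A-Za-z0-9]+")

class NonDistinctiveTriplesError(ValueError):
    """Raised when a graph has no single canonical form (hypothetical name)."""

def find_non_distinctive(ntriples_lines):
    """Return masked statements shared by more than one blank-node triple."""
    counts = Counter()
    for line in ntriples_lines:
        masked, replaced = BNODE.subn("~", line.strip())
        if replaced:  # only statements that mention a blank node matter here
            counts[masked] += 1
    return sorted(stmt for stmt, n in counts.items() if n > 1)

def require_unique_canonical_form(ntriples_lines):
    """Raise if the graph contains non-distinctive triples."""
    clashes = find_non_distinctive(ntriples_lines)
    if clashes:
        raise NonDistinctiveTriplesError(
            "%d group(s) of non-distinctive triples" % len(clashes))
```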

Graphs with Non-Distinctive Triples: In a sample of 1200 randomly collected SW documents, only a small percentage of SW graphs (13%) did not have a unique canonical form.

Generating Reduced Forms: The canonical form of each SW graph is broken down into a number of reduced forms. These reduced forms are used to characterize the relationship between pairs of SW graphs. Anatomy of a triple (figure): each entity URI consists of a base URI followed by a local name.

Only-Literals Reduced Form: Contains only the literals from the original n-triples file. Lets us compare only the textual content within a graph, separated from the rest of the graph.

No-Literals Reduced Form: All the literals from the canonical form are replaced by an empty string. Lets us compare only the classes and properties used, regardless of literal content.

Local-Name Reduced Form: The base-URI of every node in the canonical form is replaced by an empty string. Lets us compare only the local names of the classes and properties used.

Local-Name-No-Literal Reduced Form: All the literals and the base-URI of every node are replaced by an empty string. Lets us compare the non-literal content of two SW graphs.
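The four reduced forms can be illustrated with a small sketch over n-triples lines. Everything here is an assumption made for illustration: the function names, the tokenizing regular expression, the choice to split a URI at its last "#" or "/" to obtain the local name, and the choice to render an emptied literal as "".

```python
import re

# Matches either a literal (with optional datatype or language tag) or a URI.
TOKEN = re.compile(
    r'(?P<lit>"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[A-Za-z0-9-]+)?)|<(?P<uri>[^>]*)>')

def reduce_line(line, drop_literals=False, drop_base=False):
    """Rewrite one n-triples statement, optionally blanking literals and base URIs."""
    def repl(match):
        if match.group("lit") is not None:
            return '""' if drop_literals else match.group("lit")
        uri = match.group("uri")
        if drop_base:
            uri = re.split(r"[#/]", uri)[-1]  # keep only the local name
        return "<%s>" % uri
    return TOKEN.sub(repl, line)

def only_literals(line):
    """Return just the literals that occur in one statement."""
    return [m.group("lit") for m in TOKEN.finditer(line) if m.group("lit") is not None]

def reduced_forms(canonical_lines):
    """Build the four reduced forms from the canonical form of a graph."""
    return {
        "only_literals": [lit for l in canonical_lines for lit in only_literals(l)],
        "no_literals": [reduce_line(l, drop_literals=True) for l in canonical_lines],
        "local_name": [reduce_line(l, drop_base=True) for l in canonical_lines],
        "local_name_no_literal": [
            reduce_line(l, drop_literals=True, drop_base=True) for l in canonical_lines],
    }
```

Together with the canonical form itself, these give the five forms over which the metrics are computed.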

Similarity/Distance Metrics Used: Cosine similarity between SWD vectors; Jaccard and containment metrics; Hamming distance between simhash fingerprints.

Computation of Pairwise Metrics: Compute cosine similarity between the canonical forms and between the local-name forms of each pair in the collection. If cosine similarity < 0.7, remove the pair from further consideration. Otherwise, compute all other metrics for all the forms (5 forms * 3 metrics = 15 specific metrics). Total of 17 metrics computed (the 2 cosine similarities plus the 15 form-specific metrics).

Cosine Similarity Between Term Vectors: Each SWD containing terms $T_j = \{t_1, t_2, \ldots, t_n\}$ is treated as a vector $V_j = (\gamma_1 t_1, \gamma_2 t_2, \ldots, \gamma_n t_n)$, where each $\gamma_i$ is the weight associated with term $t_i$. Non-blank, non-literal nodes are used as features, and term frequency (TF) is used as the weight. Two vectors are built for each SWD: one uses full entity URIs as features, the other uses the local names of terms. The resulting similarity indicates similarity in classes and properties.
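A minimal sketch of this metric, assuming that terms are extracted from n-triples lines with a simple URI pattern and that the local name is whatever follows the last "#" or "/". The helper names term_vector and cosine are hypothetical, and the sketch does not filter out datatype URIs.

```python
import math
import re
from collections import Counter

URI = re.compile(r"<([^>]*)>")

def term_vector(ntriples_lines, local_names=False):
    """Term-frequency vector over the URIs (or their local names) in a document."""
    counts = Counter()
    for line in ntriples_lines:
        for uri in URI.findall(line):
            term = re.split(r"[#/]", uri)[-1] if local_names else uri
            counts[term] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

For two graphs that differ only in base-URI, the full-URI vectors share no terms while the local-name vectors coincide, which is why both vectors are useful.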

SW Document Vectors: [Two example term-frequency tables with columns Term and Freq; contents lost in extraction.]

Jaccard and Containment: Computed for all five forms for a candidate pair of SW graphs (5 * 2 = 10 metrics). Construct sets of character 4-grams for each document. The 4-grams are computed by running a four-character-wide window over the text representation.
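As an illustration, the two set metrics over character 4-grams might look like the sketch below. The slides do not define containment; the sketch assumes the usual asymmetric definition, the fraction of the first document's 4-grams that also occur in the second. Function names are hypothetical.

```python
def char_ngrams(text, n=4):
    """Set of character n-grams from a sliding window of width n."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(text_a, text_b, n=4):
    """|A ∩ B| / |A ∪ B| over the n-gram sets of two documents."""
    a, b = char_ngrams(text_a, n), char_ngrams(text_b, n)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def containment(text_a, text_b, n=4):
    """|A ∩ B| / |A|: how much of document A is contained in document B (assumed definition)."""
    a, b = char_ngrams(text_a, n), char_ngrams(text_b, n)
    return len(a & b) / len(a) if a else 1.0
```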

Hamming Distance between Simhash Fingerprints: Simhash fingerprints of similar documents differ in a small number of bit positions. Tokenize documents into character 3-grams. Compute the simhash fingerprint for each document in the pair (we implemented 128-bit fingerprints). Find the Hamming distance between the fingerprints. Computed for all five forms for a candidate pair of SW graphs (5 * 1 = 5 metrics).
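A compact sketch of a 128-bit simhash over character 3-grams follows. The choice of MD5 as the per-token hash (convenient because it yields exactly 128 bits) and the uniform token weight of 1 are assumptions; the slides do not say which hash or weighting Similis uses.

```python
import hashlib

BITS = 128  # fingerprint width stated in the slides

def simhash(text, n=3):
    """128-bit simhash fingerprint built from character n-grams."""
    weights = [0] * BITS
    for i in range(len(text) - n + 1):
        gram = text[i:i + n]
        # MD5 is used only as a convenient 128-bit hash (an assumption).
        h = int.from_bytes(hashlib.md5(gram.encode("utf-8")).digest(), "big")
        for bit in range(BITS):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(BITS):
        if weights[bit] > 0:  # a majority of tokens set this bit
            fingerprint |= 1 << bit
    return fingerprint

def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```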

Classification: Similarity metrics are computed for each candidate pair and assembled into a feature vector. A Naïve Bayes classifier detects similarity in classes and properties; a Naïve Bayes/SVM classifier detects difference only in base-URI; an SVM classifier detects a versioning relationship. [Figure: example feature vector used for determining versioning relationship.]

Computing Delta Between Two Versions: For pairs that the SVM classifier identifies as having a versioning relationship, the delta between Version1 and Version2 has two parts. Subtractive delta: Version1 except Version2. Additive delta: Version2 except Version1.

Raw Delta: Statement-by-statement comparison between the canonical forms of the two SWDs. Only local names of entities are compared.
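Read as set operations, the raw delta amounts to two set differences. The sketch below assumes both inputs are already canonicalized and reduced to local names, one statement per string; the function name raw_delta and the dictionary keys are illustrative.

```python
def raw_delta(version1_lines, version2_lines):
    """Statement-level delta between two canonical, local-name statement lists."""
    v1, v2 = set(version1_lines), set(version2_lines)
    return {
        "subtractive": sorted(v1 - v2),  # Version1 except Version2
        "additive": sorted(v2 - v1),     # Version2 except Version1
    }
```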

Delta After Deductive Closure (pipeline): SWGv1, SWGv2 → compute deductive closure → canonicalize → generate raw delta.

Delta After Deductive Closure: If $O$ is a set of propositions, $p \in O$ and $p \models q$, then $q \in O$.
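To make the closure condition concrete, here is a toy fixed-point computation. The slides do not say which entailment rules Similis applies; this sketch assumes just two RDFS rules (transitivity of rdfs:subClassOf and propagation of rdf:type along rdfs:subClassOf) and represents triples as plain Python tuples with abbreviated predicate names.

```python
SUBCLASS = "rdfs:subClassOf"
TYPE = "rdf:type"

def deductive_closure(triples):
    """Add entailed triples until nothing new can be derived (two rules only)."""
    closed = set(triples)
    while True:
        derived = set()
        for (sub, pred, sup) in closed:
            if pred != SUBCLASS:
                continue
            for (s, p, o) in closed:
                if p == SUBCLASS and s == sup:
                    derived.add((sub, SUBCLASS, o))  # subclass transitivity
                if p == TYPE and o == sub:
                    derived.add((s, TYPE, sup))      # type propagation
        if derived <= closed:
            return closed
        closed |= derived
```

Computing the delta on the closures rather than on the raw graphs avoids reporting statements that were merely made explicit in one version but were already entailed in the other.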

Delta at Concept Level: Works only for ontologies. Groups of class/property definitions are serialized into individual graphs. Corresponding graphs in the two versions are compared to each other.

Concept Level Delta: example [figure not preserved in the transcript]

Detecting Class Renaming (example): [figure not preserved in the transcript; only the class name “Sauterne” survives.]

Detecting Class Renaming: Input: local names of entities in both diffs. (1) Generate 3-gram sets for each entity. (2) Compute the 3-gram overlap between sets in the additive and subtractive deltas. (3) If overlap > 0.7, add (oldname, newname) to the candidate set. (4) Replace oldname in the subtractive delta by newname. (5) Check for the presence of all modified statements in the additive delta.
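The renaming procedure can be sketched as follows. The slides do not specify how the 3-gram overlap is measured; this sketch assumes the Jaccard coefficient of the two 3-gram sets. It also assumes statements are written with local names in angle brackets, and the function names and the “Sauternes” example are hypothetical.

```python
import re

NAME = re.compile(r"<([^>]*)>")

def trigrams(name):
    """Set of character 3-grams of a local name."""
    return {name[i:i + 3] for i in range(len(name) - 2)}

def overlap(a, b):
    """Jaccard coefficient of the two 3-gram sets (assumed overlap measure)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

def detect_renames(subtractive, additive, threshold=0.7):
    """Return (oldname, newname) pairs that explain statements in both deltas."""
    added = set(additive)
    old_names = {n for s in subtractive for n in NAME.findall(s)}
    new_names = {n for s in additive for n in NAME.findall(s)}
    renames = []
    for old in sorted(old_names - new_names):
        for new in sorted(new_names - old_names):
            if overlap(old, new) <= threshold:
                continue
            # Rewrite the removed statements that mention the old name.
            modified = [s.replace("<%s>" % old, "<%s>" % new)
                        for s in subtractive if "<%s>" % old in s]
            # Accept the rename only if every rewritten statement was added.
            if all(m in added for m in modified):
                renames.append((old, new))
    return renames
```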

Data-set: Using Swoogle’s SW Wayback Machine. Swoogle caches multiple snapshots for each indexed semantic web document. Labeling for versions: we extract such snapshots from Swoogle’s cache and label these pairs as versions.

Evaluation: Pairs that Differ in Literal Content. Features used for classification: LocalNameCosineSim, CosineSim, LocalNameNoLiteralJaccard, LocalNameNoLiteralSimhash. The training set from the Swoogle archive included 806 positive pairs and 806 negative pairs.

Evaluation: Pairs that Differ in Literal Content. Results of 10-fold stratified cross-validation using a Naïve Bayes classifier: [results table not preserved in the transcript]

Evaluation: Pairs that Differ in Literal Content. Results of using an SVM with all of the features, instead of manually selected features: [results table not preserved in the transcript]. Attribute relevance ranking: [ranking not preserved in the transcript]

Evaluation: Pairs that Differ in Base-URI. Features for classification: CosineSim, LocalNameCosineSim, LocalNameNoLiteralJaccard, LocalNameNoLiteralContainment, OnlyLiteralContainment, OnlyLiteralJaccard. The training set contained 100 positive examples and 100 negative examples.

Evaluation: Pairs that Differ in Base-URI. 10-fold cross-validation using Naïve Bayes: [results table not preserved in the transcript]. 10-fold cross-validation using an SVM (linear kernel): [results table not preserved in the transcript]

Evaluation: Pairs with a Versioning Relationship. 124 training instances from the Swoogle data-set. Highly dynamic pairs were filtered from consideration.

Evaluation: Pairs with a Versioning Relationship. Test dataset: 160 instances (50% positive, 50% negative). Classification results using an SVM (linear kernel): [results table not preserved in the transcript]

Correctness of Delta Computation: For any two versions $K$ and $K'$ of a SW graph, it holds that $\Delta_x(K \to K')\,K \equiv K'$, i.e. applying the delta to $K$ yields a graph equivalent to $K'$. We check this condition programmatically for each delta generated.
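A programmatic check of this condition could look like the sketch below. It assumes graphs are represented as collections of canonical statements, that the delta is the pair of set differences described earlier, and that equivalence is tested as equality of the statement sets. All function names are illustrative rather than taken from Similis.

```python
def compute_delta(k, k_prime):
    """Return (subtractive, additive) statement sets for the change K -> K'."""
    k, k_prime = set(k), set(k_prime)
    return k - k_prime, k_prime - k

def apply_delta(k, subtractive, additive):
    """Remove the subtractive statements from K, then add the additive ones."""
    return (set(k) - set(subtractive)) | set(additive)

def delta_is_correct(k, k_prime):
    """Check that applying the delta to K reproduces K'."""
    subtractive, additive = compute_delta(k, k_prime)
    return apply_delta(k, subtractive, additive) == set(k_prime)
```

With a pure set-difference delta the check always passes; it becomes a meaningful safeguard once the delta is post-processed, for example after rename detection or closure computation.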

Conclusion: Defined text-based similarity metrics that characterize the relationship between semantic web graphs. Evaluated the similarity metrics for three specific cases of similarity that we defined. Generated deltas between pairs of SW graphs that may be two versions of the same graph. Prototyped the techniques in a new system called Similis.

Future Directions: Scalability. Content of the delta generated. Standard ontologies to describe the delta and to describe the relationship between a pair of SW graphs. Detecting the direction of change between two versions.