Text Based Similarity Metrics and Delta for Semantic Web Graphs
Krishnamurthy Viswanathan and Tim Finin, University of Maryland, Baltimore County

Motivation
- Text similarity is very useful in information retrieval for near-duplicate and similarity detection
- Identify ontology versions
- Generate delta between versions

Problem
- Given a collection of SW graphs as RDF documents, identify pairs of graphs that are similar
- Generate a delta for pairs of graphs identified as having a versioning relationship

Contributions
- Defined text-based similarity metrics characterizing relations between SW graphs
- Evaluated these metrics for three specific cases of similarity:
  Case 1: Same classes and properties used, but differ only in literal content
  Case 2: Differ only in base-URI
  Case 3: Different versions of the same SW graph. In addition, when this case is detected, generate a delta between the two versions

Approach
- Input: corpus of SWDs
- Convert to n-triples format
- Convert to canonical form
- Create reduced forms
- Compute text-based similarity metrics
- Identify pairs of similar documents

SW Graph Canonicalization
- Assigns uniform identifiers to blank nodes
- Provides a deterministic order to statements
- Empirical method that works for most examples

Original statements:
  <person:John> <a:livesIn> _:x .
  _:x <a:IsPartOf> "USA" .
  <person:John> <a:likes> "cheese" .
  _:x <a:hasCapital> _:y .

Blank nodes replaced by "~" and statements sorted:
  "~" <a:hasCapital> "~" . # _:x _:y
  "~" <a:IsPartOf> "USA" . # _:x
  <person:John> <a:likes> "cheese" .
  <person:John> <a:livesIn> "~" . # _:x

BNode Table
  Old bnode identifier    New bnode identifier
  _:y                     _:g1
  _:x                     _:g2

Canonical form:
  _:g2 <a:hasCapital> _:g1 .
  _:g2 <a:IsPartOf> "USA" .
  <person:John> <a:likes> "cheese" .
  <person:John> <a:livesIn> _:g2 .

Four reduced forms
- Only literals from the original n-triple file
- All non-literal content from the original n-triple file
- Base-URI of every node replaced by ""
- Literals and base-URIs replaced by ""

Classification
- Similarity metrics computed for each candidate pair
- Naïve Bayes classifier: Similarity in classes and properties
- Naïve Bayes/SVM classifier: Difference only in base-URI
- SVM classifier: Versioning relationship

Evaluation
- Three datasets of 400+ semantic web documents for training and testing
- 17 combinations of similarity metrics tested: Jaccard, Containment, Cosine similarity, Hamming distance between Simhash fingerprints

  Type of Similarity                    True Positives  False Positives  Precision  Recall
  Similarity in classes & properties    0.986           0.014            0.987
  Difference only in base URI           0.988           0.012
  Versioning Relationship               0.909           0.091            0.913

Generating Deltas
- Additive Delta: Version2 Except Version1
- Subtractive Delta: Version1 Except Version2
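
The four metrics named in the evaluation (Jaccard, Containment, Cosine similarity, and Hamming distance between Simhash fingerprints) can be illustrated with a minimal Python sketch. The poster does not specify how the text is tokenized or hashed, so the word-level shingles, the shingle size `k=3`, the MD5 feature hash, and the 64-bit fingerprint width are all assumptions made for illustration; the function names are hypothetical.

```python
import hashlib
import math
from collections import Counter


def shingles(text, k=3):
    """Return the set of k-word shingles of a text (k=3 is an assumed value)."""
    words = text.split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """|A & B| / |A | B| for two sets; taken as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def containment(a, b):
    """|A & B| / |A|: the fraction of A that also appears in B."""
    return len(a & b) / len(a) if a else 1.0


def cosine(text_a, text_b):
    """Cosine similarity between word-frequency vectors of two texts."""
    va, vb = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def simhash(text, bits=64):
    """Simhash fingerprint over word features (MD5 feature hash is an assumption)."""
    weights = [0] * bits
    for word, count in Counter(text.split()).items():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # Each feature votes +count or -count on every bit position.
            weights[i] += count if (h >> i) & 1 else -count
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming(x, y):
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")
```

In this sketch, each metric would be computed on a pair of canonicalized (or reduced-form) documents, and the resulting values would form the feature vector handed to the classifier for that candidate pair.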
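
The additive and subtractive deltas ("Version2 Except Version1" and "Version1 Except Version2") amount to set differences over statements. The sketch below assumes both versions are already in canonical n-triples form, one statement per line, so that identical statements compare equal as strings; `load_triples`, `delta`, and the sample data `V1` and `V2` are hypothetical names introduced for illustration.

```python
def load_triples(ntriples_text):
    """Return the set of statement lines, skipping blank and comment lines."""
    lines = (line.strip() for line in ntriples_text.splitlines())
    return {line for line in lines if line and not line.startswith("#")}


def delta(version1, version2):
    """Compute both deltas between two canonicalized n-triples documents."""
    t1, t2 = load_triples(version1), load_triples(version2)
    return {
        "additive": sorted(t2 - t1),     # Version2 Except Version1
        "subtractive": sorted(t1 - t2),  # Version1 Except Version2
    }


# Sample data: the canonical graph from the poster, with one literal changed in V2.
V1 = """
_:g2 <a:hasCapital> _:g1 .
_:g2 <a:IsPartOf> "USA" .
<person:John> <a:likes> "cheese" .
<person:John> <a:livesIn> _:g2 .
"""

V2 = """
_:g2 <a:hasCapital> _:g1 .
_:g2 <a:IsPartOf> "USA" .
<person:John> <a:likes> "wine" .
<person:John> <a:livesIn> _:g2 .
"""

if __name__ == "__main__":
    for kind, statements in delta(V1, V2).items():
        print(kind, statements)
```

Line-level set difference is only meaningful after canonicalization: without uniform blank node identifiers, two equivalent statements could differ textually and appear in both deltas.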