Similarity Measures in Deep Web Data Integration Fangjiao Jiang
Outline Motivation Brief Review on Existing Similarity Measures Challenges and Our Solutions Conclusion
Similarity measure — an essential point in data integration
Variations arise from:
  Representation: typographical errors, misspellings, abbreviations, etc.
    44 W. 4th St. vs. 44 West Fourth Street
    Smith vs. Smoth; Abroms vs. Abrams
    "KFC" vs. "Kentucky Fried Chicken"; "R. Smith" vs. "Richard Smith"
  Extraction: from unstructured or semi-structured documents or web pages
Similarity measure — an essential point in data integration
Similarity measures are applied to:
  Keyword search: from a keyword query interface to a structured query interface
  Schema matching: from the integrated query interface to the local query interfaces
  Result merging: duplicate record detection (field level)
[Figure: a keyword query Q = {key1, key2, ...} flows through the integrated interface to local interfaces, and the returned records are compared pairwise for merging]
Outline Motivation Brief Review on Existing Similarity Measures Challenges and Our Solutions Conclusion
Similarity methods
  String similarity
    Character-based: edit distance, affine gap distance, Smith-Waterman distance, Jaro distance metric, Q-gram distance
    Token-based: atomic strings, WHIRL, Q-grams with tf.idf
  Numeric data similarity: treated as strings
Edit distance
Edit distance, a.k.a. Levenshtein distance: the minimum number of edit operations (insertions, deletions, and substitutions) of single characters needed to transform string S1 into S2. It is computable by dynamic programming in O(|S1| × |S2|) time.
Example 1:
  S1: unnecessarily
  S2: unescessaraly
  EditDistance(S1, S2) = 3
Problem: last names, first names, and street names often do not agree on a character-by-character basis. For example, EditDistance("John R.Smith", "Johathan Richard Smith") = 11.
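The definition above translates directly into the standard dynamic-programming routine; a minimal sketch (not code from the talk):

```python
def edit_distance(s1, s2):
    """Levenshtein distance via dynamic programming, O(|s1| * |s2|) time."""
    m, n = len(s1), len(s2)
    # prev[j] holds the distance between the current prefix of s1 and s2[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1  # substitution cost
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match / substitution
        prev = curr
    return prev[n]
```

Running it on Example 1, `edit_distance("unnecessarily", "unescessaraly")` returns 3 (three substitutions).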
Affine gap distance
Adds two extra edit operations: open gap and extend gap.
  cost(g) = s + e × l  (e < s), where s is the cost of opening a gap, e is the cost of extending it, and l is the length of the gap in the alignment of the two strings.
This method is better when matching strings that have been truncated or shortened.
Example 2: "J. R. Smith" vs. "John Richard Smith"
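The gap-cost formula can be illustrated directly; the penalty values s = 2.0 and e = 0.5 below are assumed for illustration, not taken from the talk:

```python
def affine_gap_cost(length, open_cost=2.0, extend_cost=0.5):
    """cost(g) = s + e * l with e < s: opening a gap is expensive,
    extending an existing gap is cheap. Penalty values are illustrative."""
    return open_cost + extend_cost * length

# One gap of length 5 costs less than five separate gaps of length 1,
# which is why truncations like "J. R. Smith" align well as a few long gaps.
one_long = affine_gap_cost(5)        # 2.0 + 0.5 * 5 = 4.5
five_short = 5 * affine_gap_cost(1)  # 5 * (2.0 + 0.5) = 12.5
```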
Smith-Waterman distance
An extension of edit distance and affine gap distance in which mismatches at the beginning and the end of strings have lower cost than mismatches in the middle.
Example 3: "Prof. John R.Smith, University of Calgary" vs. "John R.Smith, Prof"
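Making leading and trailing mismatches cheap corresponds to local alignment; a minimal score-only sketch, with unit match/mismatch/gap weights assumed for illustration:

```python
def smith_waterman(s1, s2, match=1, mismatch=-1, gap=-1):
    """Best local alignment score: cells may reset to 0, so unmatched
    prefixes and suffixes (e.g. "Prof." or ", University of Calgary")
    carry no penalty."""
    prev = [0] * (len(s2) + 1)
    best = 0
    for i in range(1, len(s1) + 1):
        curr = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            diag = prev[j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

For instance, `smith_waterman("xxSmith", "Smithyy")` scores the shared "Smith" at 5, with no penalty for the differing ends.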
Jaro distance metric (1)
  Jaro(s1, s2) = 1/3 × (#common/str_len(s1) + #common/str_len(s2) + (#common − 0.5 × #transpositions)/#common)
Mainly used for comparing last and first names.
Example 4: "John R.Smith" vs. "Johathan Richard Smith"
Jaro distance metric (2)
The first enhancement gives partial credit (weight 0.3) for similar characters, e.g. scanning errors ("1" vs. "l") or keypunch errors ("V" vs. "B"):
  Jaro(s1, s2) = 1/3 × (#common/str_len1 + #common/str_len2 + (#common + 0.3 × #similar − 0.5 × #transpositions)/#common)
The second enhancement: agreement in the first few characters of a string is more important than agreement on the last few:
  Jaro' = Jaro + i × 0.1 × (1 − Jaro), where i is the length of the common prefix.
For example, Jaro'(abroms, abrams) = 0.9333 > Jaro'(lampley, campley) = 0.9048.
Studies showed that fewer errors typically occur at the beginning of a string, and that error rates by character position increase monotonically as the position moves to the right.
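Both the base metric and the prefix enhancement can be sketched as follows; the similar-character adjustment is omitted for brevity, so this version reproduces the ordering of the example above (and the 0.9048 figure) but not the exact 0.9333, which includes that adjustment:

```python
def jaro(s1, s2):
    """Jaro(s1, s2) = 1/3 (c/|s1| + c/|s2| + (c - 0.5 t)/c),
    where c = #common characters and t = #transpositions."""
    if s1 == s2:
        return 1.0
    window = max(len(s1), len(s2)) // 2 - 1
    used = [False] * len(s2)
    matched1 = []
    for i, ch in enumerate(s1):  # common characters within the match window
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not used[j] and s2[j] == ch:
                used[j] = True
                matched1.append(ch)
                break
    if not matched1:
        return 0.0
    matched2 = [s2[j] for j in range(len(s2)) if used[j]]
    c = len(matched1)
    t = sum(a != b for a, b in zip(matched1, matched2))  # transpositions
    return (c / len(s1) + c / len(s2) + (c - 0.5 * t) / c) / 3

def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro' = Jaro + i * 0.1 * (1 - Jaro), i = common-prefix length."""
    j = jaro(s1, s2)
    i = 0
    while i < min(len(s1), len(s2), max_prefix) and s1[i] == s2[i]:
        i += 1
    return j + i * p * (1 - j)
```

With this sketch, `round(jaro("lampley", "campley"), 4)` gives 0.9048 as on the slide, and `jaro_winkler("abroms", "abrams")` scores higher thanks to the shared "abr" prefix.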
Q-gram distance
Let q be an integer. Given a string s, the set of q-grams of s, denoted G(s), is obtained by sliding a window of length q over the characters of s.
For example, with q = 2:
  G("Harrison Ford") = {'Ha', 'ar', 'rr', 'ri', 'is', 'so', 'on', 'n ', ' F', 'Fo', 'or', 'rd'}
  G("Harison Fort") = {'Ha', 'ar', 'ri', 'is', 'so', 'on', 'n ', ' F', 'Fo', 'or', 'rt'}
  Distance(s1, s2) = 1 − |G(s1) ∩ G(s2)| / |G(s1) ∪ G(s2)|
  Distance("Harrison Ford", "Harison Fort") = 1 − 10/13 ≈ 0.23
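The q-gram set and the distance formula translate directly into code (a minimal sketch):

```python
def qgrams(s, q=2):
    """Set of q-grams obtained by sliding a window of length q over s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def qgram_distance(s1, s2, q=2):
    """1 - |G(s1) & G(s2)| / |G(s1) | G(s2)|, a Jaccard-style distance."""
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    return 1 - len(g1 & g2) / len(g1 | g2)
```

`qgram_distance("Harrison Ford", "Harison Fort")` evaluates to 3/13 ≈ 0.23, matching the example.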
Character-based vs. token-based metrics
Character-based metrics (edit distance, affine gap distance, Smith-Waterman distance, Jaro distance metric, Q-gram distance)
  Advantages: work well for estimating distance between strings that differ due to typographical errors or abbreviations
  Disadvantages: expensive and less accurate for longer strings
Token-based metrics (atomic strings, WHIRL, Q-grams with tf.idf)
  View a string as a "bag of tokens", disregarding the order in which the tokens occur
WHIRL
Separates each string into words, and each word w is assigned a tf.idf weight.
For example, rare words such as "AT&T" or "IBM" will have higher weights, while frequent words such as "Inc" will have lower weights.
The similarity of s1 and s2 is defined as the cosine of their weighted word vectors, so "John Smith" and "Mr. John Smith" have similarity close to one.
Problem: "Compter Science Department" and "Deprtment of Computer Scence" (with spelling errors) will have zero similarity.
Q-grams with tf.idf
Extends the WHIRL system to handle spelling errors by using q-grams, instead of words, as tokens.
A spelling error minimally affects the set of common q-grams of two strings, so "Gteway Communications" and "Comunications Gateway" have high similarity under this metric.
Summary of similarity methods
  Character-based metrics suit shorter strings with minor variations and typographical errors.
  Token-based metrics suit longer strings, disregarding the order of tokens.
  Affine gap and Smith-Waterman distances handle prefix/suffix abbreviation.
  Numeric data is simply treated as strings.
Outline Motivation Brief Review on Existing Similarity Measures Challenges and Our Solutions Conclusion
Challenges (1): choosing a string similarity metric
Which metric should be chosen for a particular data domain?
Current work on string similarity mainly adopts the edit distance method. Because different fields have different characteristics, accurate similarity computation requires an appropriate string similarity metric for each field of the database with respect to its particular data domain.
Challenges (2): numeric data similarity
Current search engines treat numbers as strings, ignoring their numeric values. For example, a search for 6798.32 on Google yielded two pages that correctly associate this number with the lunar nutation cycle, but a search for 6798.320 on Google found no page. Searches for 6798.320 on AltaVista, AOL, HotBot, Lycos, MSN, Netscape, Overture, and Yahoo! also found no page about the lunar nutation cycle.
Numeric data similarity measures
Features to compare:
  Relative value
  A set of discrete numbers vs. another set of discrete numbers (analogous to Q-gram distance)
  A range of values vs. another range of values (overlap degree)
  The maximum and average values of the numbers
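The range-overlap feature can be sketched as a Jaccard measure on intervals; this helper is a hypothetical illustration, not a method from the talk:

```python
def range_overlap(r1, r2):
    """Overlap degree of two numeric ranges (lo, hi):
    |intersection| / |union|, i.e. Jaccard similarity on intervals."""
    inter = max(0.0, min(r1[1], r2[1]) - max(r1[0], r2[0]))
    union = (r1[1] - r1[0]) + (r2[1] - r2[0]) - inter
    return inter / union if union else 1.0
```

For example, a price range (0, 10) vs. (5, 15) overlaps with degree 5/15 ≈ 0.33, while disjoint ranges score 0.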
Challenges (3): lexical vs. semantic heterogeneity
The string similarity methods above address lexical heterogeneity; semantic heterogeneity remains unsolved.
Examples
Lexical heterogeneity:
  Smith vs. Smoth; Abroms vs. Abrams; Lampley vs. Campley
  44 W. 4th St. vs. 44 West Fourth Street
  "John R.Smith" vs. "Johathan Richard Smith"
Semantic heterogeneity:
  "John Smith" vs. "Smith, John"
  "President of the U.S." vs. "George W. Bush"
  "Prof. John R.Smith" vs. "John R.Smith, Prof"
  1 ATT Way, Bedminster, NJ vs. 900 Route 202/206, Bedminster, NJ
  "Departure" vs. "Leaving from"
Semantic heterogeneity: WordNet
Semantic relationships: synonymy, hyponymy/hypernymy, meronymy
WordNet 1.7.1 coverage:
  Nouns: 109,195
  Verbs: 11,088
  Adjectives: 21,460
  Adverbs: 4,607
  Total: 146,350
Alternative: construct semantic relationships manually.
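Matching via such relationships can be sketched with a hand-built synonym table standing in for WordNet lookups; the entries below are illustrative examples, not real WordNet data:

```python
# Hypothetical, manually constructed synonym table (a stand-in for
# WordNet synonymy lookups); entries are illustrative only.
SYNONYMS = {
    "departure": {"leaving from"},
    "prof": {"professor"},
}

def semantically_equal(a, b):
    """True if two labels match lexically or via the synonym table."""
    a, b = a.strip().lower(), b.strip().lower()
    return a == b or b in SYNONYMS.get(a, set()) or a in SYNONYMS.get(b, set())
```

This makes "Departure" match "Leaving from" even though their string similarity under any lexical metric is low.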
Conclusion
Similarity measure is an essential point in data integration; it is applied to keyword search, schema matching, and result merging.
Open issues:
  Which string similarity metric should be chosen from the existing methods for a particular data domain?
  We need effective numeric data similarity measures.
  We need ontology tools, such as WordNet, to handle semantic heterogeneity.