Similarity Measures in Deep Web Data Integration


Similarity Measures in Deep Web Data Integration Fangjiao Jiang

Outline
  Motivation
  Brief Review on Existing Similarity Measures
  Challenges and Our Solutions
  Conclusion

Similarity measure: an essential point in data integration
Variations arise from:
  Representation: typographical errors, misspellings, abbreviations, etc.
  Extraction: from unstructured or semi-structured documents or web pages
Examples:
  "44 W. 4th St." vs. "44 West Fourth Street"
  "Smith" vs. "Smoth"
  "Abroms" vs. "Abrams"
  "KFC" vs. "Kentucky Fried Chicken"
  "R. Smith" vs. "Richard Smith"

Similarity measure: an essential point in data integration
Similarity measures will be applied to:
  Keyword search: from the keyword query interface to the structured query interface
  Schema matching: from the integrated query interface to the local query interfaces
  Result merge: duplicate record detection (field level)
[Figure: keyword query Q = {key1, key2, ...}; integrated interface = {<C1,V1>, <C2,V2>, ...}; local interface = {<L1,v1>, <L2,v2>, ...}; result records Record1 to Record4.]

Similarity methods
  String similarity
    Character-based: edit distance, affine gap distance, Smith-Waterman distance, Jaro distance metric, q-gram distance
    Token-based: atomic strings, WHIRL, q-grams with tf.idf
  Numeric data similarity
    Treated as strings

Edit distance
Edit distance, a.k.a. Levenshtein distance: the minimum number of single-character edit operations (insertions, deletions, and substitutions) needed to transform string S1 into string S2.
Example 1:
  S1: unnecessarily
  S2: unescessaraly
  Edit distance(S1, S2) = 3
Time complexity: O(|S1| · |S2|)
Problem: it performs poorly on last names, first names, and street names that do not agree character by character. For example, Edit distance("John R.Smith", "Johathan Richard Smith") = 11.
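
The definition above can be sketched as a standard dynamic program. This is a minimal illustration assuming unit cost for every operation; the function name `edit_distance` is chosen here for illustration and does not come from the slides.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    # prev[j] is the distance between the already processed prefix of s1 and s2[:j].
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        curr = [i]  # turning s1[:i] into the empty string takes i deletions
        for j, c2 in enumerate(s2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # substitute (free when characters agree)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("unnecessarily", "unescessaraly"))
print(edit_distance("John R.Smith", "Johathan Richard Smith"))
```

The table has (|S1| + 1) × (|S2| + 1) cells, which is where the quadratic running time comes from; keeping only two rows reduces memory to O(|S2|).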

Affine gap distance
Two extra edit operations: open gap and extend gap.
cost(g) = s + e × l (e < s), where s is the cost of opening a gap, e is the cost of extending a gap, and l is the length of the gap in the alignment of the two strings.
This method works better when the strings to be matched have been truncated or shortened.
Example 2: "J. R. Smith" vs. "John Richard Smith"
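
A small worked example of the gap cost formula may help. The values s = 3 and e = 1 are hypothetical, chosen only to satisfy e < s; the slides do not give concrete costs.

```python
def gap_cost(length: int, s: float = 3.0, e: float = 1.0) -> float:
    """Cost of one contiguous gap of the given length: cost(g) = s + e * l."""
    if length <= 0:
        return 0.0  # no gap, no cost
    return s + e * length


# "Richard" shortened to "R": the missing "ichard" is one gap of length 6.
one_long_gap = gap_cost(6)         # 3 + 1 * 6
six_short_gaps = 6 * gap_cost(1)   # 6 * (3 + 1 * 1)
print(one_long_gap, six_short_gaps)
```

Because opening a gap is more expensive than extending one, a single long gap (as produced by an abbreviation) is penalized far less than the same number of scattered single-character gaps.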

Smith-Waterman distance
An extension of edit distance and affine gap distance: mismatches at the beginning and the end of strings have lower cost than mismatches in the middle.
Example 3: "Prof.John R.Smith,University of Calgary" vs. "John R.Smith, Prof"
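
The idea of ignoring mismatches at the ends can be illustrated with a simplified local alignment score. This sketch assumes a linear gap penalty rather than affine gaps, and the scores (match = 2, mismatch = -1, gap = -1) are hypothetical values not taken from the slides.

```python
def smith_waterman(s1: str, s2: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between s1 and s2 (linear gap penalty)."""
    prev = [0] * (len(s2) + 1)
    best = 0
    for c1 in s1:
        curr = [0]
        for j, c2 in enumerate(s2, start=1):
            score = max(
                0,                                                # restart: drop a bad prefix
                prev[j - 1] + (match if c1 == c2 else mismatch),  # align c1 with c2
                prev[j] + gap,                                    # c1 aligned to a gap
                curr[j - 1] + gap,                                # c2 aligned to a gap
            )
            curr.append(score)
            best = max(best, score)  # the alignment may end anywhere
        prev = curr
    return best


print(smith_waterman("Prof.John R.Smith,University of Calgary",
                     "John R.Smith, Prof"))
```

Resetting negative scores to zero lets the alignment start after the unmatched prefix, and taking the maximum over all cells lets it stop before the unmatched suffix, so the shared "John R.Smith," core dominates the score.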

Jaro distance metric (1)
Jaro(s1, s2) = 1/3 (#common/str_len(s1) + #common/str_len(s2) + 0.5 #transpositions/#common)
Mainly used for comparing last and first names.
Example 4: "John R.Smith" vs. "Johathan Richard Smith."

Jaro distance metric (2)
The first enhancement gives partial credit to similar characters:
  Jaro(s1, s2) = 1/3 (#common/str_len1 + #common/str_len2 + 0.3 #similar/#common + 0.5 #transpositions/#common)
  Examples: scanning errors ("1" versus "l"), keypunch errors ("V" versus "B")
The second enhancement: agreement in the first few characters of a string is more important than agreement in the last few.
  Jaro' = Jaro + i × 0.1 × (1 − Jaro), where i is the number of leading characters on which the two strings agree
  For example, Jaro'(abroms, abrams) = 0.9333 > Jaro'(lampley, campley) = 0.9048
Studies showed that fewer errors typically occur at the beginning of a string, and that error rates by character position increase monotonically as the position moves to the right.
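
The prefix enhancement can be sketched on top of a basic Jaro score. This is a minimal illustration, not the exact variant used on the slide: it assumes the commonly cited form of Jaro whose third term is (#common − #transpositions)/#common, it caps the prefix length at 4 (an assumption following the usual Jaro-Winkler convention), and it omits the similar-character credit, so its values need not match the figures quoted above.

```python
def jaro(s1: str, s2: str) -> float:
    """Basic Jaro similarity (commonly cited form, no similar-character credit)."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    # Characters count as common only if they lie within this window of each other.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    common = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len(s2), i + window + 1)):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                common += 1
                break
    if common == 0:
        return 0.0
    # Common characters that appear in a different order, counted in pairs.
    m1 = [c for c, m in zip(s1, matched1) if m]
    m2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(m1, m2)) / 2
    return (common / len(s1) + common / len(s2)
            + (common - transpositions) / common) / 3


def jaro_prefix(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro' = Jaro + i * p * (1 - Jaro), with i the agreeing prefix length."""
    base = jaro(s1, s2)
    i = 0
    for a, b in zip(s1[:max_prefix], s2[:max_prefix]):
        if a != b:
            break
        i += 1
    return base + i * p * (1 - base)


print(round(jaro_prefix("abroms", "abrams"), 4))
print(round(jaro_prefix("lampley", "campley"), 4))
```

"lampley" and "campley" disagree on the first character, so i = 0 and the prefix bonus leaves the basic score unchanged, while "abroms" and "abrams" are rewarded for their shared beginning.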

Q-gram distance
Let q be an integer. Given a string s, the set of q-grams of s, denoted G(s), is obtained by sliding a window of length q over the characters of s.
For example, if q = 2:
  G("Harrison Ford") = {'Ha', 'ar', 'rr', 'ri', 'is', 'so', 'on', 'n ', ' F', 'Fo', 'or', 'rd'}
  G("Harison Fort") = {'Ha', 'ar', 'ri', 'is', 'so', 'on', 'n ', ' F', 'Fo', 'or', 'rt'}
Distance(s1, s2) = 1 − |G(s1) ∩ G(s2)| / |G(s1) ∪ G(s2)|
Distance("Harrison Ford", "Harison Fort") = 1 − 10/13 ≈ 0.23
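
The sliding-window construction and the 1 − |G(s1) ∩ G(s2)| / |G(s1) ∪ G(s2)| formula translate directly into set operations. The function names below are illustrative, not from the slides.

```python
def qgrams(s: str, q: int = 2) -> set:
    """Set of all substrings of length q obtained by sliding a window over s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def qgram_distance(s1: str, s2: str, q: int = 2) -> float:
    """1 - |G(s1) & G(s2)| / |G(s1) | G(s2)|; 0 means identical q-gram sets."""
    g1, g2 = qgrams(s1, q), qgrams(s2, q)
    union = g1 | g2
    if not union:
        return 0.0  # both strings are shorter than q
    return 1 - len(g1 & g2) / len(union)


print(sorted(qgrams("Harrison Ford")))
print(round(qgram_distance("Harrison Ford", "Harison Fort"), 2))
```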

Character-based metrics
  Edit distance, affine gap distance, Smith-Waterman distance, Jaro distance metric, q-gram distance
  Advantages: work well for estimating the distance between strings that differ due to typographical errors or abbreviations
  Disadvantages: expensive and less accurate for larger strings
Token-based metrics
  Atomic strings, WHIRL, q-grams with tf.idf
  View strings as "bags of tokens", disregarding the order in which the tokens occur

WHIRL
Separate each string into words; each word w is assigned a tf.idf weight.
For example: rare terms such as "AT&T" or "IBM" will have higher weights, while common terms such as "Inc" will have lower weights.
The cosine similarity of s1 and s2 is defined as the normalized dot product of their word-weight vectors.
"John Smith" and "Mr. John Smith" would have similarity close to one.
Problem: "Compter Science Department" and "Deprtment of Computer Scence" will have zero similarity, because the misspelled strings share no words.
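
A small sketch of word-level tf.idf cosine similarity shows both behaviours described above. The weight log(tf + 1) × log(N / df) is one common tf.idf variant and is an assumption here, as are the tiny `CORPUS` and all function names, which are hypothetical and exist only to make the example runnable.

```python
import math
from collections import Counter

# Hypothetical collection used only to estimate document frequencies.
CORPUS = [
    "John Smith",
    "Mr. John Smith",
    "Mr. Bob Jones",
    "Mr. Alan Turing",
    "Compter Science Department",
    "Deprtment of Computer Scence",
]


def tfidf_vector(text: str, corpus: list) -> dict:
    """Map each word of text to log(tf + 1) * log(N / df)."""
    tf = Counter(text.lower().split())
    n = len(corpus)
    vector = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc.lower().split())
        if df:  # words unseen in the corpus are skipped
            vector[word] = math.log(count + 1) * math.log(n / df)
    return vector


def cosine(v1: dict, v2: dict) -> float:
    """Dot product of the two weight vectors divided by their norms."""
    dot = sum(w * v2.get(word, 0.0) for word, w in v1.items())
    norm = (math.sqrt(sum(w * w for w in v1.values()))
            * math.sqrt(sum(w * w for w in v2.values())))
    return dot / norm if norm else 0.0


def whirl_similarity(s1: str, s2: str) -> float:
    return cosine(tfidf_vector(s1, CORPUS), tfidf_vector(s2, CORPUS))


print(round(whirl_similarity("John Smith", "Mr. John Smith"), 3))
print(whirl_similarity("Compter Science Department",
                       "Deprtment of Computer Scence"))
```

In this toy corpus "Mr." appears in half of the strings and so carries little weight, which keeps the first pair close to one; the misspelled pair scores zero because no word is shared.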

Q-grams with tf.idf
Extends the WHIRL system to handle spelling errors by using q-grams, instead of words, as tokens.
A spelling error minimally affects the set of common q-grams of two strings, so "Gteway Communications" and "Comunications Gateway" have high similarity under this metric.
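
The effect of switching from word tokens to q-gram tokens can be sketched as follows. For brevity this illustration weights q-grams by raw term frequency only and leaves out the idf factor, which is a simplification of the method named on the slide; q = 2 and the function names are likewise assumptions.

```python
import math
from collections import Counter


def qgram_counts(s: str, q: int = 2) -> Counter:
    """Multiset of q-grams of the lower-cased string."""
    s = s.lower()
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def qgram_cosine(s1: str, s2: str, q: int = 2) -> float:
    """Cosine similarity between q-gram frequency vectors (tf only, no idf)."""
    c1, c2 = qgram_counts(s1, q), qgram_counts(s2, q)
    dot = sum(count * c2[gram] for gram, count in c1.items())
    norm = (math.sqrt(sum(v * v for v in c1.values()))
            * math.sqrt(sum(v * v for v in c2.values())))
    return dot / norm if norm else 0.0


a, b = "Gteway Communications", "Comunications Gateway"
shared_words = set(a.lower().split()) & set(b.lower().split())
print(shared_words)                   # no word survives the misspellings
print(round(qgram_cosine(a, b), 2))   # yet most q-grams are shared
```

Word order does not matter here either, since the q-grams are collected into a bag; only the few q-grams that span the misspelled positions or the word boundary differ.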

Similarity methods (summary)
  String similarity
    Character-based (shorter strings): edit distance, affine gap distance, Smith-Waterman distance, Jaro distance metric, q-gram distance
      Suited to: prefixes, suffixes, abbreviations, minor variations, typographical errors
    Token-based (longer strings; disregarding the order): atomic strings, WHIRL, q-grams with tf.idf
  Numeric similarity
    Treated as strings

Challenges (1)
Which similarity method should be chosen for a particular data domain?
Current work on string similarity mainly adopts the edit distance method. Because different fields have different features, accurate similarity computation requires an appropriate string similarity metric for each database field with respect to the particular data domain.
[Figure: the similarity methods taxonomy (character-based, token-based, and numeric data treated as strings).]

Challenges (2)
Current search engines treat numbers as strings, ignoring their numeric values.
For example, a search for 6798.32 on Google yielded two pages that correctly associate this number with the lunar nutation cycle. However, a search for 6798.320 on Google found no page. Searches for 6798.320 on AltaVista, AOL, HotBot, Lycos, MSN, Netscape, Overture, and Yahoo! also did not find any page about the lunar nutation cycle.
[Figure: the similarity methods taxonomy, with numeric similarity treated as strings.]
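
The difference between the two readings of 6798.32 and 6798.320 can be shown in a few lines. This is only an illustration of the problem, using Python's standard `decimal` module; the helper name `same_number` is made up for the example.

```python
from decimal import Decimal


def same_number(a: str, b: str) -> bool:
    """Compare two numerals by numeric value rather than by characters."""
    return Decimal(a) == Decimal(b)


print("6798.32" == "6798.320")            # string comparison: False
print(same_number("6798.32", "6798.320"))  # numeric comparison: True
```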

Numeric Data Similarity Measures
Features to compare:
  Relative value
  A set of discrete numbers vs. another set of discrete numbers (like q-gram distance)
  A range value vs. another range value (overlap degree)
  The maximal and the average value of the numbers
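
The slide names these features without giving formulas, so the following is only one possible reading. All three functions and their definitions (relative difference, Jaccard overlap of number sets, and overlap degree of closed ranges) are assumptions made for illustration.

```python
def relative_similarity(a: float, b: float) -> float:
    """1 - |a - b| / max(|a|, |b|), clamped to [0, 1]; equal values score 1."""
    if a == b:
        return 1.0
    return max(0.0, 1.0 - abs(a - b) / max(abs(a), abs(b)))


def set_similarity(xs: set, ys: set) -> float:
    """Shared numbers divided by all numbers, analogous to q-gram set overlap."""
    union = xs | ys
    return len(xs & ys) / len(union) if union else 1.0


def range_overlap(r1: tuple, r2: tuple) -> float:
    """Length of the intersection of two ranges divided by their total span."""
    intersection = max(0.0, min(r1[1], r2[1]) - max(r1[0], r2[0]))
    span = max(r1[1], r2[1]) - min(r1[0], r2[0])
    return intersection / span if span > 0 else 1.0


print(relative_similarity(100, 90))
print(set_similarity({1, 2, 3}, {2, 3, 4}))
print(range_overlap((10, 20), (15, 30)))
```

Unlike string comparison, each of these measures depends on magnitudes: 100 and 90 are close, whereas as strings they share only one character position.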

Challenges (3)
The existing similarity methods address lexical heterogeneity; semantic heterogeneity is not covered by them.
[Figure: the similarity methods taxonomy, labelled "lexical heterogeneity", set against "semantic heterogeneity".]

Example
Lexical heterogeneity:
  "Smith" vs. "Smoth"
  "44 W. 4th St." vs. "44 West Fourth Street"
  "Abroms" vs. "Abrams"
  "John R.Smith" vs. "Johathan Richard Smith."
  "Lampley" vs. "Campley"
Semantic heterogeneity:
  "John Smith" vs. "Smith, John."
  "President of the U.S." vs. "George W. Bush"
  "Prof.John R.Smith" vs. "John R.Smith, Prof"
  "1 ATT Way Bedminster NJ" vs. "900 Route 202/206 Bedminster NJ"
  "Departure" vs. "Leaving from"

Semantic heterogeneity WordNet

Semantic heterogeneity: WordNet
Semantic relationships: synonymy, hyponymy/hypernymy, meronymy
WordNet 1.7.1 entries:
  Noun       109195
  Verb        11088
  Adjective   21460
  Adverb       4607
  Total      146350
Semantic relationships are constructed manually.

Conclusion
Similarity measure is an essential point in data integration.
Similarity measures will be applied to: keyword search, schema matching, and result merge.
Open questions:
  Which string similarity method should be chosen from the existing ones for a particular data domain?
  We need effective numeric data similarity measures.
  We need ontology tools to resolve semantic heterogeneity.