A Comparison of String Matching Distance Metrics for Name-Matching Tasks William Cohen, Pradeep Ravikumar, Stephen Fienberg.

Motivating Example
List of people and some attributes compiled by one source
Updates by another source need to be merged
Need to locate matching records
Forcing an exact match is not sufficient
  – Typographical errors (letter “B” vs. letter “V”)
  – Scanning errors (letter “I” vs. numeral “1”)
  – Such errors exceed 20% in some cases
Deciding when two records match → deciding when two strings (or words) are identical

History – String Matching
Statistics
  – Treat as a classification problem [Fellegi & Sunter]
  – Use of other prior knowledge
    String represented as a feature vector
Databases
  – No prior knowledge
    Use of distance functions – edit distance, Monge & Elkan, TFIDF
  – Knowledge-intensive approaches
    User interaction [Hernandez & Stolfo]
Artificial Intelligence
  – Learn the parameters of the edit distance functions
  – Combine the results of different distance functions
Compare string matching distance functions for the task of name matching

Edit Distance
Number of edit operations needed to go from string s to string t
Operations: insert, delete, substitute
Levenshtein: assigns unit cost to each operation
  – Distance (“smile”, “mile”) = 1
  – Distance (“meet”, “meat”) = 1
Computed by dynamic programming
Reordering of words can be misleading
  – “Cohen, William” vs. “William Cohen”
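
The dynamic-programming computation mentioned on this slide can be sketched as follows. This is a minimal illustration of unit-cost (Levenshtein) edit distance, not the authors' implementation; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) via dynamic programming."""
    # prev[j] is the distance between the first i-1 characters of s
    # and the first j characters of t.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs from s
                curr[j - 1] + 1,           # insert ct into s
                prev[j - 1] + (cs != ct),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("smile", "mile"))
    print(levenshtein("meet", "meat"))
    print(levenshtein("Cohen, William", "William Cohen"))
```

Both slide examples come out at distance 1, while the reordered name pair scores as far apart even though it refers to the same person.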

Edit Distance
Monge-Elkan: assigns a relatively lower cost to a sequence of insertions or deletions
  – A + B*(n – 1) for n insertions or deletions (B < A)
Other methods assign decreasing costs to subsequent insertions
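
As a small worked illustration of the gap-cost formula above: the first insertion or deletion in a run costs A and each further one costs B. The values A = 1.0 and B = 0.5 below are hypothetical; the slide only requires B < A.

```python
def gap_cost(n: int, a: float = 1.0, b: float = 0.5) -> float:
    """Cost of a run of n consecutive insertions or deletions: A + B*(n - 1)."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return 0.0  # no gap, no cost
    return a + b * (n - 1)


if __name__ == "__main__":
    for n in range(1, 5):
        # Compare against the unit-cost model, where a run of n edits costs n.
        print(n, gap_cost(n), "vs unit cost", n)
```

With these hypothetical values a run of three deletions costs 2.0 rather than 3, so a dropped middle name is penalized less than three scattered typos.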

Edit Distance
Jaro (s, t)
  – Let s’ be the characters in s that are common with t
  – Let t’ be the characters in t that are common with s
  – Let T (s’, t’) be half the number of transpositions for s’ and t’
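
The transcript does not preserve the formula that combines these quantities. The sketch below uses the commonly cited form of the Jaro metric, (|s’|/|s| + |t’|/|t| + (|s’| – T)/|s’|) / 3, which is assumed here rather than taken from the slide. The matching window of max(|s|, |t|) // 2 – 1 positions is likewise the usual convention and an assumption of this sketch.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # Assumption: characters count as "common" only within this window.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    t_used = [False] * len(t)
    s_common = []
    for i, ch in enumerate(s):
        lo, hi = max(0, i - window), min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_used[j] and t[j] == ch:
                t_used[j] = True
                s_common.append(ch)
                break
    if not s_common:
        return 0.0
    # Common characters of t, in the order they occur in t.
    t_common = [t[j] for j in range(len(t)) if t_used[j]]
    # T(s', t'): half the number of positions where s' and t' disagree.
    half_transpositions = sum(a != b for a, b in zip(s_common, t_common)) / 2
    m = len(s_common)  # |s'| == |t'| under one-to-one matching
    return (m / len(s) + m / len(t) + (m - half_transpositions) / m) / 3


if __name__ == "__main__":
    print(round(jaro("MARTHA", "MARHTA"), 4))
```

For “MARTHA” vs. “MARHTA” all six characters are common and one pair is transposed, so the score is (1 + 1 + 5/6) / 3 = 17/18.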

Improvements to Jaro
McLaughlin
  – Exact match – weight of 1.0
  – Similar characters – weight of 0.3
    Scanning error (“I” vs. “1”)
    Typographical error (“B” vs. “V”)
Pollock and Zamora
  – Error rates increase as the position in the string moves to the right
  – Adjust the output of Jaro by a fixed amount depending on how many of the first 4 characters match

Term Based
Treat strings s & t as bags S and T of words
Examples
  – Jaccard similarity = |S∩T| / |S∪T|
  – TFIDF
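
A minimal sketch of the Jaccard similarity above. It treats S and T as sets of lowercase word tokens; the tokenizer (runs of word characters, punctuation dropped) and the helper name `tokens` are assumptions made for this example.

```python
import re


def tokens(s: str) -> set:
    """Lowercase word tokens; punctuation is dropped (an assumption of this sketch)."""
    return set(re.findall(r"\w+", s.lower()))


def jaccard(s: str, t: str) -> float:
    """|S ∩ T| / |S ∪ T| over the word sets of s and t."""
    S, T = tokens(s), tokens(t)
    if not S and not T:
        return 1.0  # two empty strings are treated as identical
    return len(S & T) / len(S | T)


if __name__ == "__main__":
    print(jaccard("Cohen, William", "William Cohen"))
    print(jaccard("William Cohen", "William Cohon"))
```

Word order has no effect on the score, but a single misspelled word removes that word from the intersection entirely.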

Term Based
Words may be weighted to make the common words count less
Advantages
  – Exploits frequency information
  – Ordering of words doesn’t matter (Cohen, William vs. William Cohen)
Disadvantages
  – Sensitive to errors in spelling (Cohen vs. Cohon) and abbreviations (Univ. vs. University)
  – Ordering of words ignored (City National Bank vs. National City Bank)
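
The word weighting described above can be sketched with a small TFIDF cosine similarity. The slide does not give the weighting formula; the log(TF + 1) · log(N / DF) weights, the tokenizer, and the function names below are assumptions of this sketch.

```python
import math
import re
from collections import Counter


def tokenize(s: str) -> list:
    """Lowercase word tokens (an assumption; the slides do not fix a tokenizer)."""
    return re.findall(r"\w+", s.lower())


def tfidf_vectors(names: list) -> list:
    """One unit-length TFIDF vector (dict word -> weight) per name."""
    docs = [Counter(tokenize(n)) for n in names]
    n_docs = len(docs)
    # Document frequency: number of names containing each word.
    df = Counter(w for d in docs for w in d)
    vectors = []
    for d in docs:
        raw = {w: math.log(tf + 1) * math.log(n_docs / df[w]) for w, tf in d.items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        vectors.append({w: v / norm for w, v in raw.items()} if norm else {})
    return vectors


def cosine(u: dict, v: dict) -> float:
    """Dot product of two unit-length sparse vectors."""
    return sum(x * v[w] for w, x in u.items() if w in v)


if __name__ == "__main__":
    names = ["City National Bank", "National City Bank",
             "William Cohen", "William Cohon"]
    vecs = tfidf_vectors(names)
    print(cosine(vecs[0], vecs[1]))  # same words, different order
    print(cosine(vecs[2], vecs[3]))  # one misspelled word
```

In this toy corpus the two bank names are indistinguishable, while the misspelled surname leaves only the more common word “William” to contribute, which is exactly the pair of disadvantages listed on the slide.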

Hybrid Distance Functions
Recursive Matching
  – Let s = (a_1, a_2, …, a_K) and t = (b_1, b_2, …, b_L)
  – Sim’ is the level-two matching function
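
The combining formula is missing from the transcript. A common form of recursive matching scores each a_i by its best match among the b_j under Sim’ and averages over the K substrings of s; the sketch below assumes that form. Whitespace tokenization and the use of `difflib.SequenceMatcher` as a stand-in for Sim’ are also assumptions made for illustration.

```python
from difflib import SequenceMatcher


def default_sim2(a: str, b: str) -> float:
    """Stand-in level-two similarity (not one of the metrics compared in the slides)."""
    return SequenceMatcher(None, a, b).ratio()


def recursive_match(s: str, t: str, sim2=default_sim2) -> float:
    """Average, over the tokens a_i of s, of the best sim2(a_i, b_j) over tokens b_j of t."""
    a_tokens, b_tokens = s.split(), t.split()
    if not a_tokens or not b_tokens:
        return 0.0
    best = [max(sim2(a, b) for b in b_tokens) for a in a_tokens]
    return sum(best) / len(a_tokens)


if __name__ == "__main__":
    print(recursive_match("William Cohen", "Cohen William"))
    print(recursive_match("William Cohon", "Cohen William"))
```

Because each token is matched independently, word order is ignored as in the term-based methods, yet a misspelled token still earns partial credit from the level-two function. Note that this form is not symmetric in s and t.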

Blocking / Pruning Methods
Comparing all pairs is too expensive when lists are large
A pair (s, t) is a candidate for a match if s and t share some substring v that appears in at most a fraction f of all names
Using a v of length 4 and f = 1% finds, on average, 99% of the correct pairs
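
The candidate-pair rule above can be sketched with an inverted index from substrings to the names that contain them. This is an illustrative sketch, not the authors' code; the function name and the lowercase normalization are assumptions, and the tiny example list needs a much larger f than the 1% quoted on the slide.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(names: list, q: int = 4, f: float = 0.01) -> set:
    """Index pairs (i, j), i < j, of names sharing a length-q substring
    that occurs in at most a fraction f of all names."""
    index = defaultdict(set)  # substring -> indices of names containing it
    for i, name in enumerate(names):
        text = name.lower()  # assumption: case-insensitive matching
        for k in range(len(text) - q + 1):
            index[text[k:k + q]].add(i)
    limit = f * len(names)
    pairs = set()
    for ids in index.values():
        if len(ids) <= limit:  # skip substrings that are too common to be selective
            pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    names = ["William Cohen", "Wiliam Cohen",
             "Stephen Fienberg", "Pradeep Ravikumar"]
    print(candidate_pairs(names, q=4, f=0.5))
```

Only the pairs returned here would be passed to a distance function, so the expensive comparison runs on a small fraction of all possible pairs.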

Results - Metric
Output of each algorithm is a list of candidate pairs ranked by distance
Non-interpolated average precision of a ranking
Other metrics used
  – Interpolated precision
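
Non-interpolated average precision can be sketched as the mean, over the correct pairs, of the precision at the rank where each correct pair appears. The slide does not spell out the formula, so this standard definition is an assumption; the optional `total_correct` parameter is a hypothetical addition for rankings that omit some correct pairs.

```python
def average_precision(ranked_correct: list, total_correct: int = None) -> float:
    """Non-interpolated average precision of a ranking.

    ranked_correct[i] is True when the pair at rank i + 1 is a true match.
    total_correct is the number of true matches overall; when omitted,
    the number found in the ranking is used.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_correct in enumerate(ranked_correct, start=1):
        if is_correct:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    n = hits if total_correct is None else total_correct
    return precision_sum / n if n else 0.0


if __name__ == "__main__":
    # Correct pairs at ranks 1 and 3: (1/1 + 2/3) / 2
    print(average_precision([True, False, True]))
```

A ranking that places every correct pair ahead of every incorrect one scores 1.0, and each incorrect pair ranked above a correct one lowers the score.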

Results - Matching
Term based: TFIDF most accurate
Edit distance based: Monge-Elkan most accurate
Jaro as accurate as Monge-Elkan, but much faster
Combine TFIDF and Jaro