Record Linkage Tutorial: Distance Metrics for Text William W. Cohen CALD.


Record linkage tutorial: review Introduction: definition and terms, etc. Overview of Fellegi-Sunter –Unsupervised classification of pairs Main issues in the Fellegi-Sunter model –Modeling, efficiency, decision-making, string distance metrics and normalization (sometimes claimed to be all one needs; almost nobody does record linkage without it) Outside the F-S model –Search for good matches using arbitrary preferences: database hardening (Cohen et al., KDD 2000), citation matching (Pasula et al., NIPS 2002)

String distance metrics: overview Term-based (e.g. TF/IDF as in WHIRL) –Distance depends on set of words contained in both s and t. Edit-distance metrics –Distance is shortest sequence of edit commands that transform s to t. Pair HMM based metrics –Probabilistic extension of edit distance Other metrics

String distance metrics: term-based Term-based (e.g. TFIDF as in WHIRL) –Distance between s and t based on the set of words appearing in both s and t. –Order of words is not relevant: e.g., "Cohen, William" = "William Cohen" and "James Joyce" = "Joyce James" –Words are usually weighted so common words count less: e.g., "Brown" counts less than "Zubinsky" Analogous to Fellegi-Sunter's Method 1

String distance metrics: term-based Advantages: –Exploits frequency information –Efficiency: Finding { t : sim(t,s)>k } is sublinear! –Alternative word orderings ignored (William Cohen vs Cohen, William) Disadvantages: –Sensitive to spelling errors (Willliam Cohon) –Sensitive to abbreviations (Univ. vs University) –Alternative word orderings ignored (James Joyce vs Joyce James, City National Bank vs National City Bank)
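The term-based idea can be sketched as TF-IDF weighted cosine similarity. This is a minimal illustration, not WHIRL's actual implementation; the function names and the toy corpus are mine:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # docs: list of token lists; returns a unit-length TF-IDF vector (dict) per doc
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))    # document frequency
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        v = {w: tf[w] * math.log(n / df[w]) for w in tf}  # rare words weigh more
        norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
        vecs.append({w: x / norm for w, x in v.items()})
    return vecs

def cosine(u, v):
    # dot product of two sparse unit vectors
    return sum(x * v.get(w, 0.0) for w, x in u.items())

v = tfidf_vectors([["william", "cohen"], ["cohen", "william"], ["zubinsky", "brown"]])
```

Since each record is reduced to a bag of weighted words, `cosine(v[0], v[1])` is 1.0 regardless of word order, which illustrates both the advantage and the disadvantage noted above.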

String distance metrics: Levenshtein Edit-distance metrics –Distance is the shortest sequence of edit commands that transforms s to t. –Simplest set of operations: Copy a character from s over to t (cost 0) Delete a character in s (cost 1) Insert a character in t (cost 1) Substitute one character for another (cost 1) –This is "Levenshtein distance"

Levenshtein distance - example: distance("William Cohen", "Willliam Cohon")

s:  WILL-IAM_COHEN
t:  WILLLIAM_COHON
op: CCCCICCCCCCCSC

(C = copy, I = insert, S = substitute; the "-" marks the gap for the insertion.) Total cost = 2.

Computing Levenshtein distance - 1
D(i,j) = score of best alignment from s1..si to t1..tj = min of:
  D(i-1,j-1),   if si=tj    // copy
  D(i-1,j-1)+1, if si!=tj   // substitute
  D(i-1,j)+1                // insert
  D(i,j-1)+1                // delete

Computing Levenshtein distance - 2
D(i,j) = score of best alignment from s1..si to t1..tj = min of:
  D(i-1,j-1) + d(si,tj)   // subst/copy
  D(i-1,j)+1              // insert
  D(i,j-1)+1              // delete
(simplify by letting d(c,d)=0 if c=d, 1 else); also let D(i,0)=i (for i inserts) and D(0,j)=j

Computing Levenshtein distance - 3
D(i,j) = min of:
  D(i-1,j-1) + d(si,tj)   // subst/copy
  D(i-1,j)+1              // insert
  D(i,j-1)+1              // delete
For s = MCCOHN, t = COHEN:

         C  O  H  E  N
      0  1  2  3  4  5
   M  1  1  2  3  4  5
   C  2  1  2  3  4  5
   C  3  2  2  3  4  5
   O  4  3  2  3  4  5
   H  5  4  3  2  3  4
   N  6  5  4  3  3  3

The bottom-right cell gives D(s,t) = 3.
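The recurrence above translates directly into a small dynamic program (a sketch in Python; the function name is mine):

```python
def levenshtein(s, t):
    # D[i][j] = best score aligning s[:i] with t[:j]; D(i,0)=i, D(0,j)=j
    D = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        D[i][0] = i
    for j in range(len(t) + 1):
        D[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1   # d(c,d) = 0 if c=d, 1 else
            D[i][j] = min(D[i - 1][j - 1] + d,  # subst/copy
                          D[i - 1][j] + 1,      # insert
                          D[i][j - 1] + 1)      # delete
    return D[len(s)][len(t)]
```

`levenshtein("MCCOHN", "COHEN")` reproduces the table's bottom-right value of 3, and `levenshtein("William Cohen", "Willliam Cohon")` gives 2, matching the earlier alignment example.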

Computing Levenshtein distance – 4
D(i,j) = min of:
  D(i-1,j-1) + d(si,tj)   // subst/copy
  D(i-1,j)+1              // insert
  D(i,j-1)+1              // delete
(Same table as above, each cell annotated with which case produced its min.) A trace indicates where the min value came from, and can be used to find the edit operations and/or a best alignment (there may be more than one).

Needleman-Wunsch distance
D(i,j) = min of:
  D(i-1,j-1) + d(si,tj)   // subst/copy
  D(i-1,j) + G            // insert
  D(i,j-1) + G            // delete
G = "gap cost"; d(c,d) is an arbitrary distance function on characters (e.g. related to typo frequencies, amino acid substitutability, etc.) Example pair: "William Cohen" vs "Wukkuan Cigeb"
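The same dynamic program, generalized to take an arbitrary character distance d and gap cost G, might look like this (a sketch; names are mine):

```python
def needleman_wunsch(s, t, d, G=1):
    # d(c1, c2): arbitrary per-character distance; G: gap cost
    D = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        D[i][0] = i * G
    for j in range(1, len(t) + 1):
        D[0][j] = j * G
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            D[i][j] = min(D[i - 1][j - 1] + d(s[i - 1], t[j - 1]),  # subst/copy
                          D[i - 1][j] + G,                          # insert
                          D[i][j - 1] + G)                          # delete
    return D[len(s)][len(t)]
```

With the 0/1 distance and G = 1 this reduces to Levenshtein distance; a d based on keyboard adjacency would score the "Wukkuan Cigeb" example (each character shifted one key) as quite close to "William Cohen".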

Smith-Waterman distance - 1
D(i,j) = max of:
  0                       // start over
  D(i-1,j-1) - d(si,tj)   // subst/copy
  D(i-1,j) - G            // insert
  D(i,j-1) - G            // delete
Distance is the maximum over all i,j in the table of D(i,j)

Smith-Waterman distance - 2
D(i,j) = max of:
  0                       // start over
  D(i-1,j-1) - d(si,tj)   // subst/copy
  D(i-1,j) - G            // insert
  D(i,j-1) - G            // delete
With G = 1, d(c,c) = -2, d(c,d) = +1, for s = MCCOHN, t = COHEN:

         C  O  H  E  N
      0  0  0  0  0  0
   M  0  0  0  0  0  0
   C  0  2  1  0  0  0
   C  0  2  1  0  0  0
   O  0  1  4  3  2  1
   H  0  0  3  6  5  4
   N  0  0  2  5  5  7

The best cell scores 7, for the local alignment of COH-N against COHEN.

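A sketch of this local-alignment scoring in Python, keeping the slide's convention that d values are subtracted (so d(c,c) = -2 rewards a match by +2); the function and parameter names are mine:

```python
def smith_waterman(s, t, G=1, d_match=-2, d_mismatch=1):
    # slide's convention: d is *subtracted*, so d_match = -2 adds 2 per match
    best = 0
    D = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            d = d_match if s[i - 1] == t[j - 1] else d_mismatch
            D[i][j] = max(0,                    # start over
                          D[i - 1][j - 1] - d,  # subst/copy
                          D[i - 1][j] - G,      # insert
                          D[i][j - 1] - G)      # delete
            best = max(best, D[i][j])
    return best  # maximum over all i,j in the table
```

Because every cell may "start over" at 0, only the best-matching local region counts: `smith_waterman("MCCOHN", "COHEN")` scores 7 for the shared COH…N region, ignoring the leading MC.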

Smith-Waterman distance - 5: for s = "cohendorf", t = "mccohnski", the best local alignment covers the shared "cohn"/"cohen" region, giving dist = 5. (The scoring table from the original slide is omitted.)


Smith-Waterman distance in Monge & Elkan's WEBFIND (1996) Used a standard version of Smith-Waterman with hand-tuned weights for inserts and character substitutions. Split large text fields by separators like commas, etc., and found the minimal cost over all possible pairings of the subfields (since S-W assigns a large cost to large transpositions). Result was competitive with plausible competitors.

Results: S-W from Monge & Elkan

Affine gap distances Smith-Waterman fails on some pairs that seem quite similar: William W. Cohen vs. William W. 'Don't call me Dubya' Cohen Intuitively, a single long insertion is "cheaper" than a lot of short insertions Intuitively, "are springlest hulongru poinstertimon extisn't" cheaper than a lot of short insertions (the same sentence, scrambled by many short insertions)

Affine gap distances - 2 Idea: –Current cost of a "gap" of n characters: nG –Make this cost: A + (n-1)B, where A is the cost of "opening" a gap and B is the cost of "continuing" a gap.

Affine gap distances - 3
D(i,j) = max of:
  D(i-1,j-1) + d(si,tj)    // subst/copy
  IS(i-1,j-1) + d(si,tj)
  IT(i-1,j-1) + d(si,tj)
IS(i,j) = max of:          // best score in which si is aligned with a 'gap'
  D(i-1,j) - A
  IS(i-1,j) - B
IT(i,j) = max of:          // best score in which tj is aligned with a 'gap'
  D(i,j-1) - A
  IT(i,j-1) - B
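The three-matrix recurrence above (Gotoh's algorithm) can be sketched as follows. The function name, the default A and B, and the test scoring function are mine, not from the slides:

```python
NEG = float("-inf")

def affine_gap_score(s, t, d, A=2, B=1):
    # d(c1, c2): score *added* for aligning two characters (higher = more similar);
    # A: cost of opening a gap, B: cost of continuing one
    n, m = len(s), len(t)
    D  = [[NEG] * (m + 1) for _ in range(n + 1)]   # si aligned with tj
    IS = [[NEG] * (m + 1) for _ in range(n + 1)]   # si aligned with a gap
    IT = [[NEG] * (m + 1) for _ in range(n + 1)]   # tj aligned with a gap
    D[0][0] = 0.0
    for i in range(1, n + 1):
        IS[i][0] = -A - (i - 1) * B    # leading run of s against gaps
    for j in range(1, m + 1):
        IT[0][j] = -A - (j - 1) * B    # leading run of t against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = max(D[i-1][j-1], IS[i-1][j-1], IT[i-1][j-1]) + d(s[i-1], t[j-1])
            IS[i][j] = max(D[i-1][j] - A,    # open a gap
                           IS[i-1][j] - B)   # continue a gap
            IT[i][j] = max(D[i][j-1] - A,
                           IT[i][j-1] - B)
    return max(D[n][m], IS[n][m], IT[n][m])
```

With a scoring function giving +2 per match and -1 per mismatch, a single gap of length 2 costs A + B = 3, while two separate length-1 gaps cost 2A = 4, so one long insertion scores better than several short ones, which is exactly the intuition motivating affine gaps.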

Affine gap distances - 4 (State diagram: states D, IS, IT. Moving from D into IS or IT costs A, the gap-open penalty; staying in IS or IT costs B, the gap-extend penalty; edges back into D add d(si,tj).)

Affine gap distances – experiments (from McCallum, Nigam, Ungar KDD2000) Goal is to match citation data like this (example records shown in the original slide).

Affine gap distances – experiments (from McCallum, Nigam, Ungar KDD2000) Hand-tuned edit distance Lower costs for affine gaps Even lower cost for affine gaps near a "." HMM-based normalization to group title, author, booktitle, etc. into fields

Affine gap distances – experiments Results table (numbers lost in transcription) compared TFIDF, Edit Distance, and Adaptive metrics on the Cora, OrgName, Restaurant, and Parks datasets.

String distance metrics: outline Term-based (e.g. TF/IDF as in WHIRL) –Distance depends on set of words contained in both s and t. Edit-distance metrics –Distance is shortest sequence of edit commands that transform s to t. Pair HMM based metrics –Probabilistic extension of edit distance Other metrics

Affine gap distances as automata (State diagram as before: states D, IS, IT; edge weights -d(si,tj) into D, -A for opening a gap, -B for continuing one.)

Generative version of affine gap automata (Bilenko & Mooney, TechReport 02) HMM emits pairs: (c,d) in state M, pairs (c,-) in state D, and pairs (-,d) in state I. For each state there is a multinomial distribution on pairs. The HMM can be trained with EM from a sample of pairs of matched strings (s,t): the E-step is forward-backward; the M-step uses some ad hoc smoothing

Generative version of affine gap automata (Bilenko & Mooney, TechReport 02) Using the automaton: it is converted to operation costs for an affine-gap model as shown before. Why? The model assigns probability less than 1 even to exact matches, and probability decreases monotonically with length => longer exact matches are scored lower. This is approximately the opposite of the frequency-based heuristic

Affine gap edit-distance learning: experimental results (Bilenko & Mooney) Experimental method: parse records into fields; append a few key fields together; sort by similarity; pick a threshold T and call all pairs with distance(s,t) < T "duplicates", choosing T to maximize F-measure.

Affine gap edit-distance learning: experiments results (Bilenko & Mooney)

Precision/recall for MAILING dataset duplicate detection

String distance metrics: outline Term-based (e.g. TF/IDF as in WHIRL) –Distance depends on set of words contained in both s and t. Edit-distance metrics –Distance is shortest sequence of edit commands that transform s to t. Pair HMM based metrics –Probabilistic extension of edit distance Other metrics

Jaro metric The Jaro metric is (apparently) tuned for personal names: –Given (s,t), define c to be common in s,t if si=c, tj=c, and |i-j| < min(|s|,|t|)/2. –Define c,d to be a transposition if c,d are common and c,d appear in different orders in s and t. –Jaro(s,t) = average of #common/|s|, #common/|t|, and (#common - 0.5 #transpositions)/#common –Variant: weight errors early in the string more heavily Easy to compute – note edit distance is O(|s||t|) NB. This is my interpretation of Winkler's description
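A sketch of the Jaro similarity in Python. This follows the standard formulation (matching window of half the longer string's length, minus one), which differs slightly from the slide's min(|s|,|t|)/2 description; the function name is mine:

```python
def jaro(s, t):
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    # characters are "common" if equal and close enough in position
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_used = [False] * len(s)
    t_used = [False] * len(t)
    m = 0                                  # number of common characters
    for i, c in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_used[j] and t[j] == c:
                s_used[i] = t_used[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # transpositions: common characters appearing in different orders, halved
    s_common = [c for i, c in enumerate(s) if s_used[i]]
    t_common = [c for j, c in enumerate(t) if t_used[j]]
    transpositions = sum(a != b for a, b in zip(s_common, t_common)) / 2
    return (m / len(s) + m / len(t) + (m - transpositions) / m) / 3
```

On the classic example pair "MARTHA"/"MARHTA" (one transposed pair) this gives about 0.944, and unlike edit distance it runs in roughly linear time in the string lengths.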

Soundex metric Soundex is a coarse phonetic indexing scheme, widely used in genealogy. Every Soundex code consists of a letter and three digits, e.g. B-536 for "Bender". The letter is always the first letter of the surname; the digits hash together the rest of the name. Vowels are generally ignored: e.g. Lee, Lu => L. Consonants after the first three coded ones are ignored. Similar-sounding letters (e.g. B, P, F, V) are not differentiated, nor are doubled letters. There are lots of Soundex variants….
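A sketch of one common variant (American Soundex, where H and W do not break a run of identical codes) in Python; the function name and code table layout are mine:

```python
CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}

def soundex(name):
    # assumes a non-empty alphabetic surname
    name = name.upper()
    first = name[0]
    digits = []
    prev = CODES.get(first, "")        # the initial letter's code also starts a run
    for ch in name[1:]:
        if ch in "HW":
            continue                   # H and W neither emit nor break a run
        code = CODES.get(ch)
        if code is None:               # vowel: emits nothing but breaks the run
            prev = ""
            continue
        if code != prev:               # doubled/adjacent same-code letters collapse
            digits.append(code)
        prev = code
    return (first + "".join(digits) + "000")[:4]
```

This reproduces the slide's example: `soundex("Bender")` is `"B536"`, and `soundex("Lee")` pads out to `"L000"` since only vowels follow the initial.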

N-gram metric –Idea: split every string s into the set of all character n-grams that appear in s, for n<=k. Then use term-based approaches. –e.g. "COHEN" => {C,O,H,E,N,CO,OH,HE,EN,COH,OHE,HEN} (for k=3) –For n=4 or 5, this is competitive on retrieval tasks. It doesn't seem to be competitive with small values of n on matching tasks (but it's useful as a fast approximate matching scheme)
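A minimal sketch of the n-gram decomposition, paired with plain Jaccard overlap as the term-based similarity (the slides would more likely weight the n-grams TF-IDF style; the function names are mine):

```python
def ngrams(s, k=3):
    # all character n-grams of s for n <= k (including the single characters)
    return {s[i:i + n] for n in range(1, k + 1) for i in range(len(s) - n + 1)}

def ngram_jaccard(a, b, k=3):
    # term-based similarity over the two n-gram sets
    A, B = ngrams(a, k), ngrams(b, k)
    return len(A & B) / len(A | B) if A | B else 1.0
```

`ngrams("COHEN")` reproduces the set shown above, and because the spelling error only perturbs a few n-grams, "COHEN" remains far closer to "COHON" than to an unrelated name, making this usable as a fast approximate matcher.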

String distance metrics: overview Term-based (e.g. as in WHIRL) –Very little comparative study for linkage/matching –Vast literature for clustering, retrieval, etc. Edit-distance metrics –Consider this a quick tour; there is a cottage industry in bioinformatics Pair HMM based metrics –Trainable from matched pairs (new work!) –Odd behavior when you consider frequency issues Other metrics –Mostly somewhat task-specific