Distance functions and IE - 3 William W. Cohen CALD.

Announcements
–No meeting this Wed, March 24
–Thu, March 25: talk from Carlos Guestrin on max-margin Markov nets
  –Newell-Simon Hall 1507 at 9:30am
  –no wait! Make that Wean Hall 4625
–Writeups:
  –today: “distance metrics for text” (three papers)

Record linkage: definition
Record linkage: determine if pairs of data records describe the same entity
–I.e., find record pairs that are co-referent
–Entities: usually people (or organizations or…)
–Data records: names, addresses, job titles, birth dates, …
Main applications:
–Joining two heterogeneous relations
–Removing duplicates from a single relation
–Storing results of information extraction in a database, or answering queries that involve information extracted from different places
Key step: measuring similarity of two strings
–TFIDF metric (WHIRL)
–Edit distance (Monge-Elkan)

The data integration problem

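Levenshtein distance - example
distance(“William Cohen”, “Willliam Cohon”)
  s:     W I L L - I A M _ C O H E N
  t:     W I L L L I A M _ C O H O N
  op:    C C C C I C C C C C C C S C
  cost:  0 0 0 0 1 1 1 1 1 1 1 1 2 2
The s and t rows form the alignment; “-” marks the gap inserted into s. Ops: C = copy, I = insert, S = substitute. The cost row is cumulative, so the distance is 2.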

Computing Levenshtein distance
D(i,j) = min of:
  D(i-1,j-1) + d(si,tj)   //subst/copy
  D(i-1,j) + 1            //insert
  D(i,j-1) + 1            //delete

      C  O  H  E  N
  M   1  2  3  4  5
  C   1  2  3  4  5
  C   2  2  3  4  5
  O   3  2  3  4  5
  H   4  3  2  3  4
  N   5  4  3  3  3   = D(s,t)
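The recurrence above can be sketched in a few lines of Python. This is a minimal illustration, assuming unit costs (d(si,tj) is 0 for a copy and 1 for a substitution); the function name `levenshtein` is chosen here for illustration and is not taken from the slides.

```python
def levenshtein(s: str, t: str) -> int:
    """Unit-cost edit distance, filling the D(i,j) table row by row."""
    # prev holds row i-1 of the table; row 0 is D(0,j) = j.
    prev = list(range(len(t) + 1))
    for i in range(1, len(s) + 1):
        cur = [i] + [0] * len(t)          # D(i,0) = i
        for j in range(1, len(t) + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            cur[j] = min(prev[j - 1] + d,  # subst/copy
                         prev[j] + 1,      # insert
                         cur[j - 1] + 1)   # delete
        prev = cur
    return prev[len(t)]


print(levenshtein("MCCOHN", "COHEN"))
print(levenshtein("William Cohen", "Willliam Cohon"))
```

Only two rows are kept in memory at a time, since each cell depends only on the current and previous row.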

Smith-Waterman distance

      c  o  h  e  n  d  o  r  f
  m   0  0  0  0  0  0  0  0  0
  c   1  0  0  0  0  0  0  0  0
  c   0  0  0  0  0  0  0  0  0
  o   0  2  1  0  0  0  2  1  0
  h   0  1  4  3  2  1  1  1  0
  n   0  0  3  3  5  4  3  2  1
  s   0  0  2  2  4  4  3  2  1
  k   0  0  1  1  3  3  3  2  1
  i   0  0  0  0  2  2  2  2  1

dist=5
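For readers who want to experiment, here is a minimal sketch of the standard Smith-Waterman local-alignment score: the same dynamic program as Levenshtein, but maximizing a score, flooring every cell at 0, and taking the best cell anywhere in the table. The scoring parameters (match +2, mismatch -1, gap -1) are assumptions made for this sketch; the slide does not state its scoring scheme, so the output should not be expected to reproduce the table above.

```python
def smith_waterman(s: str, t: str, match: int = 2,
                   mismatch: int = -1, gap: int = 1) -> int:
    """Best local-alignment score between s and t (linear gap cost)."""
    prev = [0] * (len(t) + 1)
    best = 0
    for i in range(1, len(s) + 1):
        cur = [0] * (len(t) + 1)
        for j in range(1, len(t) + 1):
            d = match if s[i - 1] == t[j - 1] else mismatch
            cur[j] = max(0,                 # start a new local alignment
                         prev[j - 1] + d,   # subst/copy
                         prev[j] - gap,     # insert
                         cur[j - 1] - gap)  # delete
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman("mccohnski", "cohendorf"))
```

Because of the 0 floor, unrelated prefixes and suffixes ("mc", "ski", "dorf") do not penalize the score of the shared region.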

Affine gap distances - 3
Single-matrix recurrence (every gap character costs 1):
D(i,j) = max of:
  D(i-1,j-1) + d(si,tj)   //subst/copy
  D(i-1,j) - 1            //insert
  D(i,j-1) - 1            //delete
With affine gaps, the insert and delete cases move into IS and IT, and D becomes:
D(i,j) = max of:
  D(i-1,j-1) + d(si,tj)
  IS(i-1,j-1) + d(si,tj)
  IT(i-1,j-1) + d(si,tj)
IS(i,j) = max of:   (best score in which si is aligned with a ‘gap’)
  D(i-1,j) - A
  IS(i-1,j) - B
IT(i,j) = max of:   (best score in which tj is aligned with a ‘gap’)
  D(i,j-1) - A
  IT(i,j-1) - B
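The three-matrix recurrence can be sketched as follows. In the recurrences, A is charged when a gap is opened (coming from D) and B when it is extended (coming from IS or IT). The parameter values, the 0 floor on D (Smith-Waterman style local alignment), and the function name are assumptions for this sketch, not details given on the slide.

```python
def affine_gap_score(s: str, t: str, match: int = 2, mismatch: int = -1,
                     A: int = 2, B: int = 1):
    """Best local score with gap-open cost A and gap-extend cost B."""
    n, m = len(s), len(t)
    NEG = float("-inf")
    # D: si aligned with tj; IS: si aligned with a gap; IT: tj aligned with a gap.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    IS = [[NEG] * (m + 1) for _ in range(n + 1)]
    IT = [[NEG] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = max(0,  # assumed floor: allow a fresh local alignment
                          D[i - 1][j - 1] + d,
                          IS[i - 1][j - 1] + d,
                          IT[i - 1][j - 1] + d)
            IS[i][j] = max(D[i - 1][j] - A, IS[i - 1][j] - B)
            IT[i][j] = max(D[i][j - 1] - A, IT[i][j - 1] - B)
            best = max(best, D[i][j])
    return best


print(affine_gap_score("cohn", "cohen"))
```

With B smaller than A, one long gap is cheaper than several short ones, which suits abbreviations and dropped middle names.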

Inference in WHIRL
Explode p(X1,X2,X3): find all DB tuples <a1,a2,a3> for p and bind Xi to ai.
Constrain X~Y: if X is bound to a and Y is unbound,
–find DB column C to which Y should be bound
–pick a term t in X, find proper inverted index for t in C, and bind Y to something in that index
Keep track of t’s used previously, and don’t allow Y to contain one.

String distance metrics so far...
Term-based (e.g. TF/IDF as in WHIRL)
–Distance depends on set of words contained in both s and t, so it is sensitive to spelling errors.
–Usually weight words to account for “importance”
–Fast comparison: O(n log n) for |s|+|t|=n
Edit-distance metrics
–Distance is shortest sequence of edit commands that transform s to t.
–No notion of word importance
–More expensive: O(n^2)
Other metrics
–Jaro metric & variants
–Monge-Elkan’s recursive string matching
–etc?
Which metrics work best, for which problems?

Jaro metric
Jaro metric is (apparently) tuned for personal names:
–Given (s,t), define c to be common in s,t if si=c, tj=c, and |i-j| < min(|s|,|t|)/2.
–Define c,d to be a transposition if c,d are common and c,d appear in different orders in s and t.
–Jaro(s,t) = average of #common/|s|, #common/|t|, and 1 - 0.5*#transpositions/#common
–Variant: weight errors early in string more heavily
Fast to compute
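A minimal sketch of the Jaro computation, using the matching window |i-j| < min(|s|,|t|)/2 from the slide. It uses the usual form of the third term, 1 - 0.5*#transpositions/#common, so that identical strings score 1. The greedy left-to-right matching of common characters and the function name are assumptions of this sketch.

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity in [0,1]; 1.0 means identical strings."""
    if not s or not t:
        return 0.0
    window = min(len(s), len(t)) / 2
    t_used = [False] * len(t)
    s_common = []
    # Greedily pair each character of s with an unused equal character of t
    # that lies inside the window.
    for i, c in enumerate(s):
        for j, d in enumerate(t):
            if not t_used[j] and c == d and abs(i - j) < window:
                t_used[j] = True
                s_common.append(c)
                break
    if not s_common:
        return 0.0
    t_common = [t[j] for j in range(len(t)) if t_used[j]]
    # Count positions where the two common-character sequences disagree.
    transpositions = sum(a != b for a, b in zip(s_common, t_common))
    m = len(s_common)
    return (m / len(s) + m / len(t) + (m - 0.5 * transpositions) / m) / 3


print(round(jaro("MARTHA", "MARHTA"), 4))
```

The double loop is quadratic in the worst case but stays fast in practice because personal names are short.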

Jaro metric

Winkler-Jaro metric

So which metric should you use?
SecondString (Cohen, Ravikumar, Fienberg):
–Java toolkit of string-matching methods from AI, Statistics, IR and DB communities
–Tools for evaluating performance on test data
–Exploratory tool for adding, testing, combining string distances
  –e.g. SecondString implements a generic “Winkler rescorer” which can rescale any distance function with range of [0,1]
–URL: http://secondstring.sourceforge.net
–Distribution also includes several sample matching problems.
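To illustrate what a generic Winkler-style rescorer might look like, here is a small Python sketch. It applies the usual Jaro-Winkler adjustment (boost the score by the length of the common prefix, capped at 4 characters, with scaling factor 0.1) to any similarity function with range [0,1]. This is not SecondString's Java API; the function names and default parameters are assumptions for illustration.

```python
from typing import Callable


def common_prefix_len(s: str, t: str, max_len: int = 4) -> int:
    """Length of the shared prefix of s and t, capped at max_len."""
    n = 0
    for a, b in zip(s, t):
        if a != b or n == max_len:
            break
        n += 1
    return n


def winkler_rescore(sim: Callable[[str, str], float], s: str, t: str,
                    p: float = 0.1, max_prefix: int = 4) -> float:
    """Rescale sim(s,t) upward when s and t agree on their first characters."""
    base = sim(s, t)
    prefix = common_prefix_len(s, t, max_prefix)
    # Move the score a fraction prefix*p of the way toward 1.
    return base + prefix * p * (1.0 - base)


# Usage with a toy similarity that always returns 0.5 (hypothetical).
print(winkler_rescore(lambda a, b: 0.5, "abcd", "abcf"))
```

Because the adjustment only needs a score in [0,1], the same wrapper can sit on top of Jaro or any other normalized similarity.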

SecondString distance functions
Edit-distance like:
–Levenshtein (unit costs)
–untuned Smith-Waterman
–Monge-Elkan (tuned Smith-Waterman)
–Jaro and Jaro-Winkler
–Less ad hoc Jaro variants
Term-based:
–TFIDF
–Jaccard distance: |S ∩ T| / |S ∪ T|

SecondString distance functions
Edit-distance like:
–Levenshtein (unit costs)
–untuned Smith-Waterman
–Monge-Elkan (tuned Smith-Waterman)
–Jaro and Jaro-Winkler

Results - Edit Distances
Monge-Elkan is the best on average.

Edit distances

SecondString distance functions
Term-based, for sets of terms S and T:
–TFIDF distance
–Jaccard distance: |S ∩ T| / |S ∪ T|
–Language models: construct P_S and P_T and use a distance between them (e.g. Jensen-Shannon)

SecondString distance functions
Term-based, for sets of terms S and T:
–TFIDF distance
–Jaccard distance
–Jensen-Shannon distance
  –smoothing toward union of S,T reduces cost of disagreeing on common terms
  –variants: unsmoothed P_S, Dirichlet smoothing, Jelinek-Mercer
–“Simplified Fellegi-Sunter”
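Two of the term-based measures are simple enough to sketch directly. The code below computes the Jaccard score |S ∩ T| / |S ∪ T| and the standard Jensen-Shannon divergence between unsmoothed maximum-likelihood token distributions P_S and P_T. Log base 2 is used so the divergence lies in [0,1]; the smoothing variants from the slide are not implemented, and how SecondString scales its own score is not stated here, so treat this as an illustration only.

```python
import math
from collections import Counter


def jaccard(s_tokens, t_tokens) -> float:
    """|S ∩ T| / |S ∪ T| over the sets of tokens."""
    S, T = set(s_tokens), set(t_tokens)
    if not S and not T:
        return 1.0
    return len(S & T) / len(S | T)


def jensen_shannon(s_tokens, t_tokens) -> float:
    """JS divergence (base 2) between unsmoothed token distributions."""
    cs, ct = Counter(s_tokens), Counter(t_tokens)
    ns, nt = sum(cs.values()), sum(ct.values())
    total = 0.0
    for w in set(cs) | set(ct):
        ps, pt = cs[w] / ns, ct[w] / nt
        mid = 0.5 * (ps + pt)              # mixture distribution M
        if ps > 0:
            total += 0.5 * ps * math.log2(ps / mid)
        if pt > 0:
            total += 0.5 * pt * math.log2(pt / mid)
    return total


a = "william w cohen".split()
b = "william cohen".split()
print(jaccard(a, b), jensen_shannon(a, b))
```

Jaccard is a similarity (1 for identical token sets), while the Jensen-Shannon value is a divergence (0 for identical distributions).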

Results – Token Distances

SecondString distance functions
Hybrid term-based & edit-distance based:
–Monge-Elkan’s “recursive matching scheme”, segmenting strings at token boundaries (rather than separators like commas)
–SoftTFIDF
  –Like TFIDF, but consider not just tokens in both S and T, but also tokens in S “close to” something in T (“close to” relative to some distance metric)
  –Downweight close tokens slightly
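The SoftTFIDF idea can be sketched as follows: compute normalized TFIDF weights for both token lists, then for each token in S find its closest token in T under a secondary similarity, and if that similarity clears a threshold, add the product of the two weights scaled by the similarity (the "downweight close tokens slightly" step). Several things here are assumptions for this sketch: the weighting log(TF+1)*log(N/DF), the threshold of 0.8, the helper names, and the use of Python's `difflib.SequenceMatcher` as a stand-in secondary similarity.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def char_sim(a: str, b: str) -> float:
    """Stand-in secondary similarity in [0,1] (an assumption of this sketch)."""
    return SequenceMatcher(None, a, b).ratio()


def tfidf_vector(tokens, doc_freq, n_docs):
    """Unit-length TFIDF weights; unseen tokens are treated as DF = 1."""
    tf = Counter(tokens)
    w = {tok: math.log(tf[tok] + 1) * math.log(n_docs / doc_freq.get(tok, 1))
         for tok in tf}
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {tok: x / norm for tok, x in w.items()} if norm > 0 else w


def soft_tfidf(s_tokens, t_tokens, doc_freq, n_docs, theta=0.8, sim=char_sim):
    vs = tfidf_vector(s_tokens, doc_freq, n_docs)
    vt = tfidf_vector(t_tokens, doc_freq, n_docs)
    score = 0.0
    for w, weight_s in vs.items():
        # Closest token in T to w, with its similarity.
        best_v, best_sim = max(((v, sim(w, v)) for v in vt),
                               key=lambda pair: pair[1], default=(None, 0.0))
        if best_v is not None and best_sim >= theta:
            score += weight_s * vt[best_v] * best_sim
    return score


# Toy corpus (hypothetical data) used only to derive document frequencies.
corpus = [["william", "cohen"], ["willliam", "cohen"],
          ["carlos", "guestrin"], ["william", "smith"]]
doc_freq = Counter(tok for doc in corpus for tok in set(doc))
exact = lambda a, b: 1.0 if a == b else 0.0
print(soft_tfidf(corpus[0], corpus[1], doc_freq, len(corpus)))
print(soft_tfidf(corpus[0], corpus[1], doc_freq, len(corpus), sim=exact))
```

Passing an exact-match similarity reduces the score to plain TFIDF cosine, which makes the effect of the "close to" relaxation easy to compare on the misspelled token.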

Results – Hybrid distances

Results - Overall

Prospective test on two clustering tasks

An anomalous dataset

An anomalous dataset: census
