Download presentation
Presentation is loading. Please wait.
Published byChastity Elliott Modified over 9 years ago
1
Distance functions and IE William W. Cohen CALD
2
Announcements March 25 Thus – talk from Carlos Guestrin (Assistant Prof in Cald as of fall 2004) on max-margin Markov nets –time constraints? Writeups: –nothing today –“distance metrics for text” – three papers due next Monday, 3/22
3
Record linkage: definition Record linkage: determine if pairs of data records describe the same entity –I.e., find record pairs that are co-referent –Entities: usually people (or organizations or…) –Data records: names, addresses, job titles, birth dates, … Main applications: –Joining two heterogeneous relations –Removing duplicates from a single relation
4
Record linkage: terminology The term “record linkage” is possibly co- referent with: –For DB people: data matching, merge/purge, duplicate detection, data cleansing, ETL (extraction, transfer, and loading), de-duping –For AI/ML people: reference matching, database hardening, object consolidation, –In NLP: co-reference/anaphora resolution –Statistical matching, clustering, language modeling, …
5
Motivation Q: What does this have to do with IE? A: Quite a lot, actually... Webfind (Monge & Elkan, 1995)
6
Finding a technical paper c. 1995 Start with citation: " Experience With a Learning Personal Assistant", T.M. Mitchell, R. Caruana, D. Freitag, J. McDermott, and D. Zabowski, Communications of the ACM, Vol. 37, No. 7, pp. 81-91, July 1994. Find author’s institution (w/ INSPEC) Find web host (w/ NETFIND) Find author’s home page and (hopefully) the paper by browsing
7
Automatically finding a technical paper c. 1995 with WebFind Start with citation: " Experience With a Learning Personal Assistant", T.M. Mitchell, R. Caruana, D. Freitag, J. McDermott, and D. Zabowski, Communications of the ACM, Vol. 37, No. 7, pp. 81-91, July 1994. Find author’s institution (w/ automated search against INSPEC) Find web host (w/ auto search on NETFIND) Find author’s home page and (hopefully) the paper by heuristic spidering
8
The data integration problem
9
Control flow (modulo details about querying –Extract (author, department) pairs from DB1 –Extract (department,www server) pairs from DB2 –Execute the two-step plan to get paper: author -> department -> wwwServer –two steps means matching (linking, integrating, deduping,....) department names in DB1/DB2 –issues are completely different if user is executing a one-step plan: one-step plan is retrieval
10
String distance metrics: Levenshtein Edit-distance metrics –Distance is shortest sequence of edit commands that transform s to t. –Simplest set of operations: Copy character from s over to t Delete a character in s (cost 1) Insert a character in t (cost 1) Substitute one character for another (cost 1) –This is “Levenshtein distance”
11
Levenshtein distance - example distance(“William Cohen”, “Willliam Cohon”) WILLIAM_COHEN WILLLIAM_COHON CCCCICCCCCCCSC 00001111111122 s t op cost alignment
12
Levenshtein distance - example distance(“William Cohen”, “Willliam Cohon”) WILLIAM_COHEN WILLLIAM_COHON CCCCICCCCCCCSC 00001111111122 s t op cost alignment gap
13
Computing Levenshtein distance - 1 D(i,j) = score of best alignment from s1..si to t1..tj = min D(i-1,j-1), if si=tj //copy D(i-1,j-1)+1, if si!=tj //substitute D(i-1,j)+1 //insert D(i,j-1)+1 //delete
14
Computing Levenshtein distance - 2 D(i,j) = score of best alignment from s1..si to t1..tj = min D(i-1,j-1) + d(si,tj) //subst/copy D(i-1,j)+1 //insert D(i,j-1)+1 //delete (simplify by letting d(c,d)=0 if c=d, 1 else) also let D(i,0)=i (for i inserts) and D(0,j)=j
15
Computing Levenshtein distance - 3 D(i,j)= min D(i-1,j-1) + d(si,tj) //subst/copy D(i-1,j)+1 //insert D(i,j-1)+1 //delete COHEN M12345 C12345 C22345 O32345 H43234 N54333 = D(s,t)
16
Computing Levenshtein distance – 4 D(i,j) = min D(i-1,j-1) + d(si,tj) //subst/copy D(i-1,j)+1 //insert D(i,j-1)+1 //delete COHEN M 1 2345 C 1 2345 C 2 3345 O 3 2 345 H 43 2 3 4 N 5433 3 A trace indicates where the min value came from, and can be used to find edit operations and/or a best alignment (may be more than 1)
17
Needleman-Wunch distance D(i,j) = min D(i-1,j-1) + d(si,tj) //subst/copy D(i-1,j) + G //insert D(i,j-1) + G //delete d(c,d) is an arbitrary distance function on characters (e.g. related to typo frequencies, amino acid substitutibility, etc) William Cohen Wukkuan Cigeb G = “gap cost”
18
Smith-Waterman distance - 1 D(i,j) = max 0 //start over D(i-1,j-1) - d(si,tj) //subst/copy D(i-1,j) - G //insert D(i,j-1) - G //delete Distance is maximum over all i,j in table of D(i,j)
19
Smith-Waterman distance - 2 D(i,j) = max 0 //start over D(i-1,j-1) - d(si,tj) //subst/copy D(i-1,j) - G //insert D(i,j-1) - G //delete G = 1 d(c,c) = -2 d(c,d) = +1 COHEN M -2-3-4-5 C+1 0-2-3 C 0 0-2-3 O +2 +1 0 H -2+1 +4+3 +2 N -3 0+3 +5
20
Smith-Waterman distance - 3 D(i,j) = max 0 //start over D(i-1,j-1) - d(si,tj) //subst/copy D(i-1,j) - G //insert D(i,j-1) - G //delete G = 1 d(c,c) = -2 d(c,d) = +1 COHEN M 00000 C+1 0000 C 0 0000 O 0 +2 +1 00 H 0 +4+3 +2 N 0 0+3 +5
21
Smith-Waterman distance - 5 c o h e n d o r f m 0 0 0 0 0 0 0 0 0 c 1 0 0 0 0 0 0 0 0 c 0 0 0 0 0 0 0 0 0 o 0 2 1 0 0 0 2 1 0 h 0 1 4 3 2 1 1 1 0 n 0 0 3 3 5 4 3 2 1 s 0 0 2 2 4 4 3 2 1 k 0 0 1 1 3 3 3 2 1 i 0 0 0 0 2 2 2 2 1 dist=5
22
Smith-Waterman distance: Monge & Elkan’s WEBFIND (1996)
23
Smith-Waterman distance in Monge & Elkan’s WEBFIND (1996) Used a standard version of Smith-Waterman with hand-tuned weights for inserts and character substitutions. Split large text fields by separators like commas, etc, and explore different pairings (since S-W assigns a large cost to large transpositions) Result competitive with plausible competitors.
24
Smith-Waterman distance in Monge & Elkan’s WEBFIND (1996) String s=A 1 A 2... A K, string t=B 1 B 2... B L sim’ is editDistance scaled to [0,1] Monge-Elkan’s “recursive matching scheme” is average maximal similarity of A i to B j:
25
Smith-Waterman distance: Monge & Elkan’s WEBFIND (1996) computer science department stanford university palo alto california Dept. of Comput. Sci. Stanford Univ. CA USA 0.51 0.92 0.5 1.0
29
Results: S-W from Monge & Elkan
30
Affine gap distances Smith-Waterman fails on some pairs that seem quite similar: William W. Cohen William W. ‘Don’t call me Dubya’ Cohen Intuitively, a single long insertion is “cheaper” than a lot of short insertions Intuitively, are springlest hulongru poinstertimon extisn’t “cheaper” than a lot of short insertions
31
Affine gap distances - 2 Idea: –Current cost of a “gap” of n characters: nG –Make this cost: A + (n-1)B, where A is cost of “opening” a gap, and B is cost of “continuing” a gap.
32
Affine gap distances - 3 D(i,j) = max D(i-1,j-1) + d(si,tj) //subst/copy D(i-1,j)-1 //insert D(i,j-1)-1 //delete IS(i,j) = max D(i-1,j) - A IS(i-1,j) - B IT(i,j) = max D(i,j-1) - A IT(i,j-1) - B Best score in which si is aligned with a ‘gap’ Best score in which tj is aligned with a ‘gap’ D(i-1,j-1) + d(si,tj) IS(I-1,j-1) + d(si,tj) IT(I-1,j-1) + d(si,tj)
33
Affine gap distances - 4 -B -d(si,tj) D IS IT -d(si,tj) -A
34
Affine gap distances – experiments ( from McCallum,Nigam,Ungar KDD2000) Goal is to match data like this:
35
Affine gap distances – experiments ( from McCallum,Nigam,Ungar KDD2000) Hand-tuned edit distance Lower costs for affine gaps Even lower cost for affine gaps near a “.” HMM-based normalization to group title, author, booktitle, etc into fields
36
Affine gap distances – experiments TFIDFEdit Distance Adaptive Cora 0.7510.839 0.945 0.721 0.964 OrgName10.9250.6330.923 0.3660.9500.776 Orgname20.9580.5710.958 0.7780.9120.984 Restaurant0.9810.8271.000 0.9670.8670.950 Parks0.9760.9670.984 0.967
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.