A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance
Andrew McCallum, Kedar Bellare, Fernando Pereira


1 A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance
Andrew McCallum, Kedar Bellare, Fernando Pereira
Thanks to Charles Sutton, Xuerui Wang and Mikhail Bilenko for helpful discussions.

2 String Edit Distance
Distance between sequences x and y:
– “cost” of the lowest-cost sequence of edit operations that transform string x into y.

3 String Edit Distance
Distance between sequences x and y:
– “cost” of the lowest-cost sequence of edit operations that transform string x into y.
Applications
– Database Record Deduplication
    Apex International Hotel | Grassmarket Street
    Apex Internat’l | Grasmarket Street
    Are these records duplicates of the same hotel?

4 String Edit Distance
Distance between sequences x and y:
– “cost” of the lowest-cost sequence of edit operations that transform string x into y.
Applications
– Database Record Deduplication
– Biological Sequences
    AGCTCTTACGATAGAGGACTCCAGA
    AGGTCTTACCAAAGAGGACTTCAGA

5 String Edit Distance
Distance between sequences x and y:
– “cost” of the lowest-cost sequence of edit operations that transform string x into y.
Applications
– Database Record Deduplication
– Biological Sequences
– Machine Translation
    Il a acheté une pomme
    He bought an apple

6 String Edit Distance
Distance between sequences x and y:
– “cost” of the lowest-cost sequence of edit operations that transform string x into y.
Applications
– Database Record Deduplication
– Biological Sequences
– Machine Translation
– Textual Entailment
    He bought a new car last night
    He purchased a brand new automobile yesterday evening

7 Levenshtein Distance [1966]
Edit operations
  copy    Copy a character from x to y            (cost 0)
  insert  Insert a character into y               (cost 1)
  delete  Delete a character from y               (cost 1)
  subst   Substitute one character for another    (cost 1)
Align two strings
  x1 = William W. Cohon
  x2 = Willleam Cohen
Lowest cost alignment
  x1          W  i  l  l  -  i  a  m  _  W  .  _  C  o  h  o  n
  x2          W  i  l  l  l  e  a  m  -  -  -  _  C  o  h  e  n
  operation   c  c  c  c  i  s  c  c  d  d  d  c  c  c  c  s  c
  cost        0  0  0  0  1  1  0  0  1  1  1  0  0  0  0  1  0
  (c = copy, s = subst, i = insert, d = delete; "-" marks a gap, "_" a space)
Total cost = 6 = Levenshtein Distance

8 Levenshtein Distance
Edit operations
  copy    Copy a character from x to y            (cost 0)
  insert  Insert a character into y               (cost 1)
  delete  Delete a character from y               (cost 1)
  subst   Substitute one character for another    (cost 1)
Dynamic program
  D(i,j) = score of best alignment from x1 ... xi to y1 ... yj.

  $$D(i,j) = \min \begin{cases} D(i-1,j-1) + \delta(x_i \neq y_j) \\ D(i-1,j) + 1 \\ D(i,j-1) + 1 \end{cases}$$

         W  i  l  l  l  e  a  m
      0  1  2  3  4  5  6  7  8
   W  1  0  1  2  3  4  5  6  7
   i  2  1  0  1  2  3  4  5  6
   l  3  2  1  0  1  2  3  4  5
   l  4  3  2  1  0  1  2  3  4
   i  5  4  3  2  1  1  2  3  4
   a  6  5  4  3  2  2  2  2  3
   m  7  6  5  4  3  3  3  3  2

  The lowest-cost path through the table contains one insert and one subst.
  total cost = distance
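The recurrence on this slide can be sketched in a few lines of Python. This is a minimal illustration with unit costs; the function name `levenshtein` is our own choice, not something from the talk.

```python
def levenshtein(x: str, y: str) -> int:
    """Unit-cost edit distance between x and y via the D(i, j) recurrence."""
    # D[i][j] = cost of the best alignment of x[:i] with y[:j].
    D = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        D[i][0] = i  # only characters of x consumed so far
    for j in range(1, len(y) + 1):
        D[0][j] = j  # only characters of y consumed so far
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (x[i - 1] != y[j - 1]),  # copy (0) or subst (1)
                D[i - 1][j] + 1,                           # consume a character of x only
                D[i][j - 1] + 1,                           # consume a character of y only
            )
    return D[len(x)][len(y)]


print(levenshtein("William", "Willleam"))
print(levenshtein("William W. Cohon", "Willleam Cohen"))
```

The two calls correspond to the table on this slide and to the full name pair aligned on the previous slide.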

9 Levenshtein Distance with Markov Dependencies
Edit operations (cost after a c i d s)
  copy    Copy a character from x to y            0 0 0 0
  insert  Insert a character into y               1 1 1
  delete  Delete a character from y               1 1 1
  subst   Substitute one character for another    1 1 1 1
  [Remaining cost entries on the slide: 1 2 1 2; their cells are not recoverable.]
– Learn these costs from training data
– 3D DP table
– Repeated delete is cheaper

         W  i  l  l  l  e  a  m
      0  1  2  3  4  5  6  7  8
   W  1  0  1  2  3  4  5  6  7
   i  2  1  0  1  2  3  4  5  6
   l  3  2  1  0  1  2  3  4  5
   l  4  3  2  1  0  1  2  3  4
   i  5  4  3  2  1  1  2  3  4
   a  6  5  4  3  2  2  2  2  3
   m  7  6  5  4  3  3  3  3  2

  [Table annotations: subst, insert, delete]
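One way to picture the "3D DP table" is to add the previous operation as a third index, so that the cost of an operation can depend on what came before it. The sketch below is an illustration under our own assumptions: the function names and the cost values (including the 0.5 for a repeated delete) are hypothetical and are not the costs shown on the slide.

```python
import math

OPS = ("copy", "subst", "insert", "delete")


def uniform_costs(repeat_delete=1.0):
    """Hypothetical cost table: cost[prev][op], with prev possibly 'start'."""
    base = {"copy": 0.0, "subst": 1.0, "insert": 1.0, "delete": 1.0}
    cost = {prev: dict(base) for prev in ("start",) + OPS}
    cost["delete"]["delete"] = repeat_delete  # a delete right after a delete
    return cost


def markov_edit_distance(x, y, cost):
    """Cheapest alignment when each operation's cost depends on the previous one."""
    n, m = len(x), len(y)
    if n == 0 and m == 0:
        return 0.0
    # D[i][j][op]: cheapest alignment of x[:i] with y[:j] whose last operation is op.
    D = [[{op: math.inf for op in OPS} for _ in range(m + 1)] for _ in range(n + 1)]

    def best_into(i, j, op):
        # Cheapest way to reach cell (i, j) and then apply op.
        if i == 0 and j == 0:
            return cost["start"][op]
        return min(D[i][j][prev] + cost[prev][op] for prev in OPS)

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                op = "copy" if x[i - 1] == y[j - 1] else "subst"
                D[i][j][op] = best_into(i - 1, j - 1, op)
            if i > 0:
                D[i][j]["delete"] = best_into(i - 1, j, "delete")
            if j > 0:
                D[i][j]["insert"] = best_into(i, j - 1, "insert")
    return min(D[n][m].values())


pair = ("William W. Cohon", "Willleam Cohen")
print(markov_edit_distance(*pair, uniform_costs()))
print(markov_edit_distance(*pair, uniform_costs(repeat_delete=0.5)))
```

With all costs independent of the previous operation, this reduces to ordinary Levenshtein distance; making a repeated delete cheaper lowers the cost of dropping a run such as " W." in one go.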

10 Ristad & Yianilos (1997)
Essentially a Pair-HMM, generating an edit/state/alignment sequence and two strings.

  string 1    W  i  l  l  -  i  a  m  _  W  .  _  C  o  h  o  n
  a.i1        1  2  3  4  4  5  6  7  8  9 10 11 12 13 14 15 16
  a.e         c  c  c  c  i  s  c  c  d  d  d  c  c  c  c  s  c
  a.i2        1  2  3  4  5  6  7  8  8  8  8  9 10 11 12 13 14
  string 2    W  i  l  l  l  e  a  m  -  -  -  _  C  o  h  e  n
  (alignment = rows a.i1, a.e, a.i2; c = copy, s = subst, i = insert, d = delete)

– complete data likelihood
– incomplete data likelihood (sum over all alignments consistent with x1 and x2)
– Match score =
Learn via EM:
– Expectation step: Calculate likelihood of alignment paths
– Maximization step: Make those paths more likely.
Given training set of matching string pairs, objective fn is
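The "sum over all alignments consistent with x1 and x2" can be computed with a forward dynamic program rather than by enumeration. The following is a minimal sketch of that idea for a memoryless pair model; `forward_likelihood`, `toy_delta`, and the probability values are hypothetical placeholders (they are not normalized over a real alphabet and are not parameters from the talk).

```python
def forward_likelihood(x1, x2, delta, p_end):
    """Sum, over every alignment of x1 with x2, of the product of edit probabilities.

    delta(a, b) is the probability of emitting the pair (a, b);
    a is None for an insert, b is None for a delete.
    """
    n, m = len(x1), len(x2)
    # alpha[i][j] = total probability of generating x1[:i] and x2[:j].
    alpha = [[0.0] * (m + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0:
                alpha[i][j] += alpha[i - 1][j] * delta(x1[i - 1], None)           # delete
            if j > 0:
                alpha[i][j] += alpha[i][j - 1] * delta(None, x2[j - 1])           # insert
            if i > 0 and j > 0:
                alpha[i][j] += alpha[i - 1][j - 1] * delta(x1[i - 1], x2[j - 1])  # copy/subst
    return alpha[n][m] * p_end


def toy_delta(a, b):
    """Illustrative edit probabilities only."""
    if a is None or b is None:
        return 0.01
    return 0.03 if a == b else 0.001


print(forward_likelihood("Cohon", "Cohen", toy_delta, p_end=0.1))
```

In an EM setting, the same table (together with a matching backward pass) supplies the expected counts of each edit operation for the E-step.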

11 Ristad & Yianilos Regrets
Limited features of input strings
– Examine only single character pair at a time
– Difficult to use upcoming string context, lexicons, ...
– Example: “Senator John Green” “John Green”
Limited edit operations
– Difficult to generate arbitrary jumps in both strings
– Example: “UMass” “University of Massachusetts”
Trained only on positive match data
– Doesn’t include information-rich “near misses”
– Example: “ACM SIGIR” ≠ “ACM SIGCHI”
So, consider a model trained by conditional probability

12 Conditional Probability (Sequence) Models
We prefer a model that is trained to maximize a conditional probability rather than a joint probability: P(y|x) instead of P(y,x):
– Can examine features, but not responsible for generating them.
– Don’t have to explicitly model their dependencies.

13 From HMMs to Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
Joint
  [Figure: graphical model over states y_{t-1}, y_t, y_{t+1} and observations x_{t-1}, x_t, x_{t+1}]
Conditional
  [Figure: graphical model over the same states and observations]
  Linear-chain (A super-special case of Conditional Random Fields.)
Set parameters by maximum likelihood, using an optimization method on δL.
Wide-spread interest, positive experimental results in many applications:
– Noun phrase, Named entity [HLT’03], [CoNLL’03]
– Protein structure prediction [ICML’04]
– IE from Bioinformatics text [Bioinformatics ‘04], …
– Asian word segmentation [COLING’04], [ACL’04]
– IE from Research papers [HLT’04]
– Object classification in images [CVPR ‘04]

14 (Linear Chain) Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
Undirected graphical model, trained to maximize conditional probability of output sequence given input sequence.
  [Figure: finite state model and graphical model; FSM states y_{t-1}, y_t, y_{t+1}, y_{t+2}, y_{t+3} over observations x_{t-1}, x_t, x_{t+1}, x_{t+2}, x_{t+3}]
  output seq:  OTHER  PERSON  OTHER  ORG        TITLE  …
  input seq:   said   Jones   a      Microsoft  VP     …
Wide-spread interest, positive experimental results in many applications:
– Noun phrase, Named entity [HLT’03], [CoNLL’03]
– Protein structure prediction [ICML’04]
– IE from Bioinformatics text [Bioinformatics ‘04], …
– Asian word segmentation [COLING’04], [ACL’04]
– IE from Research papers [HLT’04]
– Object classification in images [CVPR ‘04]

15 CRF String Edit Distance

  string 1    W  i  l  l  -  i  a  m  _  W  .  _  C  o  h  o  n
  a.i1        1  2  3  4  4  5  6  7  8  9 10 11 12 13 14 15 16
  a.e         c  c  c  c  i  s  c  c  d  d  d  c  c  c  c  s  c
  a.i2        1  2  3  4  5  6  7  8  8  8  8  9 10 11 12 13 14
  string 2    W  i  l  l  l  e  a  m  -  -  -  _  C  o  h  e  n
  (alignment = rows a.i1, a.e, a.i2; c = copy, s = subst, i = insert, d = delete)

– joint complete data likelihood
– conditional complete data likelihood
Want to train from set of string pairs, each labeled one of {match, non-match}
  match       “William W. Cohon”    “Willleam Cohen”
  non-match   “Bruce D’Ambrosio”    “Bruce Croft”
  match       “Tommi Jaakkola”      “Tommi Jakola”
  match       “Stuart Russell”      “Stuart Russel”
  non-match   “Tom Dietterich”      “Tom Dean”

16 CRF String Edit Distance FSM
[Figure: finite-state machine with states subst, insert, delete, copy]

17 CRF String Edit Distance FSM
[Figure: Start state; match states (m = 1): subst, insert, delete, copy; non-match states (m = 0): subst, insert, delete, copy]
conditional incomplete data likelihood

18 CRF String Edit Distance FSM
[Figure: Start state; match states (m = 1): subst, insert, delete, copy; non-match states (m = 0): subst, insert, delete, copy]
x1 = “Tommi Jaakkola”
x2 = “Tommi Jakola”
– Probability summed over all alignments in match states: 0.8
– Probability summed over all alignments in non-match states: 0.2

19 CRF String Edit Distance FSM
[Figure: Start state; match states (m = 1): subst, insert, delete, copy; non-match states (m = 0): subst, insert, delete, copy]
x1 = “Tom Dietterich”
x2 = “Tom Dean”
– Probability summed over all alignments in match states: 0.1
– Probability summed over all alignments in non-match states: 0.9
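The two "probability summed over all alignments" quantities can be sketched as two log-space forward passes, one per set of states, followed by normalization. The code below is a deliberately simplified illustration: it uses one weight per edit operation in each state set and no transition or context features, and all names and weight values are hypothetical rather than learned parameters from the talk.

```python
import math


def logsumexp(values):
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def log_alignment_sum(x1, x2, w):
    """log of the sum, over all alignments, of exp(total weight of the alignment)."""
    n, m = len(x1), len(x2)
    A = [[-math.inf] * (m + 1) for _ in range(n + 1)]
    A[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            terms = []
            if i > 0:
                terms.append(A[i - 1][j] + w["delete"])
            if j > 0:
                terms.append(A[i][j - 1] + w["insert"])
            if i > 0 and j > 0:
                op = "copy" if x1[i - 1] == x2[j - 1] else "subst"
                terms.append(A[i - 1][j - 1] + w[op])
            A[i][j] = logsumexp(terms)
    return A[n][m]


def match_probability(x1, x2, w_match, w_nonmatch):
    """P(m = 1 | x1, x2): mass of match-state alignments over the total mass."""
    s1 = log_alignment_sum(x1, x2, w_match)
    s0 = log_alignment_sum(x1, x2, w_nonmatch)
    return 1.0 / (1.0 + math.exp(s0 - s1))


# Hypothetical weights: match states reward copies, non-match states are indifferent.
W_MATCH = {"copy": 2.0, "subst": -1.0, "insert": -1.0, "delete": -1.0}
W_NONMATCH = {"copy": 0.0, "subst": 0.0, "insert": 0.0, "delete": 0.0}

print(match_probability("Tommi Jaakkola", "Tommi Jakola", W_MATCH, W_NONMATCH))
print(match_probability("Tom Dietterich", "Tom Dean", W_MATCH, W_NONMATCH))
```

The printed values depend entirely on the made-up weights; the 0.8/0.2 and 0.1/0.9 figures on the slides come from a trained model with richer features.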

20 Parameter Estimation
Given a training set of string pairs and match/non-match labels, the objective fn is the incomplete log likelihood.
Expectation Maximization
– E-step: Estimate distribution over alignments, using current parameters
– M-step: Change parameters to maximize the complete (penalized) log likelihood, with an iterative quasi-Newton method (BFGS)
This is “conditional EM”, but avoids the complexities of [Jebara 1998], because there is no need to solve the M-step in closed form.
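The slide's equations did not survive in this transcript. As a sketch only, the two likelihoods can be written in the usual latent-variable CRF form; the symbols Λ (weights), f (feature functions), S_m (alignments through the states for label m), Z_x (normalizer) and the Gaussian penalty variance σ² are assumptions of this sketch, not notation confirmed by the slide.

```latex
% x = <x_1, x_2> is a string pair, m \in \{0,1\} its label, a an alignment (state sequence).
p(m \mid x) = \frac{1}{Z_x} \sum_{a \in S_m} \exp\Big( \sum_t \Lambda \cdot f(a_{t-1}, a_t, x) \Big),
\qquad
Z_x = \sum_{a \in S_0 \cup S_1} \exp\Big( \sum_t \Lambda \cdot f(a_{t-1}, a_t, x) \Big)

% Incomplete (penalized) log likelihood over training pairs j:
\mathcal{L}(\Lambda) = \sum_j \log p\big(m^{(j)} \mid x^{(j)}\big) - \frac{\lVert \Lambda \rVert^2}{2\sigma^2}

% Complete (penalized) log likelihood maximized in the M-step; the E-step supplies q_j:
Q(\Lambda) = \sum_j \sum_{a \in S_{m^{(j)}}} q_j(a) \, \log p\big(m^{(j)}, a \mid x^{(j)}; \Lambda\big) - \frac{\lVert \Lambda \rVert^2}{2\sigma^2},
\qquad
q_j(a) = p\big(a \mid m^{(j)}, x^{(j)}; \Lambda^{\text{old}}\big)
```

Because Q is maximized numerically (for example with BFGS), no closed-form M-step is required.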

21 Efficient Training
– Dynamic programming table is 3D; |x1| = |x2| = 100 and |S| = 12 give 100 × 100 × 12 = 120,000 entries
– Use beam search during E-step [Pal, Sutton, McCallum 2005]
– Unlike completely observed CRFs, objective function is not convex.
– Initialize parameters not at zero, but so as to yield a reasonable initial edit distance.

22 What Alignments are Learned?
[Figure: Start state; match states (m = 1): subst, insert, delete, copy; non-match states (m = 0): subst, insert, delete, copy]
x1 = “Tommi Jaakkola”
x2 = “Tommi Jakola”
  T o m m i J a a k k o l a
  T o m i J a k o l a

23 What Alignments are Learned?
[Figure: Start state; match states (m = 1): subst, insert, delete, copy; non-match states (m = 0): subst, insert, delete, copy]
x1 = “Bruce Croft”
x2 = “Tom Dean”
  B r u c e C r o f t
  T o m D e a n

24 What Alignments are Learned?
[Figure: Start state; match states (m = 1): subst, insert, delete, copy; non-match states (m = 0): subst, insert, delete, copy]
x1 = “Jaime Carbonell”
x2 = “Jamie Callan”
  J a i m e C a r b o n e l l
  J a m i e C a l a n

25 Example Learned Alignment

26 Summary of Advantages
Arbitrary features of the input strings
– Examine past, future context
– Use lexicons, WordNet
Extremely flexible edit operations
– Single operation may make arbitrary jumps in both strings, of size determined by input features
Discriminative Training
– Maximize ability to predict match vs non-match

27 Experimental Results: Data Sets
Restaurant name, Restaurant address
– 864 records, 112 matches
– E.g. “Abe’s Bar & Grill, E. Main St” “Abe’s Grill, East Main Street”
People names, UIS DB generator
– synthetic noise
– E.g. “John Smith” vs “Snith, John”
CiteSeer Citations
– In four sections: Reason, Face, Reinforce, Constraint
– E.g. “Rusell & Norvig, “Artificial Intelligence: A Modern...” “Russell & Norvig, “Artificial Intelligence: An Intro...”

28 Experimental Results: Features
– same, different
– same-alphabetic, different-alphabetic
– same-numeric, different-numeric
– punctuation1, punctuation2
– alphabet-mismatch, numeric-mismatch
– end-of-1, end-of-2
– same-next-character, different-next-character
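The slide gives only the feature names. The sketch below shows one plausible way such binary features could be computed for a pair of string positions; the definitions are our interpretation of the names and the function `pair_features` is hypothetical, not the authors' implementation.

```python
import string


def pair_features(x1, x2, i, j):
    """Names of the binary features that fire at position i of x1 and j of x2 (0-based)."""
    end1, end2 = i >= len(x1), j >= len(x2)
    feats = {"end-of-1": end1, "end-of-2": end2}
    if not end1 and not end2:
        a, b = x1[i], x2[j]
        same = a == b
        feats["same"] = same
        feats["different"] = not same
        feats["same-alphabetic"] = same and a.isalpha()
        feats["different-alphabetic"] = (not same) and a.isalpha() and b.isalpha()
        feats["same-numeric"] = same and a.isdigit()
        feats["different-numeric"] = (not same) and a.isdigit() and b.isdigit()
        # Assumed meaning: exactly one of the two characters is a letter / a digit.
        feats["alphabet-mismatch"] = a.isalpha() != b.isalpha()
        feats["numeric-mismatch"] = a.isdigit() != b.isdigit()
    if not end1:
        feats["punctuation1"] = x1[i] in string.punctuation
    if not end2:
        feats["punctuation2"] = x2[j] in string.punctuation
    if i + 1 < len(x1) and j + 1 < len(x2):
        next_same = x1[i + 1] == x2[j + 1]
        feats["same-next-character"] = next_same
        feats["different-next-character"] = not next_same
    return {name for name, on in feats.items() if on}


print(sorted(pair_features("Cohon", "Cohen", 3, 3)))
```

Because the model is conditional, features like these may look at any part of either string, including upcoming characters, without having to model how the strings were generated.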

29 Experimental Results: Edit Operations
– insert, delete, substitute/copy
– swap-two-characters
– skip-word-if-in-lexicon
– skip-parenthesized-words
– skip-any-word
– substitute-word-pairs-in-translation-lexicon
– skip-word-if-present-in-other-string

30 Experimental Results
F1 (harmonic mean of precision and recall)

                      CiteSeer                             Restaurant   Restaurant
  Distance metric     Reason   Face    Reinf   Constraint  name         address
  Levenshtein         0.927    0.952   0.893   0.924       0.290        0.686
  Learned Leven.      0.938    0.966   0.907   0.941       0.354        0.712
  Vector              0.897    0.922   0.903   0.923       0.365        0.380
  Learned Vector      0.924    0.875   0.808   0.913       0.433        0.532

[Bilenko & Mooney 2003]

31 Experimental Results
F1 (harmonic mean of precision and recall)

                      CiteSeer                             Restaurant   Restaurant
  Distance metric     Reason   Face    Reinf   Constraint  name         address
  Levenshtein         0.927    0.952   0.893   0.924       0.290        0.686
  Learned Leven.      0.938    0.966   0.907   0.941       0.354        0.712
  Vector              0.897    0.922   0.903   0.923       0.365        0.380
  Learned Vector      0.924    0.875   0.808   0.913       0.433        0.532
  CRF Edit Distance   0.964    0.918   0.917   0.976       0.448        0.783

[Bilenko & Mooney 2003]
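For reference, F1 is conventionally the harmonic mean of precision and recall. A minimal sketch follows; the function name `f1_score` and the zero-division convention are our own choices, and the numbers in the usage line are made up rather than taken from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(f1_score(0.5, 1.0))
```

The harmonic mean penalizes an imbalance between precision and recall more strongly than an arithmetic average would.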

32 Experimental Results
Data set: person names, with word-order noise added

                                              F1
  Without skip-if-present-in-other-string    0.856
  With skip-if-present-in-other-string       0.981

33 Related Work
Learned Edit Distance
– [Bilenko & Mooney 2003], [Cohen et al 2003], ...
– [Joachims 2003]: Max-margin, trained on alignments
Conditionally-trained models with latent variables
– [Jebara 1999]: “Conditional Expectation Maximization”
– [Quattoni, Collins, Darrell 2005]: CRF for visual object recognition, with latent classes for object sub-patches
– [Zettlemoyer & Collins 2005]: CRF for mapping sentences to logical form, with latent parses.

34 “Predictive Random Fields”
Latent Variable Models fit by Multi-way Conditional Probability [McCallum, Wang, Pal, 2005]
– For clustering structured data, à la Latent Dirichlet Allocation & its successors
– But an undirected model, like the Harmonium [Welling, Rosen-Zvi, Hinton, 2005]
– But trained by a “multi-conditional” objective: O = P(A|B,C) P(B|A,C) P(C|A,B)
  e.g. A, B, C are different modalities (cf. “Predictive Likelihood”)

35 Predictive Random Fields
Mixture of Gaussians on synthetic data [McCallum, Wang, Pal, 2005]
[Figure panels: Data, classify by color; Generatively trained; Conditionally-trained [Jebara 1998]; Predictive Random Field]

36 Predictive Random Fields vs. Harmonium
On document retrieval task [McCallum, Wang, Pal, 2005]
[Figure legend: Harmonium, joint with words; Harmonium, joint, with class labels and words; Conditionally-trained, to predict class labels; Predictive Random Field, multi-way conditionally trained]

37 Summary
String edit distance
– Widely used in many fields
As in CRF sequence labeling, benefit by
– conditional-probability training, and
– ability to use arbitrary, non-independent input features
Example of conditionally-trained model with latent variables.
– “Find the alignments that most help distinguish match from non-match.”
– May ultimately want the alignments, but only have relatively-easier-to-label +/- labels at training time: “Distantly-labeled data”, “semi-supervised learning”
Future work: Edit distance on trees.
See also “Predictive Random Fields” http://www.cs.umass.edu/~pal/PRFTR.pdf

38 End of talk

