A Conditional Random Field for Discriminatively-trained Finite-state String Edit Distance
Andrew McCallum, Kedar Bellare, Fernando Pereira
Thanks to Charles Sutton, Xuerui Wang and Mikhail Bilenko for helpful discussions.

String Edit Distance
Distance between sequences x and y:
– “cost” of lowest-cost sequence of edit operations that transform string x into y.

String Edit Distance
Distance between sequences x and y:
– “cost” of lowest-cost sequence of edit operations that transform string x into y.
Applications
– Database Record Deduplication
  Record 1: Apex International Hotel, Grassmarket Street
  Record 2: Apex Internat’l, Grasmarket Street
  Records are duplicates of the same hotel?

String Edit Distance
Distance between sequences x and y:
– “cost” of lowest-cost sequence of edit operations that transform string x into y.
Applications
– Database Record Deduplication
– Biological Sequences
  AGCTCTTACGATAGAGGACTCCAGA
  AGGTCTTACCAAAGAGGACTTCAGA

String Edit Distance
Distance between sequences x and y:
– “cost” of lowest-cost sequence of edit operations that transform string x into y.
Applications
– Database Record Deduplication
– Biological Sequences
– Machine Translation
  Il a acheté une pomme
  He bought an apple

String Edit Distance
Distance between sequences x and y:
– “cost” of lowest-cost sequence of edit operations that transform string x into y.
Applications
– Database Record Deduplication
– Biological Sequences
– Machine Translation
– Textual Entailment
  He bought a new car last night
  He purchased a brand new automobile yesterday evening

Levenshtein Distance [Levenshtein 1966]
Edit operations
– copy: copy a character from x to y (cost 0)
– insert: insert a character into y (cost 1)
– delete: delete a character from y (cost 1)
– subst: substitute one character for another (cost 1)
Align two strings
x1 = “William W. Cohon”
x2 = “Willleam Cohen”
Lowest-cost alignment: mostly copies, plus one substitution (i → l), one insertion (e), three deletions (the “W. ” portion), and one substitution (o → e).
Total cost = 6 = Levenshtein Distance

Levenshtein Distance
Edit operations
– copy: copy a character from x to y (cost 0)
– insert: insert a character into y (cost 1)
– delete: delete a character from y (cost 1)
– subst: substitute one character for another (cost 1)
Dynamic program
D(i,j) = score of best alignment from x_1 ... x_i to y_1 ... y_j
D(i,j) = min { D(i-1,j-1) + δ(x_i ≠ y_j),  D(i-1,j) + 1,  D(i,j-1) + 1 }
where δ(·) is 1 if the condition holds and 0 otherwise.
Total cost of the best full alignment = distance.
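A minimal sketch of this dynamic program in Python (the function name is an assumption; the costs and recurrence follow the slide: copy 0, insert/delete/subst 1):

```python
def levenshtein(x, y):
    """Edit distance where copy costs 0 and insert, delete, subst each cost 1,
    following the recurrence on the slide."""
    n, m = len(x), len(y)
    # D[i][j] = cost of the best alignment of x[:i] with y[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i          # delete every character of x[:i]
    for j in range(1, m + 1):
        D[0][j] = j          # insert every character of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (x[i - 1] != y[j - 1]),  # copy (0) or subst (1)
                D[i - 1][j] + 1,                           # delete from x
                D[i][j - 1] + 1,                           # insert into y
            )
    return D[n][m]

# The slide's example: levenshtein("William W. Cohon", "Willleam Cohen") == 6
```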

Levenshtein Distance with Markov Dependencies
Same edit operations (copy, insert, delete, subst), but each operation’s cost now depends on the previous operation; for example, a repeated delete is cheaper than a first delete.
Learn these costs from training data.
The dynamic-programming table becomes 3D: position in x, position in y, and previous edit operation.
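A sketch of how the dynamic program changes with these Markov dependencies (this is illustrative only, not the learned model from the talk; the cost values and function name are assumptions):

```python
# Edit distance where each operation's cost depends on the previous operation.
# The DP table gains a third dimension over the previous edit operation.
OPS = ["copy", "insert", "delete", "subst"]
COST = {prev: {"copy": 0.0, "insert": 1.0, "delete": 1.0, "subst": 1.0}
        for prev in OPS + ["start"]}
COST["delete"]["delete"] = 0.5  # e.g. a repeated delete is cheaper (placeholder value)

def markov_edit_distance(x, y):
    INF = float("inf")
    n, m = len(x), len(y)
    # D[i][j][prev] = best cost aligning x[:i] with y[:j], last operation prev
    D = [[{p: INF for p in OPS + ["start"]} for _ in range(m + 1)]
         for _ in range(n + 1)]
    D[0][0]["start"] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            for prev, c in D[i][j].items():
                if c == INF:
                    continue
                if i < n and j < m:  # copy if characters match, else substitute
                    op = "copy" if x[i] == y[j] else "subst"
                    new = c + COST[prev][op]
                    if new < D[i + 1][j + 1][op]:
                        D[i + 1][j + 1][op] = new
                if i < n:            # delete a character of x
                    new = c + COST[prev]["delete"]
                    if new < D[i + 1][j]["delete"]:
                        D[i + 1][j]["delete"] = new
                if j < m:            # insert a character of y
                    new = c + COST[prev]["insert"]
                    if new < D[i][j + 1]["insert"]:
                        D[i][j + 1]["insert"] = new
    return min(D[n][m].values())
```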

Ristad & Yianilos (1997)
Essentially a Pair-HMM, generating an edit/state/alignment sequence together with the two strings.
– Complete-data likelihood: probability of the two strings with one particular alignment.
– Incomplete-data likelihood: sum over all alignments consistent with x1 and x2; this is the match score.
Learn via EM:
– Expectation step: calculate the likelihood of alignment paths.
– Maximization step: make those paths more likely.
Given a training set of matching string pairs, the objective function is the incomplete-data likelihood of those pairs.
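In symbols, a sketch of the quantities named on this slide, with notation assumed (θ for the pair-HMM parameters, a for a latent alignment):

```latex
% Complete-data likelihood of one alignment a together with the two strings:
p_\theta\big(x^{(1)}, x^{(2)}, a\big)

% Incomplete-data likelihood (the match score): sum over all alignments
% consistent with x^{(1)} and x^{(2)}:
p_\theta\big(x^{(1)}, x^{(2)}\big) \;=\; \sum_{a} p_\theta\big(x^{(1)}, x^{(2)}, a\big)

% EM objective over a training set of matching string pairs:
\theta^{*} \;=\; \arg\max_{\theta} \sum_{i} \log p_\theta\big(x^{(1)}_i, x^{(2)}_i\big)
```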

Ristad & Yianilos: Regrets
Limited features of the input strings
– Examine only a single character pair at a time
– Difficult to use upcoming string context, lexicons, ...
– Example: “Senator John Green” vs. “John Green”
Limited edit operations
– Difficult to generate arbitrary jumps in both strings
– Example: “UMass” vs. “University of Massachusetts”
Trained only on positive match data
– Doesn’t include information-rich “near misses”
– Example: “ACM SIGIR” ≠ “ACM SIGCHI”
So, consider a model trained by conditional probability.

Conditional Probability (Sequence) Models
We prefer a model trained to maximize a conditional probability rather than a joint probability: P(y|x) instead of P(y,x).
– Can examine features without being responsible for generating them.
– Don’t have to explicitly model dependencies among the features.

From HMMs to Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
Joint model (HMM): states y_t and observations x_t are modeled jointly.
Conditional model (linear-chain; a super-special case of Conditional Random Fields): model p(y|x) directly. Set parameters by maximum likelihood, using an optimization method on ∇L.
Wide-spread interest, positive experimental results in many applications:
– Noun phrase and named entity recognition [HLT’03], [CoNLL’03]
– Protein structure prediction [ICML’04]
– IE from bioinformatics text [Bioinformatics ’04], ...
– Asian word segmentation [COLING’04], [ACL’04]
– IE from research papers [HLT’04]
– Object classification in images [CVPR ’04]

(Linear-Chain) Conditional Random Fields [Lafferty, McCallum, Pereira 2001]
Undirected graphical model, trained to maximize the conditional probability of an output sequence (FSM states y_t) given an input sequence (observations x_t).
Example input sequence:  ... said Jones a Microsoft VP ...
Example output sequence: ... OTHER PERSON OTHER ORG TITLE ...
Wide-spread interest, positive experimental results in many applications:
– Noun phrase and named entity recognition [HLT’03], [CoNLL’03]
– Protein structure prediction [ICML’04]
– IE from bioinformatics text [Bioinformatics ’04], ...
– Asian word segmentation [COLING’04], [ACL’04]
– IE from research papers [HLT’04]
– Object classification in images [CVPR ’04]
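The formula omitted from this slide is presumably the standard linear-chain CRF conditional probability; a sketch with assumed notation (features f_k, weights λ_k):

```latex
p_\Lambda(\mathbf{y} \mid \mathbf{x})
  \;=\; \frac{1}{Z_\Lambda(\mathbf{x})}
        \prod_{t=1}^{T} \exp\!\Big( \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \Big),
\qquad
Z_\Lambda(\mathbf{x})
  \;=\; \sum_{\mathbf{y}'} \prod_{t=1}^{T} \exp\!\Big( \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \Big)
```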

CRF String Edit Distance
The same alignment of string 1 and string 2 through edit operations (copy, subst, insert, delete) as before, but the joint complete-data likelihood is replaced by a conditional complete-data likelihood.
Want to train from a set of string pairs, each labeled one of {match, non-match}:
match      “William W. Cohon”   “Willlleam Cohen”
non-match  “Bruce D’Ambrosio”   “Bruce Croft”
match      “Tommi Jaakkola”     “Tommi Jakola”
match      “Stuart Russell”     “Stuart Russel”
non-match  “Tom Deitterich”     “Tom Dean”

CRF String Edit Distance FSM
A finite-state machine with states for the edit operations: copy, subst, insert, delete.

CRF String Edit Distance FSM
Two copies of the edit-operation FSM (copy, subst, insert, delete), one for the match (m = 1) states and one for the non-match (m = 0) states, joined by a Start state.
Conditional incomplete-data likelihood: sum over all alignments.
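A sketch of the conditional incomplete-data likelihood named on this slide, with assumed notation (m ∈ {0, 1} for non-match/match, and A_m for the set of alignments passing through the corresponding half of the FSM):

```latex
p_\Lambda\big(m \mid x^{(1)}, x^{(2)}\big)
  \;=\; \sum_{a \,\in\, \mathcal{A}_m} p_\Lambda\big(m, a \mid x^{(1)}, x^{(2)}\big)
```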

CRF String Edit Distance FSM
x1 = “Tommi Jaakkola”, x2 = “Tommi Jakola”
Probability summed over all alignments in match states: 0.8
Probability summed over all alignments in non-match states: 0.2

CRF String Edit Distance FSM
x1 = “Tom Dietterich”, x2 = “Tom Dean”
Probability summed over all alignments in match states: 0.1
Probability summed over all alignments in non-match states: 0.9

Parameter Estimation
Given a training set of string pairs and match/non-match labels, the objective function is the incomplete log likelihood (the alignments are latent).
Expectation Maximization
– E-step: Estimate the distribution over alignments using the current parameters.
– M-step: Change the parameters to maximize the complete (penalized) log likelihood, with an iterative quasi-Newton method (BFGS).
This is “conditional EM”, but it avoids the complexities of [Jebara 1998], because there is no need to solve the M-step in closed form.
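A sketch of the objective and the EM decomposition described above, with assumed notation; the penalty mentioned on the slide is assumed here to be a Gaussian prior on the weights:

```latex
% Incomplete (penalized) log likelihood over labeled pairs (x_i^{(1)}, x_i^{(2)}, m_i):
\mathcal{L}(\Lambda) \;=\;
  \sum_{i} \log \sum_{a} p_\Lambda\big(m_i, a \mid x^{(1)}_i, x^{(2)}_i\big)
  \;-\; \frac{\lVert \Lambda \rVert^2}{2\sigma^2}

% E-step: q_i(a) = p_{\Lambda^{\mathrm{old}}}\big(a \mid m_i, x^{(1)}_i, x^{(2)}_i\big)
% M-step (by BFGS): maximize the expected complete (penalized) log likelihood
\Lambda^{\mathrm{new}} \;=\; \arg\max_{\Lambda}
  \sum_{i} \sum_{a} q_i(a)\, \log p_\Lambda\big(m_i, a \mid x^{(1)}_i, x^{(2)}_i\big)
  \;-\; \frac{\lVert \Lambda \rVert^2}{2\sigma^2}
```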

Efficient Training
The dynamic programming table is 3D; with |x1| = |x2| = 100 and |S| = 12 states, it has 120,000 entries.
– Use beam search during the E-step [Pal, Sutton, McCallum 2005].
Unlike completely observed CRFs, the objective function is not convex.
– Initialize parameters not at zero, but so as to yield a reasonable initial edit distance.

What Alignments are Learned?
x1 = “Tommi Jaakkola”, x2 = “Tommi Jakola”
(Figure: learned alignment lattice over the two strings, under the match/non-match FSM.)

What Alignments are Learned?
x1 = “Bruce Croft”, x2 = “Tom Dean”
(Figure: learned alignment lattice over the two strings, under the match/non-match FSM.)

What Alignments are Learned?
x1 = “Jaime Carbonell”, x2 = “Jamie Callan”
(Figure: learned alignment lattice over the two strings, under the match/non-match FSM.)

Example Learned Alignment

Summary of Advantages
Arbitrary features of the input strings
– Examine past and future context
– Use lexicons, WordNet
Extremely flexible edit operations
– A single operation may make arbitrary jumps in both strings, of size determined by input features
Discriminative training
– Maximize ability to predict match vs. non-match

Experimental Results: Data Sets
Restaurant name, Restaurant address
– 864 records, 112 matches
– E.g. “Abe’s Bar & Grill, E. Main St” vs. “Abe’s Grill, East Main Street”
People names, UIS DB generator
– Synthetic noise
– E.g. “John Smith” vs. “Snith, John”
CiteSeer Citations
– In four sections: Reason, Face, Reinforce, Constraint
– E.g. “Rusell & Norvig, ‘Artificial Intelligence: A Modern...’” vs. “Russell & Norvig, ‘Artificial Intelligence: An Intro...’”

Experimental Results: Features
same, different
same-alphabetic, different-alphabetic
same-numeric, different-numeric
punctuation1, punctuation2
alphabet-mismatch, numeric-mismatch
end-of-1, end-of-2
same-next-character, different-next-character

Experimental Results: Edit Operations
insert, delete, substitute/copy
swap-two-characters
skip-word-if-in-lexicon
skip-parenthesized-words
skip-any-word
substitute-word-pairs-in-translation-lexicon
skip-word-if-present-in-other-string

Experimental Results
(Table: F1, the average of precision and recall, for the distance metrics Levenshtein, Learned Levenshtein, Vector, and Learned Vector [Bilenko & Mooney 2003], on the CiteSeer sections Reason, Face, Reinf, Constraint and on Restaurant name and Restaurant address.)

Experimental Results
(Table: the same F1 comparison, now including CRF Edit Distance alongside Levenshtein, Learned Levenshtein, Vector, and Learned Vector [Bilenko & Mooney 2003], on the CiteSeer sections Reason, Face, Reinf, Constraint and on Restaurant name and Restaurant address.)

Experimental Results
(Chart: F1 without vs. with the skip-if-present-in-other-string edit operation. Data set: person names, with word-order noise added.)

Related Work
Learned edit distance
– [Bilenko & Mooney 2003], [Cohen et al 2003], ...
– [Joachims 2003]: Max-margin, trained on alignments
Conditionally-trained models with latent variables
– [Jebara 1999]: “Conditional Expectation Maximization”
– [Quattoni, Collins, Darrell 2005]: CRF for visual object recognition, with latent classes for object sub-patches
– [Zettlemoyer & Collins 2005]: CRF for mapping sentences to logical form, with latent parses

“Predictive Random Fields”: Latent Variable Models Fit by Multi-way Conditional Probability [McCallum, Wang, Pal, 2005]
– For clustering structured data, à la Latent Dirichlet Allocation and its successors
– But an undirected model, like the Harmonium [Welling, Rosen-Zvi, Hinton, 2005]
– But trained by a “multi-conditional” objective (cf. “predictive likelihood”):
  O = P(A|B,C) P(B|A,C) P(C|A,B), e.g. where A, B, C are different modalities

Predictive Random Fields: Mixture of Gaussians on Synthetic Data [McCallum, Wang, Pal, 2005]
(Panels: data, classified by color; generatively trained; conditionally-trained [Jebara 1998]; Predictive Random Field.)

Predictive Random Fields vs. Harmonium on a Document Retrieval Task [McCallum, Wang, Pal, 2005]
(Compared models: Harmonium, joint with words; Harmonium, joint with class labels and words; conditionally-trained to predict class labels; Predictive Random Field, multi-way conditionally trained.)

Summary
String edit distance
– Widely used in many fields
As in CRF sequence labeling, benefit from
– conditional-probability training, and
– the ability to use arbitrary, non-independent input features
An example of a conditionally-trained model with latent variables
– “Find the alignments that most help distinguish match from non-match.”
– May ultimately want the alignments, but only have relatively-easier-to-label +/- labels at training time: “distantly-labeled data”, “semi-supervised learning”
Future work: edit distance on trees.
See also “Predictive Random Fields”.

End of talk