Edit Distances William W. Cohen.

Edit Distances William W. Cohen

Midterm progress reports
- Talk for 5 min per team (you probably want to have one person speak)
- Talk about:
  - the problem & dataset
  - the baseline results
  - what you plan to do next
- Send Brendan 3-4 slides in PDF by Monday night

Plan for this week
- Why EM works
- Edit distances
  - Distance(s,t) = cost of the best edit sequence that transforms s into t
  - Found via….
- Learning edit distances
  - Probabilistic generative model: pair HMMs
  - Learning now requires EM
  - Detour: EM for plain ol' HMMs
  - EM for pair HMMs
  - Why EM works
- Discriminative learning for pair HMMs

Motivation
Common problem: classify a pair of strings (s,t) as "these denote the same entity [or similar entities]".
Examples:
- ("Carnegie-Mellon University", "Carnegie Mellon Univ.")
- ("Noah Smith, CMU", "Noah A. Smith, Carnegie Mellon")
Applications:
- Co-reference in NLP
- Linking entities in two databases
- Removing duplicates in a database
- Finding related genes
- "Distant learning": training NER from dictionaries

Levenshtein distance - example
distance("William Cohen", "Willliam Cohon") = 2
[Alignment figure: s = "William Cohen" aligned against t = "Willliam Cohon"; the extra "l" in "Willliam" is covered by a gap (cost 1) and the "e"/"o" mismatch in "Cohen"/"Cohon" by a substitution (cost 1), with each alignment position labeled by its edit operation and cost.]

Computing Levenshtein distance - 2
D(i,j) = score of best alignment from s1..si to t1..tj

D(i,j) = min of:
  D(i-1,j-1) + d(si,tj)   // subst/copy
  D(i-1,j) + 1            // insert
  D(i,j-1) + 1            // delete

(Simplify by letting d(c,d) = 0 if c = d, 1 otherwise; also let D(i,0) = i (for i inserts) and D(0,j) = j.)
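A minimal sketch of this recurrence in Python (not from the slides; standard dynamic programming with the unit costs above):

```python
def levenshtein(s, t):
    """Unit-cost edit distance: fill D(i,j) row by row following the recurrence above."""
    n, m = len(s), len(t)
    # D[i][j] = cost of the best alignment of s[:i] with t[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i                    # per the slide: D(i,0) = i
    for j in range(1, m + 1):
        D[0][j] = j                    # per the slide: D(0,j) = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + d,   # subst/copy
                          D[i - 1][j] + 1,       # insert
                          D[i][j - 1] + 1)       # delete
    return D[n][m]

print(levenshtein("William Cohen", "Willliam Cohon"))   # -> 2, matching the example above
```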

Computing Levenshtein distance – 4
D(i,j) = min of:
  D(i-1,j-1) + d(si,tj)   // subst/copy
  D(i-1,j) + 1            // insert
  D(i,j-1) + 1            // delete

[DP table fragment: columns labeled C O H E N; the row for the character M holds 1 2 3 4 5.]
A trace indicates where the min value came from, and can be used to find the edit operations and/or a best alignment (there may be more than one).
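One way to keep the trace is to record, for each cell, which of the three cases achieved the minimum and then walk those backpointers from (n,m) back to (0,0). A sketch (again not from the slides; the operation names are my own labels):

```python
def levenshtein_alignment(s, t):
    """Return (distance, ops), where ops is one best edit script recovered from the trace."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]   # 'diag', 'up', or 'left'
    for i in range(1, n + 1):
        D[i][0], back[i][0] = i, 'up'
    for j in range(1, m + 1):
        D[0][j], back[0][j] = j, 'left'
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            choices = [(D[i - 1][j - 1] + d, 'diag'),   # copy/substitute
                       (D[i - 1][j] + 1, 'up'),         # s[i-1] aligned to a gap
                       (D[i][j - 1] + 1, 'left')]       # t[j-1] aligned to a gap
            D[i][j], back[i][j] = min(choices)
    # Walk the trace back to recover one best alignment (there may be more than one).
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == 'diag':
            ops.append(('copy' if s[i - 1] == t[j - 1] else 'subst', s[i - 1], t[j - 1]))
            i, j = i - 1, j - 1
        elif move == 'up':
            ops.append(('gap-in-t', s[i - 1], '-'))
            i -= 1
        else:
            ops.append(('gap-in-s', '-', t[j - 1]))
            j -= 1
    return D[n][m], list(reversed(ops))

dist, ops = levenshtein_alignment("COHEN", "COHON")
print(dist, ops)
```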

Extensions
- Add parameters for differential costs for delete, substitute, … operations, e.g. a "gap cost" G and substitution costs d(x,y).
- Allow s to match a substring of t (Smith-Waterman).
- Model the cost of a length-n insertion as A + Bn instead of Gn ("affine distance"); the DP then needs to remember whether a gap is open in s, in t, or in neither.
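A sketch of the affine-gap idea, in a Gotoh-style three-matrix DP (the matrix names, default costs, and initialization below are my assumptions, not taken from the slides): M tracks alignments ending in a copy/substitution, while GS and GT track alignments whose current gap run consumes characters of s or of t, so a length-n gap costs A + Bn.

```python
INF = float('inf')

def affine_edit_distance(s, t, A=1.0, B=0.5, sub=lambda a, b: 0.0 if a == b else 1.0):
    """Edit distance where a gap of length n costs A + B*n, via three DP matrices."""
    n, m = len(s), len(t)
    M  = [[INF] * (m + 1) for _ in range(n + 1)]   # ends in copy/substitution
    GS = [[INF] * (m + 1) for _ in range(n + 1)]   # gap open: consuming characters of s
    GT = [[INF] * (m + 1) for _ in range(n + 1)]   # gap open: consuming characters of t
    M[0][0] = 0.0
    for i in range(1, n + 1):
        GS[i][0] = A + B * i          # s[:i] aligned entirely to a gap
    for j in range(1, m + 1):
        GT[0][j] = A + B * j          # t[:j] aligned entirely to a gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(M[i - 1][j - 1], GS[i - 1][j - 1], GT[i - 1][j - 1])
            M[i][j] = best_prev + sub(s[i - 1], t[j - 1])
            # open a new gap (pay A + B for its first character) or extend one (pay B)
            GS[i][j] = min(M[i - 1][j] + A + B, GS[i - 1][j] + B)
            GT[i][j] = min(M[i][j - 1] + A + B, GT[i][j - 1] + B)
    return min(M[n][m], GS[n][m], GT[n][m])

print(affine_edit_distance("William Cohen", "Wm. Cohen"))
```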

Forward-backward for HMMs
- Forward (α): all paths to s_t = i, and all emissions up to and including t
- Backward (β): all paths after s_t = i, and all emissions after t
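A numpy sketch of the two quantities (standard definitions; the variable names pi, T_mat, E are my own, not from the slides): alpha[t, i] sums over all paths that reach state i at time t and emit x[0..t], and beta[t, i] sums over all paths after state i at time t emitting x[t+1..].

```python
import numpy as np

def forward_backward(x, pi, T_mat, E):
    """x: observation indices; pi[i]: initial state probs; T_mat[i,j]: transition probs;
    E[i,v]: emission probs.  Returns (alpha, beta, likelihood)."""
    n_obs, K = len(x), len(pi)
    alpha = np.zeros((n_obs, K))
    beta = np.zeros((n_obs, K))
    alpha[0] = pi * E[:, x[0]]                       # reach state i at t=0 and emit x[0]
    for t in range(1, n_obs):
        alpha[t] = (alpha[t - 1] @ T_mat) * E[:, x[t]]
    beta[-1] = 1.0                                   # nothing left to emit after the last step
    for t in range(n_obs - 2, -1, -1):
        beta[t] = T_mat @ (E[:, x[t + 1]] * beta[t + 1])
    return alpha, beta, alpha[-1].sum()

# Tiny example with K=2 states and a 3-symbol alphabet (made-up numbers).
pi = np.array([0.6, 0.4])
T_mat = np.array([[0.7, 0.3],
                  [0.4, 0.6]])
E = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
alpha, beta, Z = forward_backward([0, 2, 1], pi, T_mat, E)
print(Z)   # Pr(x); the same value is obtained from (alpha[t] * beta[t]).sum() at any t
```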

EM for HMMs
Expected counts needed:
- pass through state i at t and emit a at t
- pass through states i,j at t,t+1 … and continue to the end
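Those expected counts are the usual gamma and xi quantities. A sketch that builds on alpha and beta from the previous block (the helper names are mine), including the re-estimation step:

```python
import numpy as np

def expected_counts(x, T_mat, E, alpha, beta):
    """gamma[t,i] = Pr(state i at time t | x); xi[t,i,j] = Pr(states i,j at t,t+1 | x)."""
    n_obs, K = alpha.shape
    Z = alpha[-1].sum()                                  # Pr(x)
    gamma = alpha * beta / Z                             # pass through state i at t (and emit x[t])
    xi = np.zeros((n_obs - 1, K, K))
    for t in range(n_obs - 1):
        # pass through states i,j at t,t+1, emit x[t+1] from j, and continue to the end
        xi[t] = (alpha[t][:, None] * T_mat * E[:, x[t + 1]][None, :] * beta[t + 1][None, :]) / Z
    return gamma, xi

def m_step(x, gamma, xi, n_symbols):
    """Re-estimate transition and emission probabilities from the expected counts."""
    T_new = xi.sum(axis=0)
    T_new /= T_new.sum(axis=1, keepdims=True)
    E_new = np.zeros((gamma.shape[1], n_symbols))
    for t, v in enumerate(x):
        E_new[:, v] += gamma[t]
    E_new /= E_new.sum(axis=1, keepdims=True)
    return T_new, E_new
```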

Pair HMM Example
Emission distribution: each step emits a pair e from {<a,a>, <e,e>, <h,h>, <e,->, <h,t>, <-,h>, …} with probability Pr(e), e.g. Pr(<a,a>) = 0.10, Pr(<e,->) = 0.05, Pr(<-,h>) = 0.01; the probabilities sum to 1.

Sample run: z1…zT = <h,t>, <e,e>, <e,e>, <h,h>, <e,->, <e,e>
Strings x,y produced by z1…zT: x = heehee, y = teehe
Notice that (x,y) is also produced by the first four pairs z1…z4 followed by <e,e>, <e,->, and by many other edit strings.
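A tiny sketch to check that last point: two different edit strings yield the same pair (x,y), so Pr(x,y) under the pair HMM must sum over all of them (the helper below is mine, not from the slides):

```python
def strings_from_edit(z):
    """Read off the pair (x, y) generated by an edit string of pairs like ('e', '-')."""
    x = "".join(a for a, b in z if a != '-')
    y = "".join(b for a, b in z if b != '-')
    return x, y

z1 = [('h', 't'), ('e', 'e'), ('e', 'e'), ('h', 'h'), ('e', '-'), ('e', 'e')]
z2 = [('h', 't'), ('e', 'e'), ('e', 'e'), ('h', 'h'), ('e', 'e'), ('e', '-')]
print(strings_from_edit(z1))   # ('heehee', 'teehe')
print(strings_from_edit(z2))   # ('heehee', 'teehe'): same pair, different edit string
```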

Distances based on pair HMMs

Pair HMM Inference
[Figure: a grid with one axis indexing positions t = 1 … T in one string and the other indexing positions v = 1 … K in the other, with the characters of the two strings along the axes; the cell holding α(3,2) is highlighted.]

Pair HMM Inference: Forward-Backward
[Figure: the same grid over positions t = 1 … T and v = 1 … K, now showing the forward and backward quantities for each cell.]
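For a single-state (memoryless) pair HMM in the Ristad-Yianilos style, the forward value alpha(i,j) is the total probability of generating the prefixes x[:i] and y[:j], summed over all edit strings. A sketch with made-up parameters (the parameter names and values are my assumptions):

```python
def pair_hmm_forward(x, y, p_sub, p_del, p_ins, p_stop):
    """alpha[i][j] = total probability, over all edit strings, of emitting x[:i] and y[:j].
    p_sub(a,b): Pr(<a,b>), p_del(a): Pr(<a,->), p_ins(b): Pr(<-,b>)."""
    n, m = len(x), len(y)
    alpha = [[0.0] * (m + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                alpha[i][j] += p_sub(x[i - 1], y[j - 1]) * alpha[i - 1][j - 1]
            if i > 0:
                alpha[i][j] += p_del(x[i - 1]) * alpha[i - 1][j]
            if j > 0:
                alpha[i][j] += p_ins(y[j - 1]) * alpha[i][j - 1]
    return alpha[n][m] * p_stop        # Pr(x, y): sums over all edit strings, times stopping

# Made-up parameters roughly in the spirit of the emission table above.
score = pair_hmm_forward(
    "heehee", "teehe",
    p_sub=lambda a, b: 0.10 if a == b else 0.05,
    p_del=lambda a: 0.05,
    p_ins=lambda b: 0.01,
    p_stop=0.1,
)
print(score)
```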

EM to learn edit distances
Is this really like edit distances? Not really:
- sim(x,x) ≠ 1, and generally sim(x,x) gets smaller for longer x
- Edit distance is based on the single best edit sequence; Pr(x,y) is based on the weighted cost of all successful edit sequences
Will learning work?
- Unlike linear models there is no guarantee of global convergence: you might not find a good model even if one exists

Back to the R&Y paper...
They consider "coarse" and "detailed" models, as well as mixtures of both. The coarse model is like a back-off model: it merges edit operations into equivalence classes (e.g. based on equivalence classes of characters). They test by learning a distance for K-NN with an additional latent variable.

K-NN with latent prototypes
[Diagram: a test example y (a string of phonemes) is compared, via the learned phonetic distance, to possible prototypes x1, x2, x3, …, xm (known word pronunciations), which correspond to dictionary words w1, w2, …, wK.]

K-NN with latent prototypes
The method needs (x,y) pairs to train a distance. To handle this, an additional level of E/M is used to pick the "latent prototype" x to pair with each y.
[Diagram: as above, with y linked by the learned phonetic distance to prototypes x1 … xm and words w1 … wK.]

Plan for this week
- Edit distances
  - Distance(s,t) = cost of the best edit sequence that transforms s into t
  - Found via….
- Learning edit distances: Ristad and Yianilos
  - Probabilistic generative model: pair HMMs
  - Learning now requires EM
  - Detour: EM for plain ol' HMMs
  - EM for pair HMMs
  - Why EM works
- Discriminative learning for pair HMMs

EM
X = data, θ = model parameters, z = something you can't observe.
Problem: maximize the likelihood of the observed data, which means summing the "complete data likelihood" over the unobserved z.
Algorithm: iteratively improve θ1 → θ2 → … → θn → …
- Mixtures: z is the hidden mixture component
- HMMs: z is the hidden state sequence for the string
- Pair HMMs: z is the hidden sequence of pairs (x1,y1), … given (x,y)
- Latent-variable topic models (e.g., LDA): z is the assignment of words to topics
- …
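In symbols (a standard way to write what the slide is describing, not copied from it): the quantity we can only optimize indirectly is the observed-data log-likelihood, obtained by summing the complete-data likelihood over the unobserved z.

```latex
L(\theta) \;=\; \log \Pr(X \mid \theta)
        \;=\; \log \sum_{z} \underbrace{\Pr(X, z \mid \theta)}_{\text{complete-data likelihood}}
```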

Jensen's inequality
[Figures illustrating Jensen's inequality: for a convex function f, f(Σi λi xi) ≤ Σi λi f(xi) when λi ≥ 0 and Σi λi = 1; for the concave function ln the inequality flips, ln(Σi λi xi) ≥ Σi λi ln(xi), which is the form used in the EM derivation below.]

X = data, θ = model parameters, z = something you can't observe.
Let's think about moving from θn (our current parameter vector) to some new θ (the next one, hopefully better). We want to optimize L(θ) - L(θn) … using something like the bound below.
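The standard derivation the slide is gesturing at (my reconstruction, applying Jensen's inequality to the concave log): rewrite the improvement as the log of an expectation under Pr(z | X, θn) and push the log inside.

```latex
\begin{aligned}
L(\theta) - L(\theta_n)
 &= \log \sum_z \Pr(X \mid z,\theta)\,\Pr(z \mid \theta) \;-\; \log \Pr(X \mid \theta_n) \\
 &= \log \sum_z \Pr(z \mid X,\theta_n)\,
      \frac{\Pr(X \mid z,\theta)\,\Pr(z \mid \theta)}{\Pr(z \mid X,\theta_n)}
    \;-\; \log \Pr(X \mid \theta_n) \\
 &\ge \sum_z \Pr(z \mid X,\theta_n)\,
      \log \frac{\Pr(X \mid z,\theta)\,\Pr(z \mid \theta)}{\Pr(z \mid X,\theta_n)\,\Pr(X \mid \theta_n)}
   \;=:\; \Delta(\theta \mid \theta_n).
\end{aligned}
```

Choosing θ(n+1) = argmax over θ of Δ(θ | θn) is the same as maximizing Q(θ | θn) = Σz Pr(z | X, θn) log Pr(X, z | θ), since the remaining terms do not depend on θ; the bound is tight at θ = θn, so any improvement of the bound improves L.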

Comments
- Nice because we often know how to:
  - do learning in the model (if the hidden variables are known)
  - do inference in the model (to get the hidden variables)
  and that's all we need to do.
- Convergence: local, not global.
- Generalized EM: do the E-step, but in the M-step just improve θ rather than fully maximizing.

Key ideas
- A pair of strings (x,y) is associated with a label: {match, nonmatch}
- Classification is done by a pair HMM with two non-initial states, {match, nonmatch}, with no transitions between them
- The model scores alignments – emission sequences – as match/nonmatch

Key ideas
- Score the alignment sequence:
- The edit sequence is featurized:
- Marginalize over all alignments to score match vs. nonmatch:
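One way to write this down (a hedged reconstruction following the CRF-style formulation the slide alludes to; the feature notation f, weights λ, and alignment variable a are my labels): each alignment (edit sequence) a of the pair (x,y) gets a log-linear score from its features, and the class score marginalizes over all alignments.

```latex
\text{score}(a, m; x, y) \;=\; \exp\!\Big(\sum_k \lambda_k\, f_k(x, y, a, m)\Big),
\qquad
\Pr(m \mid x, y) \;=\; \frac{\sum_{a} \text{score}(a, m; x, y)}
                            {\sum_{m' \in \{\text{match},\,\text{nonmatch}\}} \sum_{a} \text{score}(a, m'; x, y)}.
```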

Key ideas
To learn, combine EM and CRF learning:
- compute expectations over the (hidden) alignments
- use L-BFGS to maximize (or at least improve) the parameters λ
- repeat …
Initialize the model with a "reasonable" set of parameters:
- hand-tuned parameters for matching strings
- copy the match parameters to the non-match state and shrink them toward zero

Results We will come back to this family of methods in a couple of weeks (discriminatively trained latent-variable models).