Processing Strings with HMMs: Structuring text and computing distances. William W. Cohen, CALD.


Outline
Motivation: adding structure to unstructured text
Mathematics:
– Unigram language models (& smoothing)
– HMM language models
– Reasoning: Viterbi, Forward-Backward
– Learning: Baum-Welch
Modeling:
– Normalizing addresses
– Trainable string edit distance metrics

Finding structure in addresses
William Cohen, 6941 Biddle St
Mr. & Mrs. Steve Zubinsky, 5641 Darlington Ave
Dr. Allan Hunter, Jr. 121 W. 7th St, NW.
Ava May Brown, Apt #3B, 14 S. Hunter St.
George St. George Biddle Duke III, 640 Wyman Ln.

Finding structure in addresses
Name | Number | Street
William Cohen, | 6941 | Biddle St
Mr. & Mrs. Steve Zubinsky, | 5641 | Darlington Ave
Dr. Allan Hunter, Jr. | 121 | W. 7th St, NW.
Ava May Brown, | Apt #3B, 14 | S. Hunter St.
George St. George Biddle Duke III, | 640 | Wyman Ln.
Knowing the structure may lead to better matching. But how do you determine which characters go where?

Finding structure in addresses
Step 1: decide how to score an assignment of words to fields.
Name | Number | Street
William Cohen, | 6941 | Biddle St
Mr. & Mrs. Steve Zubinsky, | 5641 | Darlington Ave
Dr. Allan Hunter, Jr. | 121 | W. 7th St, NW.
Ava May Brown, | Apt #3B, 14 | S. Hunter St.
George St. George Biddle Duke III, | 640 | Wyman Ln.
Good!

Finding structure in addresses
Name | Number | Street
William Cohen, 6941 | Biddle | St
Mr. & Mrs. Steve Zubinsky, | 5641 Darlington | Ave
Dr. Allan Hunter, Jr. 121 W. 7th St, | NW. |
Ava May | Brown, Apt #3B, | 14 S. Hunter St.
George St. George Biddle | Duke III, 640 | Wyman Ln.
Not so good!

Finding structure in addresses
One way to score a structure:
– Use a language model to model the tokens that are likely to occur in each field.
– Unigram model: tokens are drawn with replacement with probability P(token = t | field = f) = p_{t,f}.
A vocabulary of N tokens requires F*(N-1) parameters for F fields.
We can estimate p_{t,f} from a sample, but generally need smoothing (e.g. Dirichlet, Good-Turing).
Might use special tokens, e.g. #### vs 6941.
– Bigram model, trigram model: probably not useful here.
(A small code sketch of such a per-field unigram model follows below.)
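As an illustration, here is a minimal Python sketch of a per-field unigram model with add-one (Dirichlet-style) smoothing. The field names, vocabulary, and counts are invented for this example; they are not from the original slides.

```python
from collections import Counter

class UnigramFieldModel:
    """Per-field unigram language model with add-one (Dirichlet-style) smoothing."""

    def __init__(self, vocab):
        self.vocab = set(vocab)
        self.counts = {}   # field -> Counter of token counts
        self.totals = {}   # field -> total number of tokens observed in that field

    def observe(self, field, tokens):
        """Accumulate counts from a (possibly small) labeled sample."""
        c = self.counts.setdefault(field, Counter())
        c.update(tokens)
        self.totals[field] = self.totals.get(field, 0) + len(tokens)

    def prob(self, token, field):
        """P(token | field) = (count + 1) / (total + |vocab|), i.e. add-one smoothing."""
        c = self.counts.get(field, Counter())
        total = self.totals.get(field, 0)
        return (c[token] + 1.0) / (total + len(self.vocab))

# Toy example: estimate p_{t,f} from a tiny labeled sample.
model = UnigramFieldModel(vocab=["william", "cohen", "6941", "biddle", "st", "####"])
model.observe("Name", ["william", "cohen"])
model.observe("Number", ["6941"])
model.observe("Street", ["biddle", "st"])
print(model.prob("william", "Name"), model.prob("william", "Number"))
```

In practice one would also map token shapes to special symbols (e.g. replacing 6941 with ####) before counting, as suggested above.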

Finding structure in addresses
Name | Number | Street
William Cohen, 6941 | Biddle | St
Mr. & Mrs. Steve Zubinsky, | 5641 Darlington | Ave
Examples:
P(william | Name) = pretty high
P(6941 | Name) = pretty low
P(Zubinsky | Name) = low, but so is P(Zubinsky | Number) compared to P(6941 | Number)

Finding structure in addresses
Fields:  Name     Name    Number  Street    Street
Tokens:  William  Cohen   6941    Rosewood  St
Each token has a field variable: which model it was drawn from. Structure-finding is inferring the hidden field-variable values.
Prob(structure) = Prob(f_1, f_2, …, f_K) = ????
Prob(string | structure) = P(t_1 | f_1) * P(t_2 | f_2) * … * P(t_K | f_K)

Finding structure in addresses
Fields:  Name     Name    Number  Street    Street
Tokens:  William  Cohen   6941    Rosewood  St
Each token has a field variable: which model it was drawn from. Structure-finding is inferring the hidden field-variable values.
Prob(structure) = Prob(f_1, f_2, …, f_K) = P(f_1) * Pr(f_2 | f_1) * … * Pr(f_K | f_{K-1})  (a Markov chain over fields)
Prob(string | structure) = P(t_1 | f_1) * … * P(t_K | f_K)
[State diagram: Name -> Num -> Street, with transitions such as Pr(f_i = Num | f_{i-1} = Num) and Pr(f_i = Street | f_{i-1} = Num)]
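To make the scoring concrete, here is a small sketch that multiplies the two factors (field chain and per-field unigram model) for one candidate assignment. The probability tables are invented purely for illustration.

```python
import math

def score_assignment(tokens, fields, start_p, trans_p, emit_p):
    """log P(f_1) + sum_i log P(f_i | f_{i-1}) + sum_i log P(t_i | f_i)."""
    logp = math.log(start_p[fields[0]])
    for prev, cur in zip(fields, fields[1:]):
        logp += math.log(trans_p[prev][cur])
    for tok, f in zip(tokens, fields):
        logp += math.log(emit_p[f].get(tok, 1e-6))   # tiny floor stands in for smoothing
    return logp

# Hypothetical tables for the running example.
start_p = {"Name": 0.9, "Num": 0.05, "Street": 0.05}
trans_p = {"Name":   {"Name": 0.6,  "Num": 0.35, "Street": 0.05},
           "Num":    {"Name": 0.05, "Num": 0.35, "Street": 0.6},
           "Street": {"Name": 0.05, "Num": 0.05, "Street": 0.9}}
emit_p = {"Name":   {"william": 0.02, "cohen": 0.01},
          "Num":    {"6941": 0.001},
          "Street": {"rosewood": 0.005, "st": 0.2}}
tokens = ["william", "cohen", "6941", "rosewood", "st"]
print(score_assignment(tokens, ["Name", "Name", "Num", "Street", "Street"],
                       start_p, trans_p, emit_p))
```

A "good" assignment scores higher than a "not so good" one because both the field sequence and the token emissions are more probable.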

Hidden Markov Models
Hidden Markov model:
– A set of states, each with an emission distribution P(t|f) and a next-state transition distribution P(g|f).
– A designated final state, and a start distribution.
[State diagram: Name -> Num -> Street, with transitions such as Pr(f_i = Num | f_{i-1} = Num). Example emission tables: the Name state emits tokens such as Kumar, Dave, Steve, …; the Num state emits ### with probability 0.345, Apt with 0.123, …; the final state emits $ with probability 1.0.]

Hidden Markov Models
Hidden Markov model:
– A set of states, each with an emission distribution P(t|f) and a next-state transition distribution P(g|f).
– A designated final state, and a start distribution P(f_1).
[State diagram: Name -> Num -> Street, as above]
Generate a string by:
1. Pick f_1 from P(f_1).
2. Pick t_1 by Pr(t | f_1).
3. Pick f_2 by Pr(f_2 | f_1).
4. Repeat…
(A code sketch of this generative procedure follows below.)
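A minimal sketch of the generative procedure, using a hypothetical three-state model with a designated final state "$"; all tables are invented for illustration.

```python
import random

# Hypothetical HMM: states Name -> Num -> Street -> $ (final).
start_p = {"Name": 1.0}
trans_p = {"Name":   {"Name": 0.5, "Num": 0.5},
           "Num":    {"Num": 0.3, "Street": 0.7},
           "Street": {"Street": 0.5, "$": 0.5}}
emit_p = {"Name":   {"william": 0.5, "cohen": 0.5},
          "Num":    {"6941": 1.0},
          "Street": {"rosewood": 0.5, "st": 0.5}}

def generate(start_p, trans_p, emit_p, final_state="$"):
    """1. pick f1 from P(f1); 2. pick t1 from P(t|f1); 3. pick f2 from P(f2|f1); repeat until final."""
    fields, tokens = [], []
    f = random.choices(list(start_p), weights=list(start_p.values()))[0]
    while f != final_state:
        t = random.choices(list(emit_p[f]), weights=list(emit_p[f].values()))[0]
        fields.append(f)
        tokens.append(t)
        f = random.choices(list(trans_p[f]), weights=list(trans_p[f].values()))[0]
    return fields, tokens

print(generate(start_p, trans_p, emit_p))
```

One run might produce, for example, (Name, william), (Name, cohen), (Num, 6941), (Street, rosewood), (Street, st).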

Hidden Markov Models
Generate a string by:
1. Pick f_1 from P(f_1).
2. Pick t_1 by Pr(t | f_1).
3. Pick f_2 by Pr(f_2 | f_1).
4. Repeat…
Example run: (Name, William), (Name, Cohen), (Num, 6941), (Street, Rosewood), (Street, St)

Bayes rule for HMMs
Question: given t_1, …, t_K, what is the most likely sequence of hidden states f_1, …, f_K?
[Trellis diagram: states Name / Num / Str over tokens William Cohen 6941 Rosewd St]

Bayes rule for HMMs
[Trellis diagram: states Name / Num / Str over tokens William Cohen 6941 Rosewd St]
Key observation (Bayes rule): Pr(f_1, …, f_K | t_1, …, t_K) is proportional to Pr(t_1, …, t_K | f_1, …, f_K) * Pr(f_1, …, f_K).

Bayes rule for HMMs
[Trellis diagram: states Name / Num / Str over tokens William Cohen 6941 Rosewd St]
Look at one hidden state: Pr(f_i = s | t_1, …, t_K) is proportional to Pr(t_1, …, t_i, f_i = s) * Pr(t_{i+1}, …, t_K | f_i = s).

Bayes rule for HMMs
Easy to calculate! Compute with dynamic programming…

Forward-Backward
Forward(s, 1) = Pr(f_1 = s)
Forward(s, i+1) = sum over states s' of Forward(s', i) * Pr(t_i | f_i = s') * Pr(f_{i+1} = s | f_i = s')
Backward(s, K) = 1 for the final state s
Backward(s, i) = sum over states s' of Pr(f_{i+1} = s' | f_i = s) * Pr(t_{i+1} | f_{i+1} = s') * Backward(s', i+1)
Then Pr(f_i = s | t_1, …, t_K) is proportional to Forward(s, i) * Pr(t_i | f_i = s) * Backward(s, i).
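A minimal Python sketch of these recursions. The table format matches the hypothetical tables in the earlier sketches, and the designated final state is ignored for simplicity (so Backward is initialized to 1 for every state at the last position).

```python
def forward_backward(tokens, states, start_p, trans_p, emit_p):
    """Posterior Pr(f_i = s | t_1..t_K) via the forward and backward recursions above."""
    K = len(tokens)
    # Forward(s, i): probability of t_1..t_{i-1} and being in state s at position i.
    fwd = [{s: 0.0 for s in states} for _ in range(K)]
    for s in states:
        fwd[0][s] = start_p.get(s, 0.0)
    for i in range(K - 1):
        for s in states:
            fwd[i + 1][s] = sum(fwd[i][sp] * emit_p[sp].get(tokens[i], 0.0)
                                * trans_p[sp].get(s, 0.0) for sp in states)
    # Backward(s, i): probability of t_{i+1}..t_K given state s at position i.
    bwd = [{s: 1.0 for s in states} if i == K - 1 else {s: 0.0 for s in states}
           for i in range(K)]
    for i in range(K - 2, -1, -1):
        for s in states:
            bwd[i][s] = sum(trans_p[s].get(sp, 0.0) * emit_p[sp].get(tokens[i + 1], 0.0)
                            * bwd[i + 1][sp] for sp in states)
    # Combine into a posterior over the hidden field at each position.
    posteriors = []
    for i in range(K):
        scores = {s: fwd[i][s] * emit_p[s].get(tokens[i], 0.0) * bwd[i][s] for s in states}
        z = sum(scores.values()) or 1.0
        posteriors.append({s: v / z for s, v in scores.items()})
    return posteriors
```

Given start, transition, and emission tables like those in the sketches above, forward_backward(tokens, ["Name", "Num", "Street"], start_p, trans_p, emit_p) returns, for each token, a distribution over the hidden fields.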

Forward-Backward
[Trellis diagram: states Name / Num / Str over tokens William Cohen 6941 Rosewd St, illustrating the forward and backward passes]

Viterbi
The sequence of individually most likely hidden states (from Forward-Backward) might not be the most likely sequence of hidden states.
The Viterbi algorithm finds the most likely state sequence:
– An iterative algorithm, similar to the Forward computation
– Uses a max instead of a summation
(A code sketch follows below.)
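A minimal sketch of Viterbi under the same hypothetical table format; this version folds the emission probability into each step, which is the common textbook formulation.

```python
def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Most likely state sequence: like the Forward pass, but with max instead of sum."""
    V = [{s: start_p.get(s, 0.0) * emit_p[s].get(tokens[0], 0.0) for s in states}]
    back = [{}]
    for i in range(1, len(tokens)):
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor state for reaching s at position i.
            best_prev = max(states, key=lambda sp: V[i - 1][sp] * trans_p[sp].get(s, 0.0))
            V[i][s] = (V[i - 1][best_prev] * trans_p[best_prev].get(s, 0.0)
                       * emit_p[s].get(tokens[i], 0.0))
            back[i][s] = best_prev
    # Trace back from the best state at the last position.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for i in range(len(tokens) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```

For long sequences one would work in log space to avoid underflow, but the structure of the recursion is the same.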

Parameter learning with E/M
Expectation-Maximization, for model M and data D with hidden variables H:
– Initialize: pick values for M and H.
– E step: compute the expected values of H given D and M. Here: compute Pr(f_i = s) for each position i and state s.
– M step: pick M to maximize Pr(D, H | M). Here: re-estimate transition probabilities and language models given the estimated probabilities of the hidden state variables.
For HMMs this is called Baum-Welch.
(A sketch of one such iteration follows below.)
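As an illustration, here is a simplified sketch of the emission-only part of one Baum-Welch-style iteration, reusing the forward_backward sketch above. The full algorithm also re-estimates start and transition probabilities from pairwise state posteriors; this version only updates P(t|f).

```python
from collections import defaultdict

def em_emission_step(sequences, states, start_p, trans_p, emit_p):
    """One E/M pass over P(t|f): accumulate expected counts Pr(f_i = s) per token, then normalize."""
    exp_counts = {s: defaultdict(float) for s in states}
    for tokens in sequences:
        # E step: posterior over hidden fields for each token position.
        post = forward_backward(tokens, states, start_p, trans_p, emit_p)
        for i, tok in enumerate(tokens):
            for s in states:
                exp_counts[s][tok] += post[i][s]
    # M step: re-estimate emission distributions from the expected counts.
    new_emit = {}
    for s in states:
        total = sum(exp_counts[s].values()) or 1.0
        new_emit[s] = {tok: c / total for tok, c in exp_counts[s].items()}
    return new_emit
```

Iterating this (together with the corresponding transition updates) until the likelihood stops improving gives the usual Baum-Welch training loop.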

Finding structure in addresses
Fields:  Name     Name    Number  Street    Street
Tokens:  William  Cohen   6941    Rosewood  St
Infer structure with Viterbi (or Forward-Backward).
Train with:
– Labeled data (where f_1, …, f_K is known)
– Unlabeled data (with Baum-Welch)
– Partly-labeled data (e.g. lists of known names from a related source, to estimate the Name state's emission probabilities)
(A sketch of training from labeled data follows below.)
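For the labeled-data case, parameter estimation is just counting and normalizing. A minimal sketch (no smoothing shown here; see the unigram sketch earlier), with a toy example made up for illustration:

```python
from collections import Counter, defaultdict

def train_supervised(labeled_seqs):
    """Estimate start, transition, and emission probabilities from (token, field) sequences."""
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for seq in labeled_seqs:                    # seq = [(token, field), ...]
        fields = [f for _, f in seq]
        start[fields[0]] += 1
        for prev, cur in zip(fields, fields[1:]):
            trans[prev][cur] += 1
        for tok, f in seq:
            emit[f][tok] += 1
    norm = lambda c: {k: v / sum(c.values()) for k, v in c.items()}
    return (norm(start),
            {s: norm(c) for s, c in trans.items()},
            {s: norm(c) for s, c in emit.items()})

# Toy labeled example.
data = [[("william", "Name"), ("cohen", "Name"), ("6941", "Num"),
         ("rosewood", "Street"), ("st", "Street")]]
print(train_supervised(data))
```

For partly-labeled data, one would fix (or initialize) the Name emission table from the external name list and let EM estimate the rest.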

Experiments: Seymore et al.
Adding structure to research-paper title pages.
Data: 1,000 labeled title pages and 2.4M words of BibTeX data.
– Estimate LM parameters with labeled data only, uniform transition probabilities: 64.5% of hidden variables are correct.
– Estimate transition probabilities as well: 85.9%.
– Estimate everything using all the data: 90.5%.
– Use a mixture model to interpolate the BibTeX unigram model and the labeled-data model: 92.4%.

Experiments: Christen & Churches Structuring problem: Australian addresses

Experiments: Christen & Churches
Using the same HMM technique for structuring, with labeled data only for training.

Experiments: Christen & Churches
HMM1 = 1,450 training records
HMM2 = additional records from another source
HMM3 = “unusual records”
AutoStan = rule-based approach “developed over years”

Experiments: Christen & Churches
Second (more regular) dataset: less impressive results, relative to rules.
Figures are min/max/average over 10-fold cross-validation.