Sequence Models With slides by me, Joshua Goodman, Fei Xia

Outline Language Modeling Ngram Models Hidden Markov Models – Supervised Parameter Estimation – Probability of a sequence – Viterbi (or decoding) – Baum-Welch

A bad language model

What is a language model? Language Model: a distribution that assigns a probability to language utterances. e.g., P_LM(“zxcv./,mwea afsido”) is zero; P_LM(“mat cat on the sat”) is tiny; P_LM(“Colorless green ideas sleep furiously”) is bigger; P_LM(“A cat sat on the mat.”) is bigger still.

What’s a language model for? Information Retrieval Handwriting recognition Speech Recognition Spelling correction Optical character recognition Machine translation …

Example Language Model Application Speech Recognition: convert an acoustic signal (sound wave recorded by a microphone) to a sequence of words (text file). Straightforward model: estimate P(text | sound) directly. But this can be hard to train effectively (although see CRFs later).

Example Language Model Application Speech Recognition: convert an acoustic signal (sound wave recorded by a microphone) to a sequence of words (text file). Traditional solution: Bayes’ Rule. Pick the text that maximizes P(sound | text) × P(text). The denominator P(sound) can be ignored: it doesn’t matter for picking a good text. P(sound | text) is the acoustic model (easier to train); P(text) is the language model.
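In symbols, with "sound" the acoustic signal and t ranging over candidate texts, this is the standard noisy-channel decomposition:

```latex
\hat{t} = \arg\max_{t} P(t \mid \text{sound})
        = \arg\max_{t} \frac{P(\text{sound} \mid t)\, P(t)}{P(\text{sound})}
        = \arg\max_{t} \underbrace{P(\text{sound} \mid t)}_{\text{acoustic model}}
          \;\underbrace{P(t)}_{\text{language model}}
```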

Importance of Sequence So far, we’ve been making the exchangeability, or bag-of-words, assumption: The order of words is not important. It turns out, that’s actually not true (duh!). “cat mat on the sat” ≠ “the cat sat on the mat” “Mary loves John” ≠ “John loves Mary”

Language Models with Sequence Information Problem: How can we define a model that (1) assigns probability to sequences of words (a language model), (2) makes the probability depend on the order of the words, and (3) can be trained and computed tractably?
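As a rough illustration of the n-gram answer to this question, here is a minimal bigram-model sketch. The tiny corpus and the lack of smoothing are purely for illustration; a real model needs smoothing, as the next slides discuss.

```python
from collections import defaultdict

# Toy bigram language model: P(sentence) = P(w1|<s>) * P(w2|w1) * ... * P(</s>|wn).
# The corpus below is made up purely for illustration.
corpus = [
    "the cat sat on the mat",
    "the cat ate",
    "a dog sat on the mat",
]

bigram_counts = defaultdict(lambda: defaultdict(int))
unigram_counts = defaultdict(int)
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[prev][word] += 1
        unigram_counts[prev] += 1

def bigram_prob(sentence):
    """Unsmoothed maximum-likelihood bigram probability of a sentence."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if unigram_counts[prev] == 0 or bigram_counts[prev][word] == 0:
            return 0.0  # unseen bigram: this is why smoothing matters
        p *= bigram_counts[prev][word] / unigram_counts[prev]
    return p

print(bigram_prob("the cat sat on the mat"))   # > 0: order matches training data
print(bigram_prob("mat the on sat cat the"))   # 0.0: same bag of words, wrong order
```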

Outline Language Modeling Ngram Models Hidden Markov Models – Supervised parameter estimation – Probability of a sequence (decoding) – Viterbi (Best hidden layer sequence) – Baum-Welch Conditional Random Fields

Smoothing: Kneser-Ney P(Francisco | eggplant) vs. P(stew | eggplant). “Francisco” is common, so backoff and interpolated methods say it is likely, but it occurs almost only in the context of “San”. “Stew” is common, and appears in many contexts. So: weight the backoff probability by the number of contexts a word occurs in.

Kneser-Ney smoothing (cont.) Interpolation and backoff forms:
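For reference, a standard statement of interpolated Kneser-Ney for bigrams, with absolute discount D:

```latex
% Interpolated Kneser-Ney (bigram case), with absolute discount D:
P_{\mathrm{KN}}(w_i \mid w_{i-1})
  = \frac{\max\!\left(c(w_{i-1} w_i) - D,\, 0\right)}{c(w_{i-1})}
    + \lambda(w_{i-1})\, P_{\mathrm{cont}}(w_i)

% backoff weight and continuation probability:
\lambda(w_{i-1}) = \frac{D}{c(w_{i-1})}\,\bigl|\{ w : c(w_{i-1} w) > 0 \}\bigr|
\qquad
P_{\mathrm{cont}}(w_i) = \frac{\bigl|\{ w' : c(w' w_i) > 0 \}\bigr|}
                              {\bigl|\{ (w', w) : c(w' w) > 0 \}\bigr|}
```

The backoff variant uses the discounted estimate when the bigram was observed and falls back to the continuation probability only for unseen bigrams; either way, a word like “Francisco” gets a small continuation probability because it follows very few distinct words.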

Outline Language Modeling Ngram Models Hidden Markov Models – Supervised parameter estimation – Probability of a sequence (decoding) – Viterbi (Best hidden layer sequence) – Baum-Welch Conditional Random Fields

The Hidden Markov Model A dynamic Bayes Net (dynamic because the size can change): a chain of hidden nodes S_1, S_2, …, S_n, each emitting an observation O_1, O_2, …, O_n. The O_i nodes are called observed nodes; the S_i nodes are called hidden nodes.

HMMs and Language Processing HMMs have been used in a variety of applications, but especially: – Speech recognition (hidden nodes are text words, observations are spoken words) – Part of Speech Tagging (hidden nodes are parts of speech, observations are words)

HMM Independence Assumptions HMMs assume that: S_i is independent of S_1 through S_{i-2}, given S_{i-1} (Markov assumption). O_i is independent of all other nodes, given S_i. P(S_i | S_{i-1}) and P(O_i | S_i) do not depend on i. Not very realistic assumptions about language – but HMMs are often good enough, and very convenient.

HMM Formula An HMM predicts that the probability of observing a sequence o = (o_1, …, o_T) together with a particular sequence of hidden states s = (s_1, …, s_T) is: P(o, s) = P(s_1) P(o_1 | s_1) ∏_{i=2..T} P(s_i | s_{i-1}) P(o_i | s_i). To calculate this, we need: - Prior: P(s_1) for all values of s_1 - Observation: P(o_i | s_i) for all values of o_i and s_i - Transition: P(s_i | s_{i-1}) for all values of s_i and s_{i-1}

HMM: Pieces 1) A set of hidden states H = {h_1, …, h_N}: the values which hidden nodes may take. 2) A vocabulary, or set of output symbols, V = {v_1, …, v_M}: the values which an observed node may take. 3) Initial probabilities P(s_1 = h_i) for all i – written as a vector π of N initial probabilities. 4) Transition probabilities P(s_t = h_i | s_{t-1} = h_j) for all i, j – written as an N×N ‘transition matrix’ A. 5) Observation probabilities P(o_t = v_j | s_t = h_i) for all j, i – written as an N×M ‘observation matrix’ B.

HMM for POS Tagging 1) H = {DT, NN, VB, IN, …}, the set of all POS tags. 2) V = the set of all words in English. 3) Initial probabilities π_i : the probability that POS tag h_i can start a sentence. 4) Transition probabilities A_ij : the probability that one tag can follow another. 5) Observation probabilities B_ij : the probability that a tag will generate a particular word.
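A minimal sketch of these pieces in code, with a hypothetical three-tag HMM and made-up probabilities, computing the joint probability from the HMM formula above:

```python
# Tiny POS-tagging HMM with made-up numbers, just to make the pieces concrete.
# pi: P(first tag), A: P(tag_t | tag_{t-1}), B: P(word | tag).
pi = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
A = {
    "DT": {"DT": 0.05, "NN": 0.85, "VB": 0.10},
    "NN": {"DT": 0.10, "NN": 0.30, "VB": 0.60},
    "VB": {"DT": 0.50, "NN": 0.30, "VB": 0.20},
}
B = {
    "DT": {"the": 0.7, "a": 0.3},
    "NN": {"cat": 0.4, "mat": 0.4, "dog": 0.2},
    "VB": {"sat": 0.6, "ate": 0.4},
}

def joint_prob(tags, words):
    """P(words, tags) = P(tag_1) P(word_1|tag_1) * prod_i P(tag_i|tag_{i-1}) P(word_i|tag_i)."""
    p = pi[tags[0]] * B[tags[0]].get(words[0], 0.0)
    for i in range(1, len(tags)):
        p *= A[tags[i - 1]][tags[i]] * B[tags[i]].get(words[i], 0.0)
    return p

print(joint_prob(["DT", "NN", "VB"], ["the", "cat", "sat"]))
```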

Outline Graphical Models Hidden Markov Models – Supervised parameter estimation – Probability of a sequence – Viterbi: what’s the best hidden state sequence? – Baum-Welch: unsupervised parameter estimation Conditional Random Fields

Supervised Parameter Estimation Given an observation sequence and its states, find the HMM model (π, A, and B) that is most likely to produce the sequence. For example, POS-tagged data from the Penn Treebank.
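With tagged data, the estimates are just normalized counts. A minimal sketch, using two made-up tagged sentences in place of real treebank data (a real estimator would also smooth these counts):

```python
from collections import Counter, defaultdict

# Supervised estimation: pi, A, and B are normalized counts from tagged sentences.
tagged_corpus = [
    [("the", "DT"), ("cat", "NN"), ("sat", "VB")],
    [("a", "DT"), ("dog", "NN"), ("ate", "VB")],
]

init_counts = Counter()
trans_counts = defaultdict(Counter)   # trans_counts[prev_tag][tag]
emit_counts = defaultdict(Counter)    # emit_counts[tag][word]

for sentence in tagged_corpus:
    tags = [tag for _, tag in sentence]
    init_counts[tags[0]] += 1
    for word, tag in sentence:
        emit_counts[tag][word] += 1
    for prev, cur in zip(tags, tags[1:]):
        trans_counts[prev][cur] += 1

def normalize(counter):
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}

pi = normalize(init_counts)
A = {tag: normalize(c) for tag, c in trans_counts.items()}
B = {tag: normalize(c) for tag, c in emit_counts.items()}
print(pi)   # {'DT': 1.0}
print(A)    # {'DT': {'NN': 1.0}, 'NN': {'VB': 1.0}}
```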

Bayesian Parameter Estimation

Outline Graphical Models Hidden Markov Models – Supervised parameter estimation – Probability of a sequence – Viterbi – Baum-Welch Conditional Random Fields

What’s the probability of a sentence? Suppose I asked you, ‘What’s the probability of seeing the sentence w_1, …, w_T on the web?’ If we have an HMM model of English, we can use it to estimate the probability. (In other words, HMMs can be used as language models.)

Conditional Probability of a Sentence If we knew the hidden states that generated each word in the sentence, it would be easy: P(o_1, …, o_T | s_1, …, s_T) = ∏_{i=1..T} P(o_i | s_i).

Probability of a Sentence Via marginalization, we have: P(o_1, …, o_T) = Σ over all state sequences s_1, …, s_T of P(o_1, …, o_T, s_1, …, s_T). Unfortunately, if there are N possible values for each s_i (h_1 through h_N), then there are N^T possible state sequences s_1, …, s_T. Brute-force computation of this sum is intractable.

Forward Procedure The special structure of the HMM gives us an efficient solution using dynamic programming. Intuition: the probability of the first t observations, together with the identity of the t-th state, summarizes everything we need about the first t steps; we never have to enumerate whole state sequences. Define: α_i(t) = P(o_1, …, o_t, s_t = h_i). Initialization: α_i(1) = P(s_1 = h_i) P(o_1 | s_1 = h_i). Recurrence: α_j(t+1) = [ Σ_i α_i(t) P(s_{t+1} = h_j | s_t = h_i) ] P(o_{t+1} | s_{t+1} = h_j). Termination: P(o_1, …, o_T) = Σ_i α_i(T).
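A minimal implementation of the forward procedure under the recurrence above, using dictionary-based parameters; the two-state HMM and its numbers are made up for illustration:

```python
def forward(pi, A, B, observations):
    """Forward procedure: alpha[t][s] = P(o_1..o_{t+1}, state at that step = s).
    Returns (alpha table, total probability P(o_1..o_T))."""
    states = list(pi)
    alpha = [{}]
    for s in states:                                    # initialization
        alpha[0][s] = pi[s] * B[s].get(observations[0], 0.0)
    for t in range(1, len(observations)):               # recurrence
        alpha.append({})
        for s in states:
            alpha[t][s] = sum(alpha[t - 1][r] * A[r][s] for r in states) \
                          * B[s].get(observations[t], 0.0)
    return alpha, sum(alpha[-1][s] for s in states)     # termination

# Tiny two-state example with made-up numbers.
pi = {"N": 0.6, "V": 0.4}
A = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
B = {"N": {"cat": 0.5, "mat": 0.5}, "V": {"sat": 1.0}}
alpha, prob = forward(pi, A, B, ["cat", "sat", "mat"])
print(prob)   # sums over all 2**3 state sequences in O(T * N**2) time
```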

Backward Procedure β_i(t) = P(o_{t+1}, …, o_T | s_t = h_i): the probability of the remaining observations, given the state at time t. Computed right to left: β_i(T) = 1, and β_i(t) = Σ_j P(s_{t+1} = h_j | s_t = h_i) P(o_{t+1} | s_{t+1} = h_j) β_j(t+1).

Decoding Solution Three equivalent ways to compute P(o_1, …, o_T): Forward Procedure: Σ_i α_i(T). Backward Procedure: Σ_i P(s_1 = h_i) P(o_1 | s_1 = h_i) β_i(1). Combination: Σ_i α_i(t) β_i(t), for any t.

Outline Graphical Models Hidden Markov Models – Supervised parameter estimation – Probability of a sequence – Viterbi: what’s the best hidden state sequence? – Baum-Welch Conditional Random Fields

Best State Sequence Find the hidden state sequence that best explains the observations: the Viterbi algorithm.

Viterbi Algorithm Define δ_j(t): the probability of the state sequence which maximizes the probability of seeing the observations to time t−1, landing in state j, and seeing the observation at time t: δ_j(t) = max over s_1, …, s_{t−1} of P(o_1, …, o_t, s_1, …, s_{t−1}, s_t = h_j). Recursive computation: δ_j(1) = P(s_1 = h_j) P(o_1 | s_1 = h_j); δ_j(t+1) = [ max_i δ_i(t) P(s_{t+1} = h_j | s_t = h_i) ] P(o_{t+1} | s_{t+1} = h_j), keeping a backpointer ψ_j(t+1) = argmax_i of the same quantity. Compute the most likely state sequence by taking the best final state and working backwards through the backpointers.
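A minimal Viterbi sketch in the same style as the forward procedure above, again with made-up parameters:

```python
def viterbi(pi, A, B, observations):
    """Viterbi: return (best probability, most likely hidden state sequence)."""
    states = list(pi)
    delta = [{s: pi[s] * B[s].get(observations[0], 0.0) for s in states}]
    backpointer = [{}]
    for t in range(1, len(observations)):
        delta.append({})
        backpointer.append({})
        for s in states:
            best_prev = max(states, key=lambda r: delta[t - 1][r] * A[r][s])
            delta[t][s] = delta[t - 1][best_prev] * A[best_prev][s] \
                          * B[s].get(observations[t], 0.0)
            backpointer[t][s] = best_prev
    # Work backwards from the best final state to recover the sequence.
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backpointer[t][path[-1]])
    return delta[-1][last], list(reversed(path))

# Same made-up parameters as the forward example.
pi = {"N": 0.6, "V": 0.4}
A = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.8, "V": 0.2}}
B = {"N": {"cat": 0.5, "mat": 0.5}, "V": {"sat": 1.0}}
print(viterbi(pi, A, B, ["cat", "sat", "mat"]))  # (0.084, ['N', 'V', 'N'])
```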

Outline Graphical Models Hidden Markov Models – Supervised parameter estimation – Probability of a sequence – Viterbi – Baum-Welch: Unsupervised parameter estimation Conditional Random Fields

Unsupervised Parameter Estimation Given an observation sequence, find the model that is most likely to produce that sequence. There is no analytic method. Instead: given a model and an observation sequence, update the model parameters to better fit the observations, and repeat.

Parameter Estimation From the forward and backward values, compute two quantities: the probability of traversing a particular arc (from state i at time t to state j at time t+1), and the probability of being in state i at time t.

Parameter Estimation Now we can compute the new estimates of the model parameters.
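In the standard notation, with a_ij = P(s_{t+1} = h_j | s_t = h_i) and b_j(o) = P(o | s_t = h_j), the expected counts and the re-estimated parameters are:

```latex
% Expected counts under the current model (computed from alpha and beta):
\xi_t(i,j) = \frac{\alpha_i(t)\, a_{ij}\, b_j(o_{t+1})\, \beta_j(t+1)}{P(o_{1:T})}
\qquad
\gamma_i(t) = \frac{\alpha_i(t)\, \beta_i(t)}{P(o_{1:T})}

% Re-estimated parameters:
\hat{\pi}_i = \gamma_i(1) \qquad
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_i(t)} \qquad
\hat{b}_j(v_k) = \frac{\sum_{t \,:\, o_t = v_k} \gamma_j(t)}{\sum_{t=1}^{T} \gamma_j(t)}
```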

Parameter Estimation Guarantee: P(o_{1:T} | A, B, π) ≤ P(o_{1:T} | Â, B̂, π̂). In other words, by repeating this procedure, we can gradually improve how well the HMM fits the unlabeled data. There is no guarantee that this will converge to the best possible HMM, however (it is only guaranteed to find a local maximum).
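A compact sketch of one Baum-Welch update (forward, backward, expected counts, re-estimation) for a single observation sequence. The starting parameters and observation sequence are made up, and no scaling is used, so this is only suitable for short sequences; repeating the update illustrates the guarantee above, since the printed likelihoods never decrease.

```python
import numpy as np

def baum_welch_step(pi, A, B, obs):
    """One EM (Baum-Welch) update for a single observation sequence.

    pi: (N,) initial probabilities; A: (N, N) with A[i, j] = P(next=j | current=i);
    B: (N, M) with B[i, k] = P(symbol k | state i); obs: list of symbol indices.
    Returns updated (pi, A, B) and the likelihood of obs under the *old* model.
    """
    N, T = len(pi), len(obs)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                      # forward pass
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):             # backward pass
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    likelihood = alpha[T - 1].sum()

    gamma = alpha * beta / likelihood          # gamma[t, i] = P(s_t = i | obs)
    xi = np.zeros((T - 1, N, N))               # xi[t, i, j] = P(s_t = i, s_{t+1} = j | obs)
    for t in range(T - 1):
        xi[t] = alpha[t, :, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]
    xi /= likelihood

    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):
        new_B[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    new_B /= gamma.sum(axis=0)[:, None]
    return new_pi, new_A, new_B, likelihood

# Made-up starting parameters; repeated updates never decrease the likelihood.
pi = np.array([0.5, 0.5])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
obs = [0, 1, 1, 0, 0, 1]
for _ in range(5):
    pi, A, B, like = baum_welch_step(pi, A, B, obs)
    print(like)
```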

The Most Important Thing We can use the special structure of this model to do a lot of neat math and solve problems that are otherwise not tractable.