CSC 594 Topics in AI – Natural Language Processing


CSC 594 Topics in AI – Natural Language Processing, Spring 2018. Lecture 11: Hidden Markov Models (some slides adapted from Jurafsky & Martin, and Raymond Mooney at UT Austin)

Hidden Markov Models The Hidden Markov Model (HMM) is a sequence model. A sequence model or sequence classifier assigns a label or class to each unit in a sequence, mapping a sequence of observations to a sequence of labels. An HMM is a probabilistic sequence model: given a sequence of units (e.g. words, letters, morphemes, sentences), it computes a probability distribution over possible label sequences and chooses the best one. It is a kind of generative model.

Sequence Labeling in NLP Many NLP tasks use sequence labeling: speech recognition, part-of-speech tagging, named entity recognition, and information extraction. But not all NLP tasks are sequence labeling tasks – only those where information about earlier items in the sequence affects or helps predict the next one.

Definitions A weighted finite-state automaton adds probabilities to the arcs; the probabilities on the arcs leaving any state must sum to one. A Markov chain is a special case of a weighted finite-state automaton in which the input sequence uniquely determines which states the automaton will go through. Markov chains can’t represent inherently ambiguous problems; they are useful for assigning probabilities to unambiguous sequences.

Markov Chain for Weather

Markov Chain: “First-order observable Markov Model” A set of states Q = q1, q2, …, qN; the state at time t is qt. Transition probabilities: a set of probabilities A = a01, a02, …, an1, …, ann, where each aij represents the probability of transitioning from state i to state j; the set of these forms the transition probability matrix A. A special initial probability vector π. The current state depends only on the previous state (the Markov assumption).
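The Markov assumption and the probability of a whole state sequence can be written compactly. This is a reconstruction in the slide's notation; the equations themselves are standard and do not appear in the transcript:

```latex
% First-order Markov assumption: the next state depends only on the current state
P(q_i \mid q_1, \ldots, q_{i-1}) = P(q_i \mid q_{i-1})

% Probability of a state sequence: the initial probability times a chain of transitions
P(q_1, q_2, \ldots, q_T) = \pi_{q_1} \prod_{t=2}^{T} a_{q_{t-1} q_t}
```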

Markov Chain for Weather What is the probability of 4 consecutive warm days? The weather sequence is warm-warm-warm-warm, i.e. the state sequence 3-3-3-3: P(3,3,3,3) = π3 · a33 · a33 · a33 = 0.2 × (0.6)³ = 0.0432
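A quick numeric check of this calculation. Only two numbers from the slide are needed: the initial probability of warm (π3 = 0.2) and the warm-to-warm transition probability (a33 = 0.6); the rest of the weather chain is not required here.

```python
# P(3,3,3,3) = pi_3 * a_33 * a_33 * a_33 for the weather Markov chain.
pi_warm = 0.2       # initial probability of the "warm" state (value from the slide)
a_warm_warm = 0.6   # warm -> warm transition probability (value from the slide)

p = pi_warm
for _ in range(3):  # four consecutive days are linked by three transitions
    p *= a_warm_warm

print(round(p, 4))  # 0.0432, matching the slide
```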

Hidden Markov Model (HMM) Often, however, we want to know what produced an observed sequence – the hidden sequence behind it. For example: inferring the words (hidden) from the acoustic signal (observed) in speech recognition; assigning part-of-speech tags (hidden) to a sentence, i.e. a sequence of words (POS tagging); assigning named entity categories (hidden) to a sentence (Named Entity Recognition).

Two Kinds of Probabilities State transition probabilities, p(ti | ti-1): the probability of moving from one state to the next. Observation (emission) likelihoods, p(wi | ti): the probability of observing a particular value in a given state.

What’s the state sequence for the observed sequence “1 2 3”?

Three Problems Given this framework, there are three problems we can pose to an HMM. (1) Likelihood: given an observation sequence and a model, what is the probability of that sequence? (2) Decoding: given an observation sequence and a model, what is the most likely state sequence? (3) Learning: given an observation sequence, find the best parameters for a partially specified model.

Problem 1: Observation Likelihood The probability of a sequence given a model. Used in model development: how do I know whether some change I made to the model is making things better? Also used in classification tasks (word spotting in ASR, language identification, speaker identification, author identification, etc.): train one HMM per class, then pass a new observation to each model and compute P(seq | model); the highest-scoring model gives the class.

Problem 2: Decoding Find the most probable state sequence given a model and an observation sequence. Typically used in tagging problems, where the tags correspond to hidden states. As we’ll see, almost any problem can be cast as a sequence labeling problem.

Problem 3: Learning Infer the best model parameters, given a partial model and an observation sequence. That is, fill in the A and B tables with the right numbers: the numbers that make the observation sequence most likely. This is how the probabilities are learned.

Solutions Problem 1 (likelihood): the Forward algorithm. Problem 2 (decoding): the Viterbi algorithm. Problem 3 (learning): the Forward-Backward algorithm, an instance of EM (Expectation-Maximization).

Problem 2: Decoding We want, out of all sequences of n tags t1…tn, the single tag sequence such that P(t1…tn | w1…wn) is highest: t̂1…t̂n = argmax over t1…tn of P(t1…tn | w1…wn). The hat (^) means “our estimate of the best one”; argmaxx f(x) means “the x such that f(x) is maximized”.

Getting to HMMs This equation should give us the best tag sequence, but how do we make it operational? How do we compute this value? The intuition of Bayesian inference: use Bayes' rule to transform the equation into a set of probabilities that are easier to compute (and still give the right answer).

Using Bayes' Rule Know this equation.
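The equation this slide refers to did not survive as text. A reconstruction, following the standard Jurafsky & Martin derivation for tagging:

```latex
\hat{t}_{1}^{n}
  = \arg\max_{t_{1}^{n}} P(t_{1}^{n} \mid w_{1}^{n})
  = \arg\max_{t_{1}^{n}} \frac{P(w_{1}^{n} \mid t_{1}^{n})\, P(t_{1}^{n})}{P(w_{1}^{n})}
  = \arg\max_{t_{1}^{n}} P(w_{1}^{n} \mid t_{1}^{n})\, P(t_{1}^{n})
```

The denominator P(w1…wn) is the same for every candidate tag sequence, so it can be dropped from the argmax.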

Likelihood and Prior
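The likelihood and prior terms are then simplified with two independence assumptions. Again a reconstruction in the standard notation, since the slide's own equations were images:

```latex
% Likelihood: each word depends only on its own tag
P(w_{1}^{n} \mid t_{1}^{n}) \approx \prod_{i=1}^{n} P(w_i \mid t_i)

% Prior: bigram (first-order Markov) assumption over tags
P(t_{1}^{n}) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})

% Combined HMM tagging objective
\hat{t}_{1}^{n} \approx \arg\max_{t_{1}^{n}}
  \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
```

These two factors are exactly the two kinds of probabilities introduced earlier: transitions p(ti | ti-1) and emissions p(wi | ti).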

Definition of HMM States Q = q1, q2, …, qN. Observations O = o1, o2, …, oN; each observation is a symbol drawn from a vocabulary V = {v1, v2, …, vV}. Transition probabilities: the transition probability matrix A = {aij}. Observation likelihoods: the output (emission) probability matrix B = {bi(k)}. Special initial probability vector π.
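The five components map directly onto a small container type. A minimal sketch; the class and field names are mine, not from the slides:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class HMM:
    """A discrete HMM: states Q, vocabulary V, transitions A, emissions B, initial distribution pi."""
    states: List[str]                    # Q = q1, ..., qN
    vocab: List[int]                     # V = {v1, ..., vV}; observation symbols (ice-cream counts in the running example)
    trans: Dict[Tuple[str, str], float]  # A: trans[(i, j)] = P(next state j | current state i)
    emit: Dict[Tuple[str, int], float]   # B: emit[(i, k)]  = P(observing symbol k | state i)
    init: Dict[str, float]               # pi: init[i]      = P(starting in state i)
```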

HMMs for Ice Cream You are a climatologist in the year 2799 studying global warming. You can’t find any records of the weather in Baltimore for the summer of 2007, but you do find Jason Eisner’s diary, which lists how many ice creams Jason ate every day that summer. Your job: figure out how hot it was each day.

Eisner Task Given the ice cream observation sequence 1, 2, 3, 2, 2, 2, 3, …, produce the hidden weather sequence H, C, H, H, H, C, C, …

HMM for Ice Cream
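This slide is a figure that is not part of the transcript. For the sketches that follow, here are parameter values taken from the Jurafsky & Martin ice-cream figure (second edition); they are consistent with the worked CHC example two slides down, but the exact numbers should be treated as an assumption:

```python
# Ice-cream HMM parameters (assumed; the slide's figure is missing from the transcript).
# States: "H" (hot day) and "C" (cold day).  Observations: 1, 2, or 3 ice creams eaten.

init = {"H": 0.8, "C": 0.2}             # pi: P(the first day is hot / cold)

# A: transition probabilities.  Each row sums to 0.9 because the original figure
# reserves the remaining 0.1 for an explicit end state, omitted here.
trans = {"H": {"H": 0.6, "C": 0.3},
         "C": {"H": 0.4, "C": 0.5}}

# B: emission probabilities P(number of ice creams | weather).
emit = {"H": {1: 0.2, 2: 0.4, 3: 0.4},
        "C": {1: 0.5, 2: 0.4, 3: 0.1}}
```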

Ice Cream HMM Let’s just use 1 3 1 as the observation sequence. How many underlying state (hot/cold) sequences are there? Eight: HHH, HHC, HCH, HCC, CCC, CCH, CHC, CHH. How do you pick the right one? Argmax over state sequences of P(sequence | 1 3 1).

Ice Cream HMM Let’s just do one state sequence: CHC. Cold as the initial state: P(Cold | Start) = 0.2. Observing a 1 on a cold day: P(1 | Cold) = 0.5. Hot as the next state: P(Hot | Cold) = 0.4. Observing a 3 on a hot day: P(3 | Hot) = 0.4. Cold as the next state: P(Cold | Hot) = 0.3. Observing a 1 on a cold day: P(1 | Cold) = 0.5. Product: 0.2 × 0.5 × 0.4 × 0.4 × 0.3 × 0.5 = 0.0024.
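A sketch that redoes this arithmetic and then brute-forces all eight hidden sequences for the observation 1 3 1, using the assumed parameters from the earlier ice-cream sketch:

```python
from itertools import product

# Assumed ice-cream HMM parameters (see the earlier sketch; the original figure is missing).
init  = {"H": 0.8, "C": 0.2}
trans = {"H": {"H": 0.6, "C": 0.3}, "C": {"H": 0.4, "C": 0.5}}
emit  = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

def joint(states_seq, observations):
    """P(state sequence, observation sequence) = pi * emission * (transition * emission) * ..."""
    p = init[states_seq[0]] * emit[states_seq[0]][observations[0]]
    for prev, cur, obs in zip(states_seq, states_seq[1:], observations[1:]):
        p *= trans[prev][cur] * emit[cur][obs]
    return p

obs = [1, 3, 1]
print(joint("CHC", obs))   # 0.0024, as computed on the slide

# Brute force over all 2^3 = 8 hidden sequences; the argmax is what decoding has to find.
best = max(product("HC", repeat=len(obs)), key=lambda seq: joint(seq, obs))
print("".join(best), joint(best, obs))   # the most probable hidden sequence under these assumed numbers
```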

But there are other state sequences that could generate 1 3 1, for example HHC.

Viterbi Algorithm Create an array with columns corresponding to observations and rows corresponding to possible hidden states. Recursively compute the probability of the most likely subsequence of states that accounts for the first t observations and ends in state sj. Also record “backpointers” that later allow backtracing the most probable state sequence.

The Viterbi Algorithm
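The slide's formal statement of the algorithm appears to be a figure that did not survive the transcript. A minimal Python sketch that follows the description above, using the assumed ice-cream parameters:

```python
# Viterbi decoding for the ice-cream HMM (assumed parameters; not taken from the missing figure).
init   = {"H": 0.8, "C": 0.2}
trans  = {"H": {"H": 0.6, "C": 0.3}, "C": {"H": 0.4, "C": 0.5}}
emit   = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}
states = ["H", "C"]

def viterbi(observations):
    """Return the most probable hidden state sequence and its joint probability."""
    # v[t][s]: probability of the best path that covers the first t+1 observations and ends in s
    v = [{s: init[s] * emit[s][observations[0]] for s in states}]
    backpointer = [{}]
    for t in range(1, len(observations)):
        v.append({})
        backpointer.append({})
        for s in states:
            # MAX step: best previous state from which to arrive in s at time t
            prev = max(states, key=lambda r: v[t - 1][r] * trans[r][s])
            v[t][s] = v[t - 1][prev] * trans[prev][s] * emit[s][observations[t]]
            backpointer[t][s] = prev
    # Termination: pick the best final state, then follow the backpointers to recover the path
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(backpointer[t][path[-1]])
    return "".join(reversed(path)), v[-1][last]

print(viterbi([1, 3, 1]))   # most likely weather sequence for the diary entries 1, 3, 1
```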

Viterbi Example: Ice Cream

Viterbi Example

Viterbi Backpointers (trellis figure: states s0, s1, s2, …, sN, sF across observations t1, t2, t3, …, tT-1, tT, with backpointers recorded at each cell)

Viterbi Backtrace (trellis figure: following the backpointers back from the best final state). Most likely sequence: s0, s1, s2, …, s2, sF

Problem 1: Forward Given an observation sequence, return the probability of that sequence under the model. In an ordinary Markov model the states and the observations are identical, so the probability of a sequence is just the probability of its state path. Not so in an HMM: any number of hidden state sequences might be responsible for a given observation sequence.

Forward Efficiently computes the probability of an observed sequence given a model, P(sequence | model). Nearly identical to Viterbi: replace the MAX with a SUM.

The Forward Algorithm
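Again the pseudocode on the slide is a figure that is not in the transcript. A minimal sketch for the same assumed ice-cream parameters; it is the Viterbi sketch above with the max over previous states replaced by a sum:

```python
# Forward algorithm for the ice-cream HMM (assumed parameters; not taken from the missing figure).
init   = {"H": 0.8, "C": 0.2}
trans  = {"H": {"H": 0.6, "C": 0.3}, "C": {"H": 0.4, "C": 0.5}}
emit   = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}
states = ["H", "C"]

def forward(observations):
    """Return P(observations | model), summing over all possible hidden state sequences."""
    # alpha[s]: probability of the observations so far, jointly with ending in state s
    alpha = {s: init[s] * emit[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: sum(alpha[r] * trans[r][s] for r in states) * emit[s][obs]
                 for s in states}
    return sum(alpha.values())

print(forward([1, 3, 1]))   # total probability of observing 1, 3, 1 under the model
```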

Visualizing Forward

Problem 3 Infer the best model parameters, given a skeletal model and an observation sequence. That is, fill in the A and B tables with the right numbers: the numbers that make the observation sequence most likely. Useful for getting an HMM without having to hire annotators.

Solutions Problem 1: Forward. Problem 2: Viterbi. Problem 3: EM, i.e. the Forward-Backward (Baum-Welch) algorithm.

Forward-Backward Baum-Welch = Forward-Backward algorithm (Baum 1972). It is a special case of the EM (Expectation-Maximization) algorithm. The algorithm lets us train the transition probabilities A = {aij} and the emission probabilities B = {bi(ot)} of the HMM.

Intuition for re-estimation of aij We will estimate âij via this intuition. Numerator: assume we had some estimate of the probability that a given transition i→j was taken at time t in an observation sequence. If we knew this probability for each time t, we could sum over all t to get the expected count for i→j.

Intuition for re-estimation of aij

Re-estimating aij The expected number of transitions from state i to state j is the sum over all t of ξt(i, j). The total expected number of transitions out of state i is the sum over all t of all transitions out of state i, i.e. the sum over all t and all destination states k of ξt(i, k). The final formula for the re-estimated aij is the ratio of these two quantities, written out below.
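Written out in the standard notation (a reconstruction; the slide's own formulas are images that are not in the transcript). Here αt(i) is the forward probability and βt(j) the backward probability:

```latex
% Probability that the transition i -> j is taken at time t, given observations O and model \lambda
\xi_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}

% Backward probability, computed right to left (the mirror image of the forward pass)
\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \qquad \beta_T(i) = 1

% Re-estimation: expected i -> j transitions divided by expected transitions out of i
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i, j)}{\sum_{t=1}^{T-1} \sum_{k=1}^{N} \xi_t(i, k)}
```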

The Forward-Backward Algorithm

Sketch of Baum-Welch (EM) Algorithm for Training HMMs Assume an HMM with N states. Randomly set its parameters λ = (A, B), making sure they represent legal distributions. Until convergence (i.e. λ no longer changes), do: E-step: use the forward/backward procedure to determine the probability of the various possible state sequences for generating the training data. M-step: use these probability estimates to re-estimate values for all of the parameters λ. A runnable sketch of this loop follows.
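A compact sketch of this E-step / M-step loop for a two-state ice-cream HMM. The observation sequence and the random initialization are made up for illustration; the update formulas are the standard Baum-Welch ones, not code from the slides:

```python
import random

# Baum-Welch (forward-backward) training sketch for a two-state ice-cream HMM.
random.seed(0)
states, vocab = ["H", "C"], [1, 2, 3]

def rand_dist(keys):
    """A random but legal probability distribution over the given keys."""
    w = [random.random() for _ in keys]
    return {k: x / sum(w) for k, x in zip(keys, w)}

init  = rand_dist(states)                       # pi, randomly initialized
trans = {s: rand_dist(states) for s in states}  # A, randomly initialized
emit  = {s: rand_dist(vocab) for s in states}   # B, randomly initialized

obs = [3, 1, 3, 2, 3, 2, 1, 1, 2, 3, 3, 1]      # a made-up ice-cream diary
T = len(obs)

for _ in range(50):                             # fixed iteration budget instead of a convergence test
    # E-step: forward probabilities alpha
    alpha = [{s: init[s] * emit[s][obs[0]] for s in states}]
    for t in range(1, T):
        alpha.append({s: sum(alpha[t - 1][r] * trans[r][s] for r in states) * emit[s][obs[t]]
                      for s in states})
    # E-step: backward probabilities beta
    beta = [dict.fromkeys(states, 1.0) for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = {s: sum(trans[s][r] * emit[r][obs[t + 1]] * beta[t + 1][r] for r in states)
                   for s in states}
    total = sum(alpha[T - 1][s] for s in states)         # P(O | lambda)
    # Expected state occupancies (gamma) and expected transitions (xi)
    gamma = [{s: alpha[t][s] * beta[t][s] / total for s in states} for t in range(T)]
    xi = [{(i, j): alpha[t][i] * trans[i][j] * emit[j][obs[t + 1]] * beta[t + 1][j] / total
           for i in states for j in states} for t in range(T - 1)]
    # M-step: re-estimate pi, A, and B from the expected counts
    init  = {s: gamma[0][s] for s in states}
    trans = {i: {j: sum(x[(i, j)] for x in xi) / sum(g[i] for g in gamma[:-1]) for j in states}
             for i in states}
    emit  = {s: {v: sum(g[s] for t, g in enumerate(gamma) if obs[t] == v) / sum(g[s] for g in gamma)
                 for v in vocab} for s in states}

print("P(O | lambda) at the last E-step:", total)
```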

EM Properties Each iteration changes the parameters in a way that is guaranteed to increase (or at least not decrease) the likelihood of the data, P(O | λ). Anytime algorithm: you can stop at any time prior to convergence and get an approximate solution. Converges to a local maximum.

Summary: Forward-Backward Algorithm (1) Initialize λ = (A, B). (2) Compute α, β, and ξ using the observations. (3) Estimate the new λ′ = (A, B). (4) Replace λ with λ′. (5) If not converged, go to step 2.