1 HMM (I) LING 570 Fei Xia Week 7: 11/5-11/7/07

2 HMM
Definition and properties of HMM
– Two types of HMM
Three basic questions in HMM

3 Definition of HMM

4 Hidden Markov Models
There are N states s_1, …, s_N in an HMM, and the states are connected.
The output symbols are produced by the states or the edges of the HMM.
An observation O = (o_1, …, o_T) is a sequence of output symbols.
Given an observation, we want to recover the hidden state sequence.
An example: POS tagging
– States are POS tags
– Output symbols are words
– Given an observation (i.e., a sentence), we want to recover the tag sequence.

5 Same observation, different state sequences
time   flies   like   an    arrow
N      V       P      DT    N
N      N       V      DT    N

6 Two types of HMMs
State-emission HMM (Moore machine):
– The output symbol is produced by the states: by the from-state or by the to-state.
Arc-emission HMM (Mealy machine):
– The output symbol is produced by the edges, i.e., by the (from-state, to-state) pairs.

7 PFA recap

8 Formal definition of PFA
A PFA is a tuple (Q, Σ, I, F, δ, P):
Q: a finite set of N states
Σ: a finite set of input symbols
I: Q → R+ (initial-state probabilities)
F: Q → R+ (final-state probabilities)
δ ⊆ Q × Σ × Q: the transition relation between states
P: δ → R+ (transition probabilities)

9 Constraints on the functions:
Σ_{q∈Q} I(q) = 1
For each state q: F(q) + Σ_{a∈Σ, q'∈Q} P(q, a, q') = 1
Probability of a string x = x_1 … x_n:
P(x) = Σ over state sequences q_0, …, q_n  I(q_0) · ∏_{i=1..n} P(q_{i-1}, x_i, q_i) · F(q_n)

10 An example of PFA
(figure: states q_0 and q_1, with arc a:1.0 from q_0 to q_1 and self-loop b:0.8 on q_1)
I(q_0) = 1.0   I(q_1) = 0.0
F(q_0) = 0     F(q_1) = 0.2
P(ab^n) = I(q_0) * P(q_0, ab^n, q_1) * F(q_1) = 1.0 * 1.0 * 0.8^n * 0.2
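The string probability above can be checked with a tiny Python sketch (not from the slides; the function name pfa_prob_ab_n is illustrative only):

    # Hedged sketch: P(ab^n) for the example PFA = I(q0) * 1.0 * 0.8^n * F(q1)
    def pfa_prob_ab_n(n):
        I_q0 = 1.0     # initial-state probability of q0
        p_a = 1.0      # transition q0 --a--> q1
        p_b = 0.8      # self-loop q1 --b--> q1
        F_q1 = 0.2     # final-state probability of q1
        return I_q0 * p_a * (p_b ** n) * F_q1

    # e.g., pfa_prob_ab_n(0) == 0.2, pfa_prob_ab_n(2) == 0.128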

11 Arc-emission HMM

12 Definition of arc-emission HMM
An HMM is a tuple (S, Σ, π, A, B):
– A set of states S = {s_1, s_2, …, s_N}.
– A set of output symbols Σ = {w_1, …, w_M}.
– Initial state probabilities π = {π_i}.
– Transition probabilities A = {a_ij}.
– Emission probabilities B = {b_ijk}.

13 Constraints in an arc-emission HMM
Σ_i π_i = 1
For each state s_i: Σ_j a_ij = 1
For each state pair (s_i, s_j): Σ_k b_ijk = 1
For any integer n and any HMM, the probabilities of all output sequences of length n sum to one:
Σ over all O_1,n of P(O_1,n) = 1

14 An example: HMM structure
(figure: states s_1, s_2, …, s_N connected by arcs labeled with output symbols w_1, …, w_5)
Same kinds of parameters, but the emission probabilities depend on both states: P(w_k | s_i, s_j)
⇒ # of parameters: O(N^2 M + N^2)

15 A path in an arc-emission HMM
(figure: X_1 → X_2 → … → X_n → X_{n+1}, with o_1, o_2, …, o_n emitted on the arcs)
State sequence: X_1,n+1
Output sequence: O_1,n

16 PFA vs. arc-emission HMM
A PFA is a tuple (Q, Σ, I, F, δ, P):
Q: a finite set of N states
Σ: a finite set of input symbols
I: Q → R+ (initial-state probabilities)
F: Q → R+ (final-state probabilities)
δ ⊆ Q × Σ × Q: the transition relation between states
P: δ → R+ (transition probabilities)
An HMM is a tuple (S, Σ, π, A, B):
– A set of states S = {s_1, s_2, …, s_N}.
– A set of output symbols Σ = {w_1, …, w_M}.
– Initial state probabilities π = {π_i}.
– Transition probabilities A = {a_ij}.
– Emission probabilities B = {b_ijk}.

17 State-emission HMM

18 Definition of state-emission HMM
An HMM is a tuple (S, Σ, π, A, B):
– A set of states S = {s_1, s_2, …, s_N}.
– A set of output symbols Σ = {w_1, …, w_M}.
– Initial state probabilities π = {π_i}.
– Transition probabilities A = {a_ij}.
– Emission probabilities B = {b_jk}.
We use s_i and w_k to refer to what is in an HMM structure.
We use X_i and O_i to refer to what is in a particular HMM path and its output.

19 Constraints in a state-emission HMM
Σ_i π_i = 1
For each state s_i: Σ_j a_ij = 1
For each state s_j: Σ_k b_jk = 1
For any integer n and any HMM, the probabilities of all output sequences of length n sum to one:
Σ over all O_1,n of P(O_1,n) = 1

20 An example: the HMM structure
(figure: states s_1, s_2, …, s_N, each emitting output symbols such as w_1, …, w_5)
Two kinds of parameters:
Transition probability: P(s_j | s_i)
Emission probability: P(w_k | s_i)
⇒ # of parameters: O(NM + N^2)

21 Output symbols are generated by the from-states
(figure: X_1 → X_2 → … → X_n, with each X_t emitting o_t)
State sequence: X_1,n
Output sequence: O_1,n

22 Output symbols are generated by the to-states
(figure: X_1 → X_2 → … → X_{n+1}, with each X_{t+1} emitting o_t)
State sequence: X_1,n+1
Output sequence: O_1,n

23 A path in a state-emission HMM
Output symbols are produced by the from-states:
(figure: X_1 → X_2 → … → X_n, with each X_t emitting o_t)
Output symbols are produced by the to-states:
(figure: X_1 → X_2 → … → X_{n+1}, with each X_{t+1} emitting o_t)

24 Arc-emission vs. state-emission
(figures: the same output sequence o_1, …, o_n generated on the arcs X_t → X_{t+1} in one case, and by the states in the other)

25 Properties of HMM
Markov assumption (limited horizon): P(X_{t+1} = s_j | X_1, …, X_t) = P(X_{t+1} = s_j | X_t)
Stationary distribution (time invariance): the probabilities do not change over time: P(X_{t+1} = s_j | X_t = s_i) is the same for all t.
The states are hidden because we know the structure of the machine (i.e., S and Σ), but we don’t know which state sequence generates a particular output.

26 Are the two types of HMMs equivalent?
For each state-emission HMM_1, there is an arc-emission HMM_2 such that for any sequence O, P(O | HMM_1) = P(O | HMM_2).
The reverse is also true.
How to prove that?

27 Applications of HMM
N-gram POS tagging
– Bigram tagger: o_i is a word, and s_i is a POS tag.
Other tagging problems:
– Word segmentation
– Chunking
– NE tagging
– Punctuation prediction
– …
Other applications: ASR, …

28 Three HMM questions

29 Three fundamental questions for HMMs
Training an HMM: given a set of observation sequences, learn its distribution, i.e., learn the transition and emission probabilities.
HMM as a parser: find the best state sequence for a given observation.
HMM as an LM: compute the probability of a given observation.

30 Training an HMM: estimating the probabilities
Supervised learning:
– The state sequences in the training data are known
– ML estimation (see the counting sketch below)
Unsupervised learning:
– The state sequences in the training data are unknown
– Forward-backward algorithm
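A minimal Python sketch (not part of the slides) of the supervised case: ML estimation reduces to relative-frequency counting over tagged sentences. The function mle_train and the (word, tag) input format are assumptions for illustration.

    # Hedged sketch: supervised MLE for a state-emission (bigram) HMM.
    # tagged_sents is a list of sentences, each a list of (word, tag) pairs.
    from collections import defaultdict

    def mle_train(tagged_sents):
        trans_count = defaultdict(lambda: defaultdict(int))   # count(s_i -> s_j)
        emit_count  = defaultdict(lambda: defaultdict(int))   # count(s_i emits w_k)
        init_count  = defaultdict(int)                        # count(sentence starts with s_i)
        for sent in tagged_sents:
            prev = None
            for word, tag in sent:
                emit_count[tag][word] += 1
                if prev is None:
                    init_count[tag] += 1
                else:
                    trans_count[prev][tag] += 1
                prev = tag
        # turn counts into relative frequencies
        pi = {t: c / sum(init_count.values()) for t, c in init_count.items()}
        a  = {i: {j: c / sum(js.values()) for j, c in js.items()} for i, js in trans_count.items()}
        b  = {i: {w: c / sum(ws.values()) for w, c in ws.items()} for i, ws in emit_count.items()}
        return pi, a, b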

31 HMM as a parser

32 HMM as a parser: finding the best state sequence
Given the observation O_1,T = o_1 … o_T, find the state sequence X_1,T+1 = X_1 … X_{T+1} that maximizes P(X_1,T+1 | O_1,T).
⇒ Viterbi algorithm
(figure: X_1 → X_2 → … → X_{T+1}, emitting o_1, o_2, …, o_T)

33 “time flies like an arrow”
\init
BOS 1.0
\transition
BOS N 0.5
BOS DT 0.4
BOS V 0.1
DT N 1.0
N N 0.2
N V 0.7
N P 0.1
V DT 0.4
V N 0.4
V P 0.1
V V 0.1
P DT 0.6
P N 0.4
\emission
N time 0.1
V time 0.1
N flies 0.1
V flies 0.2
V like 0.2
P like 0.1
DT an 0.3
N arrow 0.1
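A small reading sketch (an assumption about the file layout shown above, not the official Hw7 spec): it expects \init, \transition, and \emission sections with whitespace-separated entries.

    # Hedged sketch of reading the HMM file format shown above (assumed layout).
    def read_hmm(path):
        pi, a, b = {}, {}, {}
        section = None
        with open(path) as f:
            for line in f:
                tokens = line.split()
                if not tokens:
                    continue
                if tokens[0].startswith("\\"):      # \init, \transition, \emission
                    section = tokens[0]
                    tokens = tokens[1:]              # entries may follow on the same line
                if not tokens:
                    continue
                if section == "\\init":
                    pi[tokens[0]] = float(tokens[1])
                elif section == "\\transition":
                    a.setdefault(tokens[0], {})[tokens[1]] = float(tokens[2])
                elif section == "\\emission":
                    b.setdefault(tokens[0], {})[tokens[1]] = float(tokens[2])
        return pi, a, b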

34 Finding all the paths: building the trellis
(figure: a trellis with columns for “time flies like an arrow” and rows for the states BOS, N, V, P, DT)

35 Finding all the paths (cont)
(figure: the same trellis, with the possible transitions between states drawn in)

36 Viterbi algorithm
The probability of the best path that produces O_1,t-1 while ending up in state s_j:
δ_j(t) = max over X_1,t-1 of P(X_1,t-1, O_1,t-1, X_t = s_j)
Initialization: δ_j(1) = π_j
Induction: δ_j(t+1) = max_i δ_i(t) a_ij b_jk, where o_t = w_k
⇒ Modify it to allow ε-emission

37 Proof of the recursive function

38 Viterbi algorithm: calculating δ_j(t)
# N is the number of states in the HMM structure
# observ is the observation O, and leng is the length of observ.
Initialize viterbi[0..leng][0..N-1] to 0
for each state j
    viterbi[0][j] = π[j]
    back-pointer[0][j] = -1      # dummy
for (t=0; t<leng; t++)
    for (j=0; j<N; j++)
        k = observ[t]            # the symbol at time t
        viterbi[t+1][j] = max_i viterbi[t][i] * a_ij * b_jk
        back-pointer[t+1][j] = arg max_i viterbi[t][i] * a_ij * b_jk

39 Viterbi algorithm: retrieving the best path
# find the best path
best_final_state = arg max_j viterbi[leng][j]
# start with the last state in the sequence
j = best_final_state
push(arr, j)
for (t=leng; t>0; t--)
    i = back-pointer[t][j]
    push(arr, i)
    j = i
return reverse(arr)
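A compact Python sketch of the two pseudocode slides above, for a state-emission HMM with dict-based parameters. The name viterbi_decode and the pi/a/b dictionaries are illustrative assumptions, not the required Hw8 interface.

    # Hedged sketch of Viterbi decoding (cf. slides 38-39).
    # pi, a, b are dicts: pi[s], a[s_from][s_to], b[s][word]; missing entries count as 0.
    # Assumes the observation has non-zero probability under the HMM.
    def viterbi_decode(observ, pi, a, b, states):
        T = len(observ)
        viterbi = [{s: 0.0 for s in states} for _ in range(T + 1)]
        backptr = [{s: None for s in states} for _ in range(T + 1)]
        for s in states:
            viterbi[0][s] = pi.get(s, 0.0)
        for t in range(T):
            w = observ[t]
            for j in states:
                best_i, best_p = None, 0.0
                for i in states:
                    p = viterbi[t][i] * a.get(i, {}).get(j, 0.0) * b.get(j, {}).get(w, 0.0)
                    if p > best_p:
                        best_i, best_p = i, p
                viterbi[t + 1][j] = best_p
                backptr[t + 1][j] = best_i
        # follow the back-pointers from the best final state
        best_final = max(states, key=lambda s: viterbi[T][s])
        best_prob = viterbi[T][best_final]
        path, j = [best_final], best_final
        for t in range(T, 0, -1):
            j = backptr[t][j]
            path.append(j)
        path.reverse()
        return path, best_prob

With the HMM on slide 33, observ would be ['time', 'flies', 'like', 'an', 'arrow'] and states the tag set plus BOS; since only BOS has non-zero π, any returned path starts with BOS.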

40 Hw7 and Hw8
Hw7: write an HMM “class”:
– Read HMM input file
– Output HMM
Hw8: implement the algorithms for two HMM tasks:
– HMM as parser: Viterbi algorithm
– HMM as LM: the probability of an observation

41 Implementation issue: storing the HMM
Approach #1 (hash on strings):
π_i:  pi{state_str}
a_ij: a{from_state_str}{to_state_str}
b_jk: b{state_str}{symbol}
Approach #2 (map strings to indices):
state2idx{state_str} = state_idx
symbol2idx{symbol_str} = symbol_idx
π_i:  pi[state_idx] = prob
a_ij: a[from_state_idx][to_state_idx] = prob
b_jk: b[state_idx][symbol_idx] = prob
idx2state[state_idx] = state_str
idx2symbol[symbol_idx] = symbol_str
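A small sketch (assumed, not from the slides) of Approach #2: mapping state and symbol strings to integer indices and filling dense tables, so the decoding loops can run over integers.

    # Hedged sketch: index the states and symbols, then fill dense tables.
    def index_hmm(pi_str, a_str, b_str):
        states  = sorted(set(pi_str) | set(a_str) | {j for js in a_str.values() for j in js} | set(b_str))
        symbols = sorted({w for ws in b_str.values() for w in ws})
        state2idx  = {s: i for i, s in enumerate(states)}   # idx2state is just the states list
        symbol2idx = {w: k for k, w in enumerate(symbols)}
        N, M = len(states), len(symbols)
        pi = [0.0] * N
        a  = [[0.0] * N for _ in range(N)]
        b  = [[0.0] * M for _ in range(N)]
        for s, p in pi_str.items():
            pi[state2idx[s]] = p
        for i, js in a_str.items():
            for j, p in js.items():
                a[state2idx[i]][state2idx[j]] = p
        for s, ws in b_str.items():
            for w, p in ws.items():
                b[state2idx[s]][symbol2idx[w]] = p
        return states, symbols, pi, a, b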

42 Storing HMM: sparse matrix
Dense:
a_ij: a[i][j] = prob
b_jk: b[j][k] = prob
Sparse (store only non-zero entries):
a_ij: a[i] = “j1 p1 j2 p2 …”  or  a[j] = “i1 p1 i2 p2 …”
b_jk: b[j] = “k1 p1 k2 p2 …”  or  b[k] = “j1 p1 j2 p2 …”

43 Other implementation issues
Indices start from 0 in programming, but often start from 1 in algorithm descriptions.
The sum of log probs is used in practice to replace the product of probs (to avoid underflow).
Check the constraints and print a warning if they are not met.
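A short illustration of why log probabilities are used: the product of many small probabilities underflows, while the sum of their logs stays representable (natural logs assumed; any base works).

    # Products of many small probabilities underflow; sums of their logs do not.
    import math

    probs = [0.1] * 400
    product = 1.0
    for p in probs:
        product *= p          # underflows to 0.0
    logprob = sum(math.log(p) for p in probs)   # about -921.03, still representable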

44 HMM as LM

45 HMM as an LM: computing P(o_1, …, o_T)
1st try:
- enumerate all possible paths
- add up the probabilities of all paths
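For reference (not on the original slide), the quantity being computed and why direct enumeration is too expensive, under the formulation where output symbols are produced by the to-states:
P(O_1,T) = Σ over all state sequences X_1,T+1 of  π_{X_1} · ∏_{t=1..T} a_{X_t X_{t+1}} b_{X_{t+1} o_t}
There are N^(T+1) such sequences, so enumeration is exponential in the observation length; the forward probabilities on the next slides compute the same sum in O(N^2 T) time.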

46 Forward probabilities
Forward probability: the probability of producing O_1,t-1 while ending up in state s_i:
α_i(t) = P(O_1,t-1, X_t = s_i)

47 Calculating forward probability
Initialization: α_j(1) = π_j
Induction: α_j(t+1) = Σ_i α_i(t) a_ij b_jk, where o_t = w_k
P(O_1,T) = Σ_j α_j(T+1)
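A minimal Python sketch (assumed, parallel to the Viterbi sketch earlier) of the forward computation, using the same dict-based pi, a, b parameters:

    # Hedged sketch of the forward algorithm: P(O) as a sum over all paths.
    def forward_prob(observ, pi, a, b, states):
        T = len(observ)
        alpha = [{s: 0.0 for s in states} for _ in range(T + 1)]
        for s in states:
            alpha[0][s] = pi.get(s, 0.0)
        for t in range(T):
            w = observ[t]
            for j in states:
                alpha[t + 1][j] = sum(
                    alpha[t][i] * a.get(i, {}).get(j, 0.0) * b.get(j, {}).get(w, 0.0)
                    for i in states
                )
        return sum(alpha[T].values())    # P(O_1,T)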

48

49 Summary
Definition: hidden states, output symbols
Properties: Markov assumption
Applications: POS tagging, etc.
Three basic questions in HMM
– Find the probability of an observation: forward probability
– Find the best sequence: Viterbi algorithm
– Estimate probabilities: MLE (supervised) or forward-backward (unsupervised)
Bigram POS tagger: decoding with the Viterbi algorithm