Hidden Markov Models (HMMs)

Hidden Markov Models (HMMs) ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

Central Questions to Ask about a LM: "ADMI"
- Application: Why do you need a LM? For what purpose? (Determines the evaluation metric for the LM.) Here: topic mining and analysis, sequential structure discovery.
- Data: What kind of data do you want to model? (Determines the data set for estimation & evaluation.) Here: a sequence of words.
- Model: How do you define the model? (Determines the assumptions to be made.) Here: latent states with a Markov assumption.
- Inference: How do you infer/estimate the parameters? (Determines the inference/estimation algorithm.) Here: Viterbi, Baum-Welch.

Modeling a Multi-Topic Document
A document with 2 subtopics, thus 2 types of vocabulary: e.g., a "text mining" passage followed by a "food nutrition" passage.
- How do we model such a document?
- How do we "generate" such a document?
- How do we estimate our model?
We've already seen one solution: unigram mixture model + EM. This lecture is about another (better) solution: HMM + EM.

Simple Unigram Mixture Model
Model/topic 1, p(w|θ1), weight λ = 0.7: text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001
Model/topic 2, p(w|θ2), weight 1 − λ = 0.3: food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …
p(w | θ1, θ2) = λ·p(w|θ1) + (1 − λ)·p(w|θ2)
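
As a quick numeric sketch of this mixture in plain Python (the dictionaries hold the toy probabilities from the slide, truncated to the listed words; names like p_w_theta1 are ours):

    # Toy (truncated) word distributions for the two topics; lam is the mixing weight.
    p_w_theta1 = {"text": 0.2, "mining": 0.1, "association": 0.01, "clustering": 0.02, "food": 0.00001}
    p_w_theta2 = {"food": 0.25, "nutrition": 0.1, "healthy": 0.05, "diet": 0.02}
    lam = 0.7

    def p_mix(w):
        """p(w | theta1, theta2) = lam * p(w|theta1) + (1 - lam) * p(w|theta2)."""
        return lam * p_w_theta1.get(w, 0.0) + (1 - lam) * p_w_theta2.get(w, 0.0)

    print(p_mix("text"))  # 0.7 * 0.2 = 0.14
    print(p_mix("food"))  # 0.7 * 0.00001 + 0.3 * 0.25 = 0.075007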

Deficiency of the Simple Mixture Model
- Adjacent words are sampled independently
- Cannot ensure passage cohesion
- Cannot capture the dependency between neighboring words
Example: "We apply the text mining algorithm to the nutrition data to find patterns …"
Topic 1 = text mining, p(w|θ1); Topic 2 = health, p(w|θ2). Solution = ?

The Hidden Markov Model Solution
Basic idea: make the choice of model for a word depend on the choice of model for the previous word. Thus we model both the choice of model (the "state") and the words (the "observations").
O:  We apply the text mining algorithm to the nutrition data to find patterns …
S1: 1  1  1  1  1  1  1  1  1  1  1  1  1  ……
O:  We apply the text mining algorithm to the nutrition data to find patterns …
S2: 1  1  1  1  1  1  1  1  2  1  1  1  1  ……
Is P(O, S1) > P(O, S2)?

Another Example: Passage Retrieval
- Strategy I: Build a passage index. Needs pre-processing; query-independent boundaries; efficient.
- Strategy II: Extract relevant passages dynamically. No pre-processing; boundaries decided dynamically; slow.
States: relevant passage ("T") vs. background text ("B").
O:  We apply the text mining algorithm to the nutrition data to find patterns …
S1: B  T  B  T  T  T  B  B  T  T  B  T  T
O:  … ……… …  We apply the text mining … patterns. ………………
S2: BB…BB  T  T  T  T  T  …  T  BB…  BBBB

A Simple HMM for Recognizing Relevant Passages
Two states: Background (B) and Topic (T).
Parameters:
- Initial state prob: p(B) = 0.5; p(T) = 0.5
- State transition prob: p(B→B) = 0.8, p(B→T) = 0.2, p(T→B) = 0.5, p(T→T) = 0.5
- Output prob: e.g., P("the"|B) = 0.2, …, P("text"|T) = 0.125, …
P(w|B): the 0.2, a 0.1, we 0.01, to 0.02, …, text 0.0001, mining 0.00005, …
P(w|T): text 0.125, mining 0.1, association 0.1, clustering 0.05, …
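
One way to write these toy parameters down in NumPy, as a running example for the algorithms below (the names states, vocab, pi, A, B_out are ours, and the vocabulary is truncated to the words shown on the slide):

    import numpy as np

    states = ["B", "T"]          # 0 = Background, 1 = Topic
    vocab = ["the", "a", "we", "to", "text", "mining", "association", "clustering"]

    pi = np.array([0.5, 0.5])    # initial state probabilities p(B), p(T)
    A = np.array([[0.8, 0.2],    # p(B->B), p(B->T)
                  [0.5, 0.5]])   # p(T->B), p(T->T)
    # Output probabilities: rows = states, columns = words in `vocab`.
    # The rows do not sum to 1 because the full vocabulary is not shown on the slide.
    B_out = np.array([
        [0.2, 0.1, 0.01, 0.02, 0.0001, 0.00005, 0.0, 0.0],   # P(w|B)
        [0.0, 0.0, 0.0,  0.0,  0.125,  0.1,     0.1, 0.05],  # P(w|T)
    ])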

A General Definition of HMM
An HMM has N states {s1, …, sN}, M output symbols {v1, …, vM}, and parameters λ = (A, B, π):
- Initial state probability: π_i = P(q1 = s_i), with Σ_i π_i = 1
- State transition probability: a_ij = P(q_{t+1} = s_j | q_t = s_i), with Σ_j a_ij = 1
- Output probability: b_i(k) = P(o_t = v_k | q_t = s_i), with Σ_k b_i(k) = 1

How to “Generate” Text?
Same two-state HMM: P(B) = P(T) = 0.5; p(B→B) = 0.8, p(B→T) = 0.2, p(T→B) = 0.5, p(T→T) = 0.5
P(w|B): the 0.2, a 0.1, we 0.01, is 0.02, …, method 0.0001, mining 0.00005, …
P(w|T): text 0.125, mining 0.1, algorithm 0.1, clustering 0.05, …
To generate "the text mining algorithm is a clustering method": pick a start state, emit a word from that state's output distribution, transition to the next state, emit the next word, and so on. For example,
P(BTT…, "the text mining …") = p(B)·p(the|B) · p(T|B)·p(text|T) · p(T|T)·p(mining|T) · … = 0.5*0.2 * 0.2*0.125 * 0.5*0.1 * …
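
A minimal sampling sketch of this generative process, reusing the pi, A, B_out, vocab arrays above (because those output rows are truncated, the function renormalizes them so sampling is well defined; treat it as an illustration of the control flow, not of the exact slide numbers):

    def generate(pi, A, B_out, vocab, length, seed=0):
        """Sample a (state sequence, word sequence) pair from the HMM."""
        rng = np.random.default_rng(seed)
        B_norm = B_out / B_out.sum(axis=1, keepdims=True)  # renormalize truncated rows
        state_seq, words = [], []
        s = rng.choice(len(pi), p=pi)                      # draw the initial state
        for _ in range(length):
            state_seq.append(s)
            words.append(vocab[rng.choice(len(vocab), p=B_norm[s])])  # emit a word
            s = rng.choice(len(A), p=A[s])                 # transition to the next state
        return state_seq, words

    print(generate(pi, A, B_out, vocab, length=8))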

HMM as a Probabilistic Model
Time/index:              t1  t2  t3  t4  …   (sequential data)
Observation variables:   O1  O2  O3  O4  …   (random process)
Hidden state variables:  S1  S2  S3  S4  …
- Probability of observations with known state transitions (output prob.):
  P(o1 … oT | s1 … sT, λ) = Π_{t=1..T} P(o_t | s_t)
- Joint probability (complete likelihood; init. state distr. × state trans. prob. × output prob.):
  P(o1 … oT, s1 … sT | λ) = π_{s1} P(o1|s1) Π_{t=2..T} P(s_t | s_{t−1}) P(o_t | s_t)
- Probability of observations (incomplete likelihood):
  P(o1 … oT | λ) = Σ_{s1 … sT} P(o1 … oT, s1 … sT | λ)
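
The complete likelihood translates directly into code; here is a sketch that reuses the arrays above (word_idx, our own helper, maps words to columns of B_out):

    word_idx = {w: i for i, w in enumerate(vocab)}

    def complete_likelihood(obs_words, state_seq, pi, A, B_out):
        """P(O, S | lambda) = pi_{s1} b_{s1}(o1) * prod_t a_{s_{t-1} s_t} b_{s_t}(o_t)."""
        o = [word_idx[w] for w in obs_words]
        p = pi[state_seq[0]] * B_out[state_seq[0], o[0]]
        for t in range(1, len(o)):
            p *= A[state_seq[t - 1], state_seq[t]] * B_out[state_seq[t], o[t]]
        return p

    # "the text" with states B, T: 0.5*0.2 * 0.2*0.125 = 0.0025, matching the first factors above.
    print(complete_likelihood(["the", "text"], [0, 1], pi, A, B_out))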

Three Problems
1. Decoding – finding the most likely path
   Given: model, parameters, observations (data)
   Find: the most likely state sequence (path)
2. Evaluation – computing the observation likelihood
   Given: model, parameters, observations (data)
   Find: the likelihood of generating the data

Three Problems (cont.)
3. Training – estimating parameters
   Supervised
   Given: model structure, labeled data (data + state sequence)
   Find: parameter values
   Unsupervised
   Given: model structure, data (unlabeled)
   Find: parameter values

Problem I: Decoding/Parsing – Finding the Most Likely Path
This is the most common way of using an HMM (e.g., extraction, structure analysis) …

What’s the most likely path?
Observation: "the text mining algorithm is a clustering method"
Same two-state HMM: P(B) = P(T) = 0.5; p(B→B) = 0.8, p(B→T) = 0.2, p(T→B) = 0.5, p(T→T) = 0.5
P(w|B): the 0.2, a 0.1, we 0.01, is 0.02, …, method 0.0001, mining 0.00005, …
P(w|T): text 0.125, mining 0.1, algorithm 0.1, clustering 0.05, …

Viterbi Algorithm: An Example
P(w|B): the 0.2, a 0.1, we 0.01, …, text 0.001, mining 0.005, algorithm 0.0001, …
P(w|T): the 0.05, text 0.125, mining 0.1, algorithm 0.1, clustering 0.05, …
Transitions: p(B→B) = 0.8, p(B→T) = 0.2, p(T→B) = 0.5, p(T→T) = 0.5; P(B) = P(T) = 0.5
Observation (t = 1 2 3 4 …): "the text mining algorithm …"
VP(B): t=1: 0.5*0.2 (B);  t=2: 0.5*0.2*0.8*0.001 (B B);  t=3: …*0.5*0.005 (B T B);  t=4: …*0.5*0.0001 (B T T B)
VP(T): t=1: 0.5*0.05 (T); t=2: 0.5*0.2*0.2*0.125 (B T);  t=3: …*0.5*0.1 (B T T);    t=4: …*0.5*0.1 (B T T T)  ← winning path!

Viterbi Algorithm (Dynamic Programming)
Observation: the most likely path ending in state s_j at time t must extend the most likely path ending in some state s_i at time t−1. So define VP_t(j) = max_{q1…q_{t−1}} P(o1 … o_t, q1 … q_{t−1}, q_t = s_j | λ).
Algorithm (dynamic programming):
- Initialization: VP_1(j) = π_j b_j(o1)
- Recursion: VP_t(j) = [ max_i VP_{t−1}(i) a_ij ] b_j(o_t), recording the maximizing i for backtracking
- Termination: max_j VP_T(j); follow the recorded pointers backwards to recover the most likely path
Complexity: O(TN²)
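
A compact dynamic-programming sketch of this recursion, continuing the NumPy arrays and word_idx mapping above (names are ours; for realistic sequence lengths one would switch to log probabilities to avoid underflow):

    def viterbi(obs_words, pi, A, B_out, word_idx):
        """Return (most likely state sequence, its probability) for a word sequence."""
        o = [word_idx[w] for w in obs_words]
        T, N = len(o), len(pi)
        vp = np.zeros((T, N))               # vp[t, j] = best score of a path ending in state j at time t
        back = np.zeros((T, N), dtype=int)  # argmax pointers for backtracking
        vp[0] = pi * B_out[:, o[0]]
        for t in range(1, T):
            scores = vp[t - 1][:, None] * A          # scores[i, j] = vp[t-1, i] * a_ij
            back[t] = scores.argmax(axis=0)
            vp[t] = scores.max(axis=0) * B_out[:, o[t]]
        path = [int(vp[-1].argmax())]
        for t in range(T - 1, 0, -1):                # trace the pointers backwards
            path.append(int(back[t, path[-1]]))
        return list(reversed(path)), vp[-1].max()

    print(viterbi(["the", "text", "mining"], pi, A, B_out, word_idx))  # expect states B, T, T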

Problem II: Evaluation – Computing the Data Likelihood
Another use of an HMM, e.g., as a generative model for classification. Also related to Problem III (parameter estimation).

Data Likelihood: p(O|λ)
Observation: "the text mining algorithm …" over the same B/T trellis.
In general, enumerate all state paths (BB…B, BT…B, …, TT…T) and sum their joint probabilities.
Complexity of this naïve approach? There are N^T paths, so it is exponential in the sequence length.

The Forward Algorithm
Define α_t(i) = P(o1 … o_t, q_t = s_i | λ): the probability of generating o1 … o_t and ending in state s_i at time t.
- Initialization: α_1(i) = π_i b_i(o1)
- Induction: α_{t+1}(j) = [ Σ_i α_t(i) a_ij ] b_j(o_{t+1})
The data likelihood is P(O|λ) = Σ_i α_T(i).
Complexity: O(TN²)

Forward Algorithm: Example
Observation: "the text mining algorithm" over the same B/T trellis.
α1(B) = 0.5 · p("the"|B);  α1(T) = 0.5 · p("the"|T)
α2(B) = [α1(B)·0.8 + α1(T)·0.5] · p("text"|B);  α2(T) = [α1(B)·0.2 + α1(T)·0.5] · p("text"|T)
……
P("the text mining algorithm") = α4(B) + α4(T)
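
A sketch of the forward recursion in the same running example (our own function name; the @ operator is the matrix product):

    def forward(obs_words, pi, A, B_out, word_idx):
        """alpha[t, i] = P(o_1..o_t, q_t = s_i | lambda); returns (alpha, data likelihood)."""
        o = [word_idx[w] for w in obs_words]
        T, N = len(o), len(pi)
        alpha = np.zeros((T, N))
        alpha[0] = pi * B_out[:, o[0]]                      # initialization
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B_out[:, o[t]]  # induction step
        return alpha, alpha[-1].sum()                       # P(O | lambda) = sum_i alpha_T(i)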

The Backward Algorithm
Define β_t(i) = P(o_{t+1} … o_T | q_t = s_i, λ): the probability of generating o_{t+1} … o_T starting from state s_i at time t (with o1 … o_t already generated).
- Initialization: β_T(i) = 1
- Induction: β_t(i) = Σ_j a_ij b_j(o_{t+1}) β_{t+1}(j)
The data likelihood is P(O|λ) = Σ_i α_t(i) β_t(i), for any t.
Complexity: O(TN²)

Backward Algorithm: Example
β4(B) = 1;  β4(T) = 1
β3(B) = 0.8·p("algorithm"|B)·β4(B) + 0.2·p("algorithm"|T)·β4(T)
β3(T) = 0.5·p("algorithm"|B)·β4(B) + 0.5·p("algorithm"|T)·β4(T)
……
P("the text mining algorithm") = α1(B)·β1(B) + α1(T)·β1(T) = α2(B)·β2(B) + α2(T)·β2(T)
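
The matching backward pass, again as a sketch on the same arrays:

    def backward(obs_words, pi, A, B_out, word_idx):
        """beta[t, i] = P(o_{t+1}..o_T | q_t = s_i, lambda)."""
        o = [word_idx[w] for w in obs_words]
        T, N = len(o), len(pi)
        beta = np.ones((T, N))                                # initialization: beta_T(i) = 1
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B_out[:, o[t + 1]] * beta[t + 1])  # induction step
        return beta

    # Sanity check: (alpha[t] * beta[t]).sum() gives the same data likelihood for every t.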

Problem III: Training – Estimating Parameters
Where do we get the probability values for all the parameters? Supervised vs. unsupervised.

Supervised Training
Given:
1. N – the number of states, e.g., 2 (s1 and s2)
2. V – the vocabulary, e.g., V = {a, b}
3. O – observations, e.g., O = aaaaabbbbb
4. State transitions, e.g., S = 1121122222
Task: estimate the following parameters
1. π1, π2
2. a11, a12, a21, a22
3. b1(a), b1(b), b2(a), b2(b)
Maximum-likelihood estimates are just relative counts:
π1 = 1/1 = 1; π2 = 0/1 = 0
a11 = 2/4 = 0.5; a12 = 2/4 = 0.5; a21 = 1/5 = 0.2; a22 = 4/5 = 0.8
b1(a) = 4/4 = 1.0; b1(b) = 0/4 = 0; b2(a) = 1/6 = 0.167; b2(b) = 5/6 = 0.833
That is: P(s1) = 1, P(s2) = 0; P(a|s1) = 1, P(b|s1) = 0; P(a|s2) = 0.167, P(b|s2) = 0.833
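
A counting sketch of these estimates (function name and the 1-based-to-0-based state index shift are ours):

    def supervised_estimate(obs, state_labels, n_states, vocab):
        """ML estimates of (pi, A, B) from one labeled sequence by relative counting."""
        w_idx = {v: k for k, v in enumerate(vocab)}
        s = [q - 1 for q in state_labels]          # the slide labels states 1 and 2
        pi_hat = np.zeros(n_states)
        A_hat = np.zeros((n_states, n_states))
        B_hat = np.zeros((n_states, len(vocab)))
        pi_hat[s[0]] = 1.0
        for t in range(len(s) - 1):
            A_hat[s[t], s[t + 1]] += 1             # count transitions
        for t, w in enumerate(obs):
            B_hat[s[t], w_idx[w]] += 1             # count emissions
        A_hat /= A_hat.sum(axis=1, keepdims=True)
        B_hat /= B_hat.sum(axis=1, keepdims=True)
        return pi_hat, A_hat, B_hat

    print(supervised_estimate("aaaaabbbbb", [1, 1, 2, 1, 1, 2, 2, 2, 2, 2], 2, ["a", "b"]))
    # pi = [1, 0]; A = [[0.5, 0.5], [0.2, 0.8]]; B = [[1, 0], [0.167, 0.833]]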

Unsupervised Training
Given:
1. N – the number of states, e.g., 2 (s1 and s2)
2. V – the vocabulary, e.g., V = {a, b}
3. O – observations, e.g., O = aaaaabbbbb
4. State sequence – NOT given (the states are hidden)
Task: estimate the same parameters
1. π1, π2
2. a11, a12, a21, a22
3. b1(a), b1(b), b2(a), b2(b)
How could this be possible? Maximum likelihood: λ* = argmax_λ P(O|λ) = argmax_λ Σ_q P(O, q|λ)

Intuition
Start from some parameter guess λ. With O = aaaaabbbbb, consider every possible hidden state path and its probability under λ:
q1 = 1111111111 → P(O, q1|λ)
q2 = 1111112211 → P(O, q2|λ)
…
qK = 2222222222 → P(O, qK|λ)
Use these path probabilities (as soft counts) to compute a new λ′, then repeat. But computing P(O, qk|λ) for all K paths explicitly is expensive …

Baum-Welch Algorithm
Basic "counters":
- γ_t(i) = P(q_t = s_i | O, λ): being at state s_i at time t
- ξ_t(i, j) = P(q_t = s_i, q_{t+1} = s_j | O, λ): being at state s_i at time t and at state s_j at time t+1
Computation of counters (from the forward and backward variables):
- γ_t(i) = α_t(i) β_t(i) / Σ_k α_t(k) β_t(k)
- ξ_t(i, j) = α_t(i) a_ij b_j(o_{t+1}) β_{t+1}(j) / Σ_{k,l} α_t(k) a_kl b_l(o_{t+1}) β_{t+1}(l)
Complexity: O(N²) per time step

Baum-Welch Algorithm (cont.)
Updating formulas:
- π_i′ = γ_1(i)
- a_ij′ = Σ_{t=1..T−1} ξ_t(i, j) / Σ_{t=1..T−1} γ_t(i)
- b_j′(k) = Σ_{t: o_t = v_k} γ_t(j) / Σ_{t=1..T} γ_t(j)
Overall complexity for each iteration: O(TN²)
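
A sketch of one EM iteration for a single observation sequence, reusing the forward and backward functions above (it assumes every state gets some posterior mass, and skips the smoothing and multi-sequence bookkeeping a real implementation would need):

    def baum_welch_step(obs_words, pi, A, B_out, word_idx):
        """One Baum-Welch iteration: E-step computes gamma/xi, M-step re-estimates (pi, A, B)."""
        o = [word_idx[w] for w in obs_words]
        alpha, likelihood = forward(obs_words, pi, A, B_out, word_idx)
        beta = backward(obs_words, pi, A, B_out, word_idx)
        T, N = len(o), len(pi)
        gamma = alpha * beta / likelihood          # gamma[t, i] = P(q_t = s_i | O, lambda)
        xi = np.zeros((T - 1, N, N))               # xi[t, i, j] = P(q_t = s_i, q_{t+1} = s_j | O, lambda)
        for t in range(T - 1):
            xi[t] = alpha[t][:, None] * A * B_out[:, o[t + 1]] * beta[t + 1] / likelihood
        new_pi = gamma[0]
        new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        new_B = np.zeros_like(B_out)
        for t in range(T):
            new_B[:, o[t]] += gamma[t]             # soft emission counts
        new_B /= gamma.sum(axis=0)[:, None]
        return new_pi, new_A, new_B, likelihood

Iterating baum_welch_step until the returned likelihood stops improving is the usual stopping rule.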

An HMM for Information Extraction (Research Paper Headers)

What You Should Know
- Definition of an HMM
- What are the three problems associated with an HMM?
- Know how the following algorithms work: the Viterbi algorithm, the Forward & Backward algorithms
- Know the basic idea of the Baum-Welch algorithm

Readings
- Read [Rabiner 89], Sections I, II, III
- Read the "brief note"