Hidden Markov Models in NLP

Hidden Markov Models in NLP Leah Spontaneo

Overview
Introduction; Deterministic Model; Statistical Model; Speech Recognition; Discrete Markov Processes; Hidden Markov Model; Elements of HMMs; Output Sequence; Basic HMM Problems; HMM Operations; Forward-Backward; Viterbi Algorithm; Baum-Welch; Different Types of HMMs; Speech Recognition using HMMs; Isolated Word Recognition; Limitations of HMMs

Introduction
Real-world processes generally produce observable outputs that can be characterized as signals. Signals can be discrete or continuous, the signal source can be stationary or nonstationary, and the signals themselves can be pure or corrupted.

Introduction
Signals are characterized by signal models, which let us process the signal to provide a desired output and learn about the signal source without having the source available, and which work well in practice. Signal models can be deterministic or statistical.

Deterministic Model
Deterministic models exploit known, specific properties of the signal (e.g., a sine wave, a sum of exponentials, chaos theory). Specification of the signal is generally straightforward: determine or estimate the values of the parameters of the signal model.

Statistical Model
Statistical models try to characterize the statistical properties of the signal (e.g., Gaussian processes, Markov processes, hidden Markov processes). The signal is characterized as a parametric random process, and the parameters of the stochastic process can be determined/estimated in a precise, well-defined manner.

Speech Recognition
The basic theory of hidden Markov models used in speech recognition was originally published in the 1960s by Baum and colleagues. It was implemented in speech processing applications in the 1970s by Baker at CMU and by Jelinek at IBM.

Discrete Markov Processes
A discrete Markov process contains a set of N distinct states: S_1, S_2, ..., S_N. At discrete time intervals the state changes, and the state at time t is q_t. For a first-order Markov chain, the probabilistic description is P[q_t = S_j | q_{t-1} = S_i, q_{t-2} = S_k, ...] = P[q_t = S_j | q_{t-1} = S_i].

Discrete Markov Processes
The processes considered are independent of time, leading to fixed transition probabilities a_ij = P[q_t = S_j | q_{t-1} = S_i], 1 ≤ i, j ≤ N, with a_ij ≥ 0 and Σ_j a_ij = 1. The previous stochastic process is considered an observable Markov model: the output process is the sequence of states at each time interval, and each state corresponds to an observable event.
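As a concrete illustration (not from the original slides), the sketch below samples a state sequence from a small discrete Markov chain; the three-state transition matrix and initial distribution are hypothetical example values, with each row of A summing to 1 as required.

```python
import numpy as np

# Hypothetical 3-state chain: row i of A is the distribution P(next state | current state i).
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
pi = np.array([0.6, 0.3, 0.1])   # initial state distribution

def sample_chain(A, pi, T, rng=np.random.default_rng(0)):
    """Draw a length-T state sequence q_1..q_T from the Markov chain (A, pi)."""
    states = [rng.choice(len(pi), p=pi)]
    for _ in range(T - 1):
        states.append(rng.choice(A.shape[1], p=A[states[-1]]))
    return states

print(sample_chain(A, pi, T=10))   # e.g. a list of 10 state indices
```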

Hidden Markov Model
A Markov process determines future probabilities based on recent values. A hidden Markov model is a Markov process whose state is unobservable. HMMs must have three sets of probabilities: initial probabilities, transition probabilities, and emission probabilities.
-The genome includes unobserved states (CGI or baseline), so HMMs are a natural model to consider.
-States are inferred from the base-to-base transitions observed along the genome.

Hidden Markov Model
The HMM includes the case where the observation is a probabilistic function of the state: a doubly embedded stochastic process with an underlying unobservable stochastic process. The unobservable process is only observed through a set of stochastic processes producing the observations.

Elements of HMMs
N, the number of states in the model. Although hidden, there is physical significance attached to the states of the model. Individual states are denoted S = {S_1, S_2, ..., S_N}, and the state at time t is denoted q_t.
M, the number of distinct observation symbols per state.

Elements of HMMs
The observation symbols correspond to the output of the system being modeled; individual symbols are denoted V = {v_1, v_2, ..., v_M}.
The state transition probability distribution is A = {a_ij}, where a_ij = P[q_t = S_j | q_{t-1} = S_i], 1 ≤ i, j ≤ N. In the special case where any state can reach any other state, a_ij > 0 for all i, j.

Elements of HMMs
The observation probability distribution in state j is B = {b_j(k)}, where b_j(k) = P[v_k at t | q_t = S_j], 1 ≤ j ≤ N, 1 ≤ k ≤ M.
The initial state distribution is π = {π_i}, where π_i = P[q_1 = S_i], 1 ≤ i ≤ N.
With the right values for N, M, A, B, and π, the HMM can generate an output sequence O = O_1 O_2 ... O_T, where each O_t is an observation symbol in V and T is the total number of observations in the sequence.
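To make the notation concrete, here is a small sketch (not from the slides) that packages an HMM λ = (A, B, π) as NumPy arrays; the two-state, three-symbol numbers are purely illustrative.

```python
import numpy as np

# Hypothetical HMM with N = 2 hidden states and M = 3 observation symbols.
A = np.array([[0.8, 0.2],          # A[i, j] = a_ij = P(q_{t+1} = S_j | q_t = S_i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],     # B[j, k] = b_j(k) = P(v_k | q_t = S_j)
              [0.1, 0.3, 0.6]])
pi = np.array([0.7, 0.3])          # pi[i] = P(q_1 = S_i)

# Sanity checks: every distribution must sum to 1.
assert np.allclose(A.sum(axis=1), 1)
assert np.allclose(B.sum(axis=1), 1)
assert np.isclose(pi.sum(), 1)
```

These three arrays serve as the model λ in the forward, Viterbi, and Baum-Welch sketches further below.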

Output Sequence
1) Choose an initial state q_1 = S_i according to π.
2) Set t = 1.
3) Choose O_t = v_k according to the emission distribution for S_i, b_i(k).
4) Transition to a new state q_{t+1} = S_j according to the transition distribution for S_i, a_ij.
5) Set t = t + 1 and go back to step 3 if t < T; otherwise end the sequence.
This generative procedure (see the sketch after this slide) has been used successfully for acoustic modeling in speech recognition and has been applied to language modeling and POS tagging.
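A minimal sketch of the generation procedure above (illustrative; it assumes the A, B, and pi arrays from the previous sketch, with observations encoded as integer symbol indices):

```python
import numpy as np

def sample_hmm(A, B, pi, T, rng=np.random.default_rng(0)):
    """Generate a (state sequence, observation sequence) pair of length T from the HMM (A, B, pi)."""
    q = rng.choice(len(pi), p=pi)                     # step 1: initial state drawn from pi
    states, obs = [], []
    for _ in range(T):                                # steps 2-5: emit a symbol, then transition
        states.append(q)
        obs.append(rng.choice(B.shape[1], p=B[q]))    # O_t drawn from b_q(.)
        q = rng.choice(A.shape[1], p=A[q])            # q_{t+1} drawn from a_q(.)
    return states, obs

# states, obs = sample_hmm(A, B, pi, T=20)   # with A, B, pi defined as in the previous sketch
```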

Output Sequence
The procedure can be used to generate a sequence of observations and to model how an observation sequence was produced by an HMM. The cost of determining the probability that the system is in state S_i at time t is O(tN²).

Basic HMM Problems
HMMs can find the state sequence that most likely produced a given output. The sequence of states is most efficiently computed using the Viterbi algorithm. Maximum-likelihood estimates of the probability sets are determined using the Baum-Welch algorithm.
-HMMs are usually associated with three main problems: compute the probability that a particular sequence is associated with a given model; find the state sequence that would most likely produce a given output sequence; and determine the maximum-likelihood estimate of the probabilities based on a given output sequence.

HMM Operations
Calculating P(q_t = S_i | O_1 O_2 ... O_t) uses the forward-backward algorithm. Computing Q* = argmax_Q P(Q | O) requires the Viterbi algorithm. Learning λ* = argmax_λ P(O | λ) uses the Baum-Welch algorithm. The complexity of all three algorithms is O(TN²), where T is the length of the observation sequence and N is the number of states.

Evaluation
Evaluation scores how well a given model matches an observation sequence. This is extremely useful when considering which model, among many, best represents the set of observations.

Forward-Backward
Given observations O_1 O_2 ... O_T, define α_t(i) = P(O_1 O_2 ... O_t, q_t = S_i | λ), the probability of seeing the first t observations and being in state S_i at time t.
Initialization: α_1(i) = π_i b_i(O_1).
Induction: α_{t+1}(j) = [Σ_i α_t(i) a_ij] b_j(O_{t+1}).
We can now cheaply compute P(O_1 O_2 ... O_t) = Σ_i α_t(i) and P(q_t = S_i | O_1 O_2 ... O_t) = α_t(i) / Σ_j α_t(j).

Forward-Backward
The key is that, since there are only N states, all possible state sequences merge into these N nodes at each time step. At t = 1, only the values of α_1(i), 1 ≤ i ≤ N, require calculation. For t = 2, 3, ..., T we calculate α_t(j), 1 ≤ j ≤ N, where each calculation involves only the N previous values α_{t-1}(i).
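A minimal sketch of the forward recursion described above (illustrative; it assumes the discrete A, B, pi arrays defined earlier and an observation sequence obs given as integer symbol indices):

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward pass: alpha[t, i] = P(O_1 .. O_{t+1}, q_{t+1} = S_i | lambda), with 0-based t."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # alpha_1(i) = pi_i * b_i(O_1)
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]   # sum over previous states, then emit
    return alpha

# P(O | lambda) is the sum of the final alphas; the filtered state posterior at time t
# is alpha[t] / alpha[t].sum().
# likelihood = forward(A, B, pi, obs)[-1].sum()
```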

State Sequence
There is no ‘correct’ state sequence except for degenerate models. An optimality criterion is used instead to determine the best possible outcome. There are several reasonable optimality criteria, so the chosen criterion depends on the intended use of the uncovered sequence. State-sequence decoding is used for continuous speech recognition.

Viterbi Algorithm
The Viterbi algorithm finds the most likely sequence of hidden states given the known observations, using dynamic programming. It makes three assumptions about the model: the model must be a state machine; transitions between states are marked by a metric; and events must be cumulative over a path. Path history must be kept in memory to recover the most probable path at the end.
-Similar to the forward algorithm, which computes the probability that a set of observed events was generated by the model.
-Designed in 1967 by Andrew Viterbi to decode convolutional codes in the presence of noise on digital communication links.
-The model must contain a finite number of states and be in exactly one state at any given point in time.
-Time is computed over each event that passes.
-All possible previous paths are added to the new state metric based on the observation, and the best path is then chosen.

Viterbi Algorithm
It is used for speech recognition, where the hidden state is part of word formation: given a specific signal, it deduces the most probable word based on the model. To find the best state sequence Q = {q_1, q_2, ..., q_T} for the observation sequence O = {O_1 O_2 ... O_T}, we define δ_t(i) = max_{q_1,...,q_{t-1}} P[q_1 q_2 ... q_{t-1}, q_t = S_i, O_1 O_2 ... O_t | λ].

Viterbi Algorithm
δ_t(i) is the best score along a single path at time t, accounting for the first t observations and ending in state S_i. The inductive step is δ_{t+1}(j) = [max_i δ_t(i) a_ij] b_j(O_{t+1}).

Viterbi Algorithm
This is similar to the forward calculation of the forward-backward algorithm. The major difference is the maximization over the previous states instead of summing the probabilities.
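A minimal Viterbi sketch in the same style as the forward pass (illustrative; it assumes the discrete A, B, pi arrays and an integer-coded obs sequence; a practical implementation would work in log space to avoid numerical underflow):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Return the most likely hidden state sequence for obs under the HMM (A, B, pi)."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))           # delta[t, j]: best path score ending in state j at time t
    psi = np.zeros((T, N), dtype=int)  # back-pointers to the best previous state
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A        # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)            # best previous state for each j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]              # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1]
```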

Optimizing Parameters
A training sequence is used to train the HMM and adjust the model parameters. The training problem is crucial for most HMM applications: it allows us to optimally adapt the model parameters to observed training data, creating good models for real data.

Baum-Welch
Baum-Welch is a special case of expectation maximization (EM). EM has two main steps: compute the expectation of the log-likelihood using the current estimates of the latent variables, then maximize the expected log-likelihood using the values from the first step. Baum-Welch is a form of generalized EM, which allows the algorithm to converge to a local optimum.
-Parameters need to be trained on a sequence of symbols to determine the final probabilities the model will use to find the Viterbi path.

Baum-Welch
Also known as the forward-backward algorithm, Baum-Welch uses two main steps: the forward and backward probabilities are calculated for each state in the HMM; then the transition and emission probabilities are re-estimated, divided by the probability of the whole model based on the previous step.
-It can produce maximum-likelihood and posterior estimates for all parameters, given only initial emission probabilities to work with.
-It starts by assigning initial probabilities to all parameters, then iterates until convergence, adjusting the probability of each parameter in accordance with the training set being scanned.

Baum-Welch
Cons: there are lots of local minima. Pros: the local minima are often adequate models of the data. EM requires the number of states to be given. Sometimes HMMs require some links to be zero; for this, set a_ij = 0 in the initial estimate λ^(0).
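A compact sketch of one Baum-Welch re-estimation step (illustrative only; it reuses the forward function from the earlier sketch and assumes a single discrete observation sequence obs of integer symbol indices):

```python
import numpy as np

def backward(A, B, obs):
    """Backward pass: beta[t, i] = P(O_{t+1} .. O_T | q_t = S_i), with beta_T(i) = 1."""
    N, T = A.shape[0], len(obs)
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

def baum_welch_step(A, B, pi, obs):
    """One EM re-estimation of (A, B, pi) from a single observation sequence."""
    obs = np.asarray(obs)
    alpha, beta = forward(A, B, pi, obs), backward(A, B, obs)
    likelihood = alpha[-1].sum()                              # P(O | lambda)
    gamma = alpha * beta / likelihood                         # gamma[t, i] = P(q_t = S_i | O)
    xi = (alpha[:-1, :, None] * A[None, :, :] *               # xi[t, i, j] = P(q_t = S_i, q_{t+1} = S_j | O)
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood
    new_pi = gamma[0]
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]  # expected transitions / expected visits
    new_B = np.stack([gamma[obs == k].sum(axis=0) for k in range(B.shape[1])], axis=1)
    new_B /= gamma.sum(axis=0)[:, None]                       # expected emissions / expected visits
    return new_A, new_B, new_pi
```

Iterating this step until the likelihood stops improving gives the locally optimal maximum-likelihood parameter estimates.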

Different Types of HMM
Left-right model: as time increases the state index increases or stays the same; no transitions are allowed to states whose indices are lower than the current state (see the sketch after this slide). Cross-coupled model: two parallel left-right chains; it obeys the left-right constraints on the transition probabilities but provides more flexibility.
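As an illustration (not from the slides), a left-right transition matrix is upper triangular, so a_ij = 0 whenever j < i; the numbers below are hypothetical:

```python
import numpy as np

# Hypothetical 4-state left-right model: transitions only to the same or a higher state index.
A_left_right = np.array([
    [0.6, 0.3, 0.1, 0.0],
    [0.0, 0.5, 0.4, 0.1],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 1.0],   # final (absorbing) state
])
assert np.allclose(np.tril(A_left_right, k=-1), 0)   # no backward transitions
```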

Speech Recognition using HMMs
Feature analysis: a spectral/temporal analysis of the speech signal is performed to provide the observation vectors used to train the HMMs.
Unit matching: each unit is characterized by an HMM with parameters estimated from speech data; this stage provides the likelihoods of matches of all sequences of speech recognition units to the unknown input.

Isolated Word Recognition
There is a vocabulary of V words to be recognized, each modeled by a distinct HMM. For each word there is a training set of K occurrences of the spoken word, where each occurrence is an observation sequence. For each word v, build an HMM by estimating the model parameters that optimize the likelihood of that word's training-set observations.

Isolated Word Recognition
For each unknown word to be recognized, measure the observation sequence O = {O_1 O_2 ... O_T} via feature analysis of the speech corresponding to the word. Then calculate the model likelihoods for all word models and select the word with the highest model likelihood, as in the sketch below.
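A minimal sketch of this recognition rule (illustrative; it assumes one trained discrete HMM (A, B, pi) per vocabulary word and reuses the forward function above; real speech recognizers typically use continuous observation densities rather than discrete symbols):

```python
def recognize(word_models, obs):
    """Pick the vocabulary word whose HMM assigns the observation sequence the highest likelihood."""
    scores = {word: forward(A, B, pi, obs)[-1].sum()   # P(O | lambda_word)
              for word, (A, B, pi) in word_models.items()}
    return max(scores, key=scores.get)

# word_models = {"yes": (A_yes, B_yes, pi_yes), "no": (A_no, B_no, pi_no)}   # hypothetical trained models
# print(recognize(word_models, obs))
```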

Limitations of HMMs
The assumption that successive observations are independent, so that the probability of an observation sequence can be written as the product of the probabilities of the individual observations. The assumption that the distributions of individual observation parameters are well represented as mixtures of Gaussian or autoregressive densities. The assumption of being in a given state at each time interval, which is inappropriate for speech sounds that can extend through several states.

Questions?

References
Vogel, S., et al. “HMM-based word alignment in statistical translation.” In Proceedings of the 16th Conference on Computational Linguistics (1996), pp. 836-841.
Rabiner, L. R. “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition.” Proceedings of the IEEE (1989), pp. 257-286.
Moore, A. W. “Hidden Markov Models.” Carnegie Mellon University. https://wiki.cse.yorku.ca/course_archive/2010-11/F/6390/_media/hmm14.pdf
http://en.wikipedia.org/wiki/Hidden_Markov_model