Angelo Dalli Department of Intelligent Computing Systems

Presentation transcript:

Hidden Markov Models. Angelo Dalli, Department of Intelligent Computing Systems

Overview: Definition; Simple Example; 3 Basic Problems; Forward Algorithm; Viterbi Algorithm; Baum-Welch Algorithm; Example Application; Conclusion

Definition: Markov Model $H = \{A, B, N, P\}$, where…

Elements
- Set of $N$ states $\{S_1, \dots, S_N\}$
- $M$ distinct observation symbols $V = \{v_1, \dots, v_M\}$ per state: our "finite grammar". We assume discrete symbols here, but they could be continuous.
- State transition probability matrix $A$
- Observation probability distribution $B = \{b_j(k)\}$, with $b_j(k) = P(O_t = v_k \mid q_t = S_j)$, $1 \le k \le M$, $1 \le j \le N$, where $O_t$ and $q_t$ represent the observation and state at time $t$ respectively. Again, this could be a continuous pdf modeled by something like Gaussian mixtures.
- Initial state distribution $P = \{p_i\}$, with $p_i = P(q_1 = S_i)$, $1 \le i \le N$

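Matrix of state transition probabilities: $A = \{a_{ij}\}$, where $a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)$, $1 \le i, j \le N$, with $a_{ij} \ge 0$ and $\sum_{j=1}^{N} a_{ij} = 1$ for every $i$.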
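As a minimal sketch of the model elements, the parameters can be held in plain nested lists and checked for the stochastic constraints. The 3-state, 2-symbol numbers and the helper names `is_distribution` and `validate_hmm` are illustrative assumptions, not part of the slides.

```python
# Hypothetical 3-state, 2-symbol model; every number here is illustrative.
A = [[0.6, 0.3, 0.1],   # A[i][j] = probability of moving from state i to state j
     [0.2, 0.5, 0.3],
     [0.1, 0.1, 0.8]]
B = [[0.9, 0.1],        # B[j][k] = probability of emitting symbol k in state j
     [0.5, 0.5],
     [0.2, 0.8]]
P = [0.5, 0.3, 0.2]     # P[i] = probability of starting in state i


def is_distribution(row, tol=1e-9):
    """True if the entries are non-negative and sum to 1 (within tol)."""
    return all(x >= 0.0 for x in row) and abs(sum(row) - 1.0) <= tol


def validate_hmm(A, B, P):
    """Check shapes and that P and every row of A and B is a distribution."""
    n = len(P)
    if len(A) != n or len(B) != n or any(len(row) != n for row in A):
        return False
    if len({len(row) for row in B}) != 1:  # all states share one symbol set
        return False
    return is_distribution(P) and all(is_distribution(r) for r in A + B)
```

Checking the constraints up front is cheap and catches the most common data-entry mistake, a row that does not sum to 1.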

Markov Chain with 5 states

Observable vs. Hidden
- Observable: the output state is completely determined at each instant of time. For example, the output at time t is the state itself, as in a 2-state heads/tails coin-toss model.
- Hidden: states must be inferred from observations. In other words, the observation is a probabilistic function of the state.

Simple Example: Urn and Ball
- N urns sit in a room; each one contains balls of M distinct colors.
- A magic genie selects an urn at random, based on some probability distribution.
- The genie selects a ball at random from this urn, tells us the color, and puts it back.
- She/he then moves on to the next urn based on a second probability distribution, and repeats the process.

Obvious Markov Model here:
- Each urn is a state.
- The genie's initial selection is based on the initial state probability, P.
- The probability of selecting a certain color is determined by the observation probability matrix, B.
- The likelihood of the "next" urn is determined by the matrix of transition probabilities, A.
- At the end we have an observation sequence, for example O = {red, blue, green, red, green, magenta}.

Where's the genie? If the genie's location is known at each time instant t, then the model is observable. Otherwise, this is a hidden model, and we can only infer the state at time t, given our string of observations and the known probabilities.
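The urn-and-ball process can be simulated directly. The sketch below assumes the parameters are nested lists A, B, P as in the definition; the function name `sample_hmm` and the example numbers are hypothetical. It returns both the urn sequence (normally hidden) and the color indices (what we are told).

```python
import random


def sample_hmm(A, B, P, T, rng):
    """Draw T steps: returns (hidden state indices, observed symbol indices)."""
    def draw(dist):
        # Pick an index with probability proportional to dist[index].
        return rng.choices(range(len(dist)), weights=dist, k=1)[0]

    states, observations = [], []
    state = draw(P)                           # genie picks the first urn
    for _ in range(T):
        states.append(state)
        observations.append(draw(B[state]))   # genie draws a ball, reports color
        state = draw(A[state])                # genie moves to the next urn
    return states, observations


# Illustrative 2-urn, 3-color setup (all numbers are assumptions).
colors = ["red", "blue", "green"]
urn_A = [[0.7, 0.3], [0.4, 0.6]]
urn_B = [[0.8, 0.1, 0.1], [0.1, 0.3, 0.6]]
urn_P = [0.5, 0.5]
urns, balls = sample_hmm(urn_A, urn_B, urn_P, 6, random.Random(0))
observed_colors = [colors[k] for k in balls]
```

Discarding `urns` and keeping only `observed_colors` reproduces the hidden setting: the observer sees colors but not which urn produced them.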

Three Basic Problems for HMMs (1): Given an observation sequence $O = O_1 O_2 \dots O_T$ and a Markov model $H = \{A, B, P\}$, how do we (efficiently) compute $P(O \mid H)$? Given several model choices, this can be used to determine the most appropriate one.

Three Basic Problems for HMMs (2): Given an observation sequence $O = O_1 O_2 \dots O_T$ and a Markov model $H = \{A, B, P\}$, find the optimal state sequence $q = q_1 \dots q_T$. An optimality criterion needs to be chosen; the interest is in finding the "correct" state sequence.

Three Basic Problems for HMMs (3): Given an observation sequence $O = O_1 O_2 \dots O_T$, estimate the parameters of the model $H = \{A, B, P\}$ that maximize $P(O \mid H)$. The observation sequence is used here to train the model, adapting it to best fit the observed phenomenon.

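Problem 1: compute $P(O \mid H)$. Straightforward (bad) solution: for a given state sequence $q = \{q_1, \dots, q_T\}$ we have $P(O \mid q, H) = b_{q_1}(O_1)\, b_{q_2}(O_2) \cdots b_{q_T}(O_T)$. The probability of the sequence $q$ occurring is $P(q \mid H) = p_{q_1}\, a_{q_1 q_2}\, a_{q_2 q_3} \cdots a_{q_{T-1} q_T}$.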

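Bad solution, continued: The joint probability of $O$ and $q$ is the product of the two: $P(O, q \mid H) = P(O \mid q, H)\, P(q \mid H)$. The probability of $O$ is the sum of $P(O, q \mid H)$ over the set $Q$ of all possible state sequences: $P(O \mid H) = \sum_{q \in Q} P(O \mid q, H)\, P(q \mid H)$.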

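No Good: Computation for this direct method is $O(2T \cdot N^T)$. That is not reasonable even for small values of $N$ and $T$: for $N = 5$ and $T = 100$ it is on the order of $10^{72}$ operations. We need to find an efficient way.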
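To make the cost of the direct method concrete, here is a minimal sketch that enumerates every state sequence and sums the joint probabilities. The function name `brute_force_likelihood` and the nested-list layout for A, B, P are assumptions; the loop over `product(...)` visits all N^T sequences, which is exactly why this only works for toy sizes.

```python
from itertools import product


def brute_force_likelihood(A, B, P, obs):
    """P(O | H) by summing P(O, q | H) over all N**T state sequences q."""
    n = len(P)
    total = 0.0
    for q in product(range(n), repeat=len(obs)):
        # Joint probability of this state path and the observations.
        prob = P[q[0]] * B[q[0]][obs[0]]
        for t in range(1, len(obs)):
            prob *= A[q[t - 1]][q[t]] * B[q[t]][obs[t]]
        total += prob
    return total
```

With 2 states and 10 observations this already visits 1,024 paths; each extra observation multiplies the work by N.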

Problem 1: Compute $P(O \mid H)$. Efficient solution: the forward algorithm.

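The Forward Algorithm: Let $f_t(i) = P(O_1 \dots O_t, q_t = S_i \mid H)$. Initialization: $f_1(i) = p_i\, b_i(O_1)$, $1 \le i \le N$. Induction: $f_{t+1}(j) = \left[ \sum_{i=1}^{N} f_t(i)\, a_{ij} \right] b_j(O_{t+1})$, $1 \le t \le T-1$, $1 \le j \le N$.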

Forward Algorithm: Finally, $P(O \mid H) = \sum_{i=1}^{N} f_T(i)$. This requires $O(N^2 T)$ calculations, much less than the direct method.
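The forward recursion fits in a few lines. This is a sketch under the assumption that A, B, P are nested lists and observations are symbol indices; the function name `forward` is illustrative. Only the current column of forward probabilities is kept, since the final sum needs nothing else.

```python
def forward(A, B, P, obs):
    """P(O | H) via the forward algorithm, in O(N^2 T) operations."""
    n = len(P)
    # Initialization: probability of starting in state i and emitting obs[0].
    f = [P[i] * B[i][obs[0]] for i in range(n)]
    # Induction: fold in one observation at a time.
    for o in obs[1:]:
        f = [sum(f[i] * A[i][j] for i in range(n)) * B[j][o] for j in range(n)]
    # Termination: sum over the possible final states.
    return sum(f)
```

For long sequences the entries of `f` shrink toward zero, so practical code rescales them or works with logarithms.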

Problem 2: Given O and H, find the "optimal" q
- Of course, this depends on the optimality criterion. Several likely candidates:
- Maximize the number of correct individual states. This does not consider transitions, so it may lead to illegal sequences.
- Maximize the number of correct duples, triples, etc.
- Find the single best state sequence, i.e. maximize $P(q \mid O, H)$. This is the most common criterion, and it is solved via the Viterbi algorithm.

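Problem 2 solution: Viterbi Algorithm. Define $\delta_t(i) = \max_{q_1, \dots, q_{t-1}} P(q_1 \dots q_{t-1}, q_t = S_i, O_1 \dots O_t \mid H)$, the highest probability of a single path at time $t$ ending in state $S_i$. Inductively speaking: $\delta_{t+1}(j) = \left[ \max_{i} \delta_t(i)\, a_{ij} \right] b_j(O_{t+1})$.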

Viterbi Algorithm: We need to keep track of the argument which maximizes our delta function for each time $t$ and state $i$. We use the array $r_t(i)$. Initialization: $\delta_1(i) = p_i\, b_i(O_1)$, $r_1(i) = 0$, $1 \le i \le N$.

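Recursion: $\delta_t(j) = \max_{1 \le i \le N} \left[ \delta_{t-1}(i)\, a_{ij} \right] b_j(O_t)$ and $r_t(j) = \arg\max_{1 \le i \le N} \left[ \delta_{t-1}(i)\, a_{ij} \right]$, for $2 \le t \le T$, $1 \le j \le N$. At the end, we have the final probability and the end state: $P^* = \max_{1 \le i \le N} \delta_T(i)$, $q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$.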

Backtrack to get the entire path: $q_t^* = r_{t+1}(q_{t+1}^*)$, for $t = T-1, T-2, \dots, 1$.
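Initialization, recursion and backtracking can be put together as below. This is a sketch assuming nested-list parameters and symbol-index observations; the function name `viterbi` and the variable `back` (playing the role of the backpointer array) are illustrative choices.

```python
def viterbi(A, B, P, obs):
    """Return (most probable state path, its joint probability with obs)."""
    n = len(P)
    # Initialization: best (only) path of length 1 ending in each state.
    delta = [P[i] * B[i][obs[0]] for i in range(n)]
    back = []  # back[t-1][j] = best predecessor of state j at step t
    for o in obs[1:]:
        prev = delta
        pointers = [max(range(n), key=lambda i: prev[i] * A[i][j])
                    for j in range(n)]
        delta = [prev[pointers[j]] * A[pointers[j]][j] * B[j][o]
                 for j in range(n)]
        back.append(pointers)
    # Termination: pick the best final state, then follow the backpointers.
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]
```

The structure mirrors the forward algorithm with the sum replaced by a max, plus the bookkeeping needed to recover the path. Ties are broken toward the lowest state index here, which is an arbitrary choice.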

Problem 3: Given O, estimate parameters for H to maximize $P(O \mid H)$. There is no known way to analytically maximize $P(O \mid H)$, i.e. to solve for the optimal parameters. We can locally maximize $P(O \mid H)$ with the Baum-Welch algorithm.

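Solution to 3: Baum-Welch Algorithm. Quite lengthy and beyond our time frame; suffice it to say, it works. Baum-Welch is itself an instance of expectation-maximization (EM); other solutions to problem 3, such as gradient techniques, are also used.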

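Ergodic vs. Left-to-Right
- Ergodic model: every state can be reached from every other state (in the fully connected case, all $a_{ij} > 0$).
- Left-to-right model: the state index never decreases as time advances, so $a_{ij} = 0$ for $j < i$.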
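The two topologies differ only in which entries of the transition matrix may be nonzero. The sketch below builds one simple left-to-right matrix (stay in place or advance one state, a banded special case chosen for illustration) and checks the structural conditions; all function names are hypothetical.

```python
def left_to_right_matrix(n_states, stay_prob):
    """Banded left-to-right matrix: stay in state i or advance to i + 1."""
    A = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states - 1):
        A[i][i] = stay_prob
        A[i][i + 1] = 1.0 - stay_prob
    A[-1][-1] = 1.0  # the final state can only loop on itself
    return A


def is_left_to_right(A):
    """True if no transition goes back to a lower-numbered state."""
    return all(A[i][j] == 0.0 for i in range(len(A)) for j in range(i))


def is_fully_connected(A):
    """True if every state can reach every state in a single step."""
    return all(x > 0.0 for row in A for x in row)
```

For example, `left_to_right_matrix(3, 0.6)` is upper triangular, whereas a matrix with all entries positive is fully connected and fails `is_left_to_right`.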

Variations on HMM
- Null transition: a transition between states that produces no output, e.g. to model alternate word pronunciations.
- Tied parameters: set up an equivalence relation between parameters, e.g. between the observation probabilities of two states which have the same B. This reduces the size of the model and makes problem 3 easier.
- State duration density: the inherent probability of staying in state $S_i$ for $d$ iterations is $(a_{ii})^{d-1}(1 - a_{ii})$. This may not be appropriate for physical signals, and so an explicit state duration probability density is introduced.
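The inherent duration formula is a geometric distribution, which a short numerical sketch makes visible. The function names are illustrative; the mean duration 1/(1 - a_ii) follows from the geometric form.

```python
def duration_pmf(a_ii, d):
    """Probability of spending exactly d consecutive steps in a state
    whose self-transition probability is a_ii (d >= 1)."""
    return a_ii ** (d - 1) * (1.0 - a_ii)


def expected_duration(a_ii):
    """Mean of the geometric duration distribution, for a_ii < 1."""
    return 1.0 / (1.0 - a_ii)


# The pmf always peaks at d = 1 and decays from there, whatever a_ii is.
pmf_values = [duration_pmf(0.8, d) for d in range(1, 6)]
```

Because the most likely duration is always a single step, this shape is a poor match for, say, a phoneme with a typical length of several frames, which motivates an explicit duration density.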

Issues with HMM implementation
- Scaling: the computation is a product of very many small terms, which can underflow machine precision, so we scale.
- Multiple observation sequences: in a left-to-right model, only a small number of observations is available for each state, so several sequences are required for parameter estimation (problem 3).
- Initial estimate: random or uniform initial values are fine for P and A, but B is sensitive to the initial estimate. Again, this is an issue for problem 3.
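One common way to handle the scaling issue is to renormalize the forward probabilities at every step and accumulate the logarithms of the scale factors. The sketch below assumes nested-list parameters and that every observation has nonzero probability under the model; the function name is illustrative.

```python
import math


def forward_log_likelihood(A, B, P, obs):
    """log P(O | H) using a forward pass rescaled at every time step."""
    n = len(P)
    f = [P[i] * B[i][obs[0]] for i in range(n)]
    log_prob = 0.0
    for t in range(len(obs)):
        if t > 0:
            f = [sum(f[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                 for j in range(n)]
        scale = sum(f)                 # probability mass left after step t
        log_prob += math.log(scale)    # assumes scale > 0
        f = [x / scale for x in f]     # renormalize so entries stay O(1)
    return log_prob
```

The product of the scale factors equals P(O | H), so summing their logs gives the log-likelihood without ever forming the tiny product itself.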

Issues with HMM implementation (continued)
- Insufficient training data, e.g. not enough occurrences of different events in a given state.
  - Possible solution: reduce the model to a subset for which more data exists, and linearly interpolate between the model parameters. The interpolation weights are a function of the amount of training data.
  - Alternatively, impose some lower bound on individual observation probabilities.
- Model choice: ergodic vs. left-to-right (or other), continuous vs. discrete observation densities, number of states, etc.

Markov Processes Used in Composition: Xenakis; Tenney; Hiller; Chadabe (performance); Charles Ames (student of Hiller); many others since.

Example Application: Isolated word recognition (Rabiner)
- Each word $v$ is modeled as a distinct HMM $H_v$.
- Training set of $k$ occurrences per word, $O^{(1)}, \dots, O^{(k)}$, each of which is an observation sequence.
- Need to:
  - estimate parameters for each $H_v$ that maximize $P(O^{(1)}, \dots, O^{(k)} \mid H_v)$ (i.e. problem 3);
  - extract features $O = (O_1, \dots, O_T)$ from the unknown word;
  - calculate $P(O \mid H_v)$ for all $v$ (problem 1), and find the $v$ which maximizes it.
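The recognition step amounts to scoring the unknown observation sequence under every word model and taking the best. The sketch below assumes the models are already trained; the two-word vocabulary, its parameters, and the function names are all hypothetical.

```python
def likelihood(model, obs):
    """P(O | H_v) by the forward algorithm; model is the tuple (A, B, P)."""
    A, B, P = model
    n = len(P)
    f = [P[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        f = [sum(f[i] * A[i][j] for i in range(n)) * B[j][o] for j in range(n)]
    return sum(f)


def recognize(models, obs):
    """Return the word whose HMM assigns obs the highest likelihood."""
    return max(models, key=lambda word: likelihood(models[word], obs))


# Two illustrative left-to-right word models over a 2-entry codebook.
ltr = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "yes": (ltr, [[0.9, 0.1], [0.8, 0.2]], [1.0, 0.0]),  # favors symbol 0
    "no":  (ltr, [[0.1, 0.9], [0.2, 0.8]], [1.0, 0.0]),  # favors symbol 1
}
```

With these toy models, `recognize(models, [0, 0, 0])` picks "yes" and `recognize(models, [1, 1, 1])` picks "no". For realistic sequence lengths the comparison would be made on log-likelihoods to avoid underflow.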

Make Observation
- Feature extraction: at each frame, cepstral coefficients and their derivatives are taken.
- Vector quantization: the observed frame is mapped to a possible observation (codebook entry) via nearest neighbor, assuming discrete observation probabilities.
- Codebook entries are estimated by segmenting the training data and taking the centroid of all frame vectors for each segment, à la k-means clustering.
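The vector quantization step can be sketched as a nearest-neighbor lookup plus a centroid computation. Squared Euclidean distance is an assumption here (the slides do not name the distance measure), and the function names and numbers are illustrative.

```python
def quantize(frame, codebook):
    """Index of the codebook entry nearest to frame (squared Euclidean)."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Illustrative 2-D "feature vectors"; real frames would be cepstral vectors.
codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
symbols = [quantize(f, codebook) for f in [[0.9, 1.2], [3.0, 3.5], [0.1, -0.2]]]
```

The resulting `symbols` list is the discrete observation sequence that the HMM sees in place of the raw frames.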

Choice of Model and Parameters
- A left-to-right model is more appropriate; thus we have $P(q_1 = S_1) = 1$.
- Choice of states, two ideas: let a state correspond to a phoneme, or let a state correspond to an analysis frame.
- Update model parameters:
  - Segment the training data into states based on the current model, using the Viterbi algorithm (problem 2).
  - Update the A and B probabilities based on the observed data. Ex: $b_j(O_k)$ = (number of observed vectors nearest to $O_k$ in state $j$) / (total number of observed vectors in state $j$).
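Once the training data has been segmented into states, the update reduces to counting and normalizing. The sketch below takes the state paths as given (for instance from a Viterbi pass) and assumes observations are already codebook indices. The function name and the uniform fallback for states that never occur are assumptions, not part of the slides.

```python
def reestimate(state_paths, obs_seqs, n_states, n_symbols):
    """Count-based estimates of A and B from aligned state/observation data."""
    trans = [[0] * n_states for _ in range(n_states)]
    emit = [[0] * n_symbols for _ in range(n_states)]
    for path, obs in zip(state_paths, obs_seqs):
        for s, o in zip(path, obs):
            emit[s][o] += 1              # vector with codebook index o seen in state s
        for s, s_next in zip(path, path[1:]):
            trans[s][s_next] += 1        # transition s -> s_next observed

    def normalize(row):
        total = sum(row)
        if total == 0:                   # state never seen: fall back to uniform
            return [1.0 / len(row)] * len(row)
        return [count / total for count in row]

    return [normalize(r) for r in trans], [normalize(r) for r in emit]
```

Alternating this update with a fresh Viterbi segmentation gives an iterative training loop. Rows estimated from few counts are exactly where a lower bound on the probabilities becomes useful.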

State Duration Density: If phoneme segmentation is used, it may be advantageous to determine a state duration density (variable state length for each phoneme). Pyramid of death.

Conclusion
Advantages:
- Has contributed quite a bit to speech recognition.
- With the algorithms we have described, computation is reasonable.
- Complex processes can be modeled with low-dimensional data.
- Works well for time-varying classification; other examples: gesture recognition, formant tracking.
Limitations:
- Assumption that successive observations are independent.
- First-order assumption: the probability of the state at time t depends only on the state at time t-1.
- Needs to be "tailor made" for the specific application.
- Needs lots of training data, in order to see all observations.