
Hidden Markov Models: Probabilistic Reasoning Over Time
Natural Language Processing, CMSC 25000
February 22, 2005

Agenda
Hidden Markov Models
–Uncertain observation
–Temporal context
–Recognition: Viterbi
–Training the model: Baum-Welch
Speech Recognition
–Framing the problem: sounds to sense
–Speech recognition as modern AI

Modelling Processes over Time
Goal: infer the underlying state sequence from the observations. Issue: each new state depends on preceding states, so we are analyzing sequences.
Problem 1: a possibly unbounded number of probability tables (observation + state + time)
–Solution 1: assume a stationary process – the rules governing the process are the same at all times
Problem 2: a possibly unbounded number of parents
–Solution 2: the Markov assumption – only consider a finite history, commonly the last 1 or 2 states

Hidden Markov Models (HMMs)
An HMM consists of:
–1) A set of states Q = {q1, ..., qN}
–2) A set of transition probabilities A = {aij}, where aij is the probability of the transition qi -> qj
–3) Observation probabilities B = {bi(ot)}, the probability of observing ot in state i
–4) An initial probability distribution over states, where πi is the probability of starting in state i
–5) A set of accepting (final) states

Three Problems for HMMs
1) Find the probability of an observation sequence given a model – the Forward algorithm
2) Find the most likely path through a model given an observed sequence – the Viterbi algorithm (decoding)
3) Find the most likely model parameters given an observed sequence – the Baum-Welch (EM) algorithm

Bins and Balls Example Assume there are two bins filled with red and blue balls. Behind a curtain, someone selects a bin and then draws a ball from it (and replaces it). They then select either the same bin or the other one and then select another ball… –(Example due to J. Martin)

Bins and Balls Example
[Figure: two bins, Bin 1 and Bin 2, each containing a mixture of red and blue balls]

Bins and Balls
Initial probabilities Π: Bin 1: 0.9; Bin 2: 0.1
Transition probabilities A:
–Bin 1 -> Bin 1: 0.6; Bin 1 -> Bin 2: 0.4
–Bin 2 -> Bin 1: 0.3; Bin 2 -> Bin 2: 0.7
Observation probabilities B:
–Bin 1: Red 0.7, Blue 0.3
–Bin 2: Red 0.4, Blue 0.6

Bins and Balls
Assume the observation sequence: Blue Blue Red (BBR)
Both bins contain red and blue balls, so any state sequence could produce these observations. However, the sequences are NOT equally likely:
–Big difference in start probabilities
–Observation depends on state
–State depends on prior state

Bins and Balls
Observations: Blue Blue Red
1 1 1: (0.9*0.3)*(0.6*0.3)*(0.6*0.7) = 0.0204
1 1 2: (0.9*0.3)*(0.6*0.3)*(0.4*0.4) = 0.0078
1 2 1: (0.9*0.3)*(0.4*0.6)*(0.3*0.7) = 0.0136
1 2 2: (0.9*0.3)*(0.4*0.6)*(0.7*0.4) = 0.0181
2 1 1: (0.1*0.6)*(0.3*0.3)*(0.6*0.7) = 0.0023
2 1 2: (0.1*0.6)*(0.3*0.3)*(0.4*0.4) = 0.0009
2 2 1: (0.1*0.6)*(0.7*0.6)*(0.3*0.7) = 0.0053
2 2 2: (0.1*0.6)*(0.7*0.6)*(0.7*0.4) = 0.0070

Answers and Issues
To compute the probability of the observed sequence, just add up all the state-sequence probabilities. To find the most likely state sequence, just pick the sequence with the highest value.
Problem: computing all paths is expensive – on the order of 2T*N^T operations.
Solution: dynamic programming – sweep across all states at each time step, summing (Problem 1) or maximizing (Problem 2). A brute-force version is sketched below for comparison.
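To make the brute-force computation concrete, here is a minimal runnable sketch (mine, not the original slides'): it enumerates all 2^3 state sequences for the bins-and-balls model, using the start, transition, and observation probabilities from the tables above. Summing the path probabilities gives P(Blue Blue Red); the maximum identifies the most likely bin sequence.

```python
from itertools import product

# Bins-and-balls HMM parameters (taken from the slide's tables).
start = {1: 0.9, 2: 0.1}                      # pi: initial bin probabilities
trans = {1: {1: 0.6, 2: 0.4},                 # a_ij: P(next bin | current bin)
         2: {1: 0.3, 2: 0.7}}
emit  = {1: {'Red': 0.7, 'Blue': 0.3},        # b_i(o): P(ball color | bin)
         2: {'Red': 0.4, 'Blue': 0.6}}

obs = ['Blue', 'Blue', 'Red']

def path_prob(states, obs):
    """Joint probability of one state sequence and the observation sequence."""
    p = start[states[0]] * emit[states[0]][obs[0]]
    for prev, cur, o in zip(states, states[1:], obs[1:]):
        p *= trans[prev][cur] * emit[cur][o]
    return p

paths = {s: path_prob(s, obs) for s in product([1, 2], repeat=len(obs))}

total = sum(paths.values())                   # Problem 1: P(observations | model)
best = max(paths, key=paths.get)              # Problem 2: most likely state sequence
print(f"P(Blue Blue Red) = {total:.4f}")
print(f"Best path {best} with probability {paths[best]:.4f}")
```

The per-path values agree with the table on the previous slide; the dynamic-programming algorithms that follow compute the same two quantities without enumerating every path.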

Forward Probability
Where α is the forward probability, t is the time in the utterance, i and j are states in the HMM, aij is the transition probability, bj(ot) is the probability of observing ot in state j, N is the last state, and T is the last time.
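The formula itself was an image and did not survive the transcript; the standard forward recurrence that this description refers to (a reconstruction, not the slide's original rendering) is:

```latex
\alpha_1(j) = \pi_j\, b_j(o_1), \qquad
\alpha_t(j) = \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\, b_j(o_t), \qquad
P(O \mid \lambda) = \sum_{j=1}^{N} \alpha_T(j)
```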

Forward Algorithm
Idea: build a matrix where each cell forward[t, j] represents the probability of being in state j after seeing the first t observations. Each cell expresses the probability forward[t, j] = P(o1, o2, ..., ot, qt = j | w), where qt = j means that the t-th state in the sequence of states is state j. Compute this probability by summing over the extensions of all paths leading to the current cell. An extension of a path from a state i at time t-1 to state j at time t is computed by multiplying together:
–i. the previous path probability from the previous cell, forward[t-1, i]
–ii. the transition probability aij from previous state i to current state j
–iii. the observation likelihood bj(ot) that current state j matches observation ot

Forward Algorithm
Function Forward(observations of length T, state-graph) returns forward probability
  num-states <- num-of-states(state-graph)
  Create a path probability matrix forward[num-states+2, T+2]
  forward[0, 0] <- 1.0
  For each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph
        new-score <- forward[s, t] * a[s, s'] * bs'(ot)
        forward[s', t+1] <- forward[s', t+1] + new-score
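A runnable counterpart to the pseudocode above, written with NumPy as a minimal sketch (the array names and 0-based indexing are mine, not the slides'); it fills the trellis column by column and sums the final column to get the observation probability.

```python
import numpy as np

def forward(obs, pi, A, B):
    """Forward algorithm: returns P(obs | model) and the N x T trellis.

    pi: (N,) initial state probabilities
    A:  (N, N) transition probabilities, A[i, j] = P(j | i)
    B:  (N, M) observation probabilities, B[j, o] = P(o | j)
    obs: sequence of observation symbol indices
    """
    N, T = len(pi), len(obs)
    alpha = np.zeros((N, T))
    alpha[:, 0] = pi * B[:, obs[0]]                  # initialization
    for t in range(1, T):
        # alpha[j, t] = sum_i alpha[i, t-1] * A[i, j] * B[j, o_t]
        alpha[:, t] = (alpha[:, t - 1] @ A) * B[:, obs[t]]
    return alpha[:, -1].sum(), alpha

# Bins-and-balls example: states = (Bin 1, Bin 2), symbols = (Red=0, Blue=1)
pi = np.array([0.9, 0.1])
A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])
prob, _ = forward([1, 1, 0], pi, A, B)               # Blue Blue Red
print(f"P(Blue Blue Red) = {prob:.4f}")
```

On the bins-and-balls example this reproduces the brute-force total (about 0.0754) from the earlier sketch.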

Viterbi Algorithm
Find the BEST state sequence given the signal – the sequence maximizing P(sequence | signal).
–Takes an HMM and an observation sequence => returns the best state sequence (and its probability)
Dynamic programming solution:
–Record the most probable path ending at each state i, then the most probable path from i to the end
Complexity: O(bMn)

Viterbi Code
Function Viterbi(observations of length T, state-graph) returns best-path
  num-states <- num-of-states(state-graph)
  Create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0, 0] <- 1.0
  For each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph
        new-score <- viterbi[s, t] * a[s, s'] * bs'(ot)
        if ((viterbi[s', t+1] == 0) || (viterbi[s', t+1] < new-score)) then
          viterbi[s', t+1] <- new-score
          back-pointer[s', t+1] <- s
  Backtrace from the highest-probability state in the final column of viterbi[] and return the path
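A runnable version of the Viterbi pseudocode, again a sketch with my own naming rather than the slides' code; it keeps a backpointer table and traces the best path back from the highest-probability state in the final column.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Viterbi decoding: most likely state sequence and its probability."""
    N, T = len(pi), len(obs)
    delta = np.zeros((N, T))            # best path probability ending in state j at time t
    psi = np.zeros((N, T), dtype=int)   # backpointers
    delta[:, 0] = pi * B[:, obs[0]]
    for t in range(1, T):
        for j in range(N):
            scores = delta[:, t - 1] * A[:, j]
            psi[j, t] = scores.argmax()
            delta[j, t] = scores.max() * B[j, obs[t]]
    # Backtrace from the highest-probability state in the final column.
    path = [int(delta[:, -1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[path[-1], t]))
    return list(reversed(path)), delta[:, -1].max()

pi = np.array([0.9, 0.1])
A = np.array([[0.6, 0.4], [0.3, 0.7]])
B = np.array([[0.7, 0.3], [0.4, 0.6]])   # rows: bins, columns: (Red, Blue)
path, p = viterbi([1, 1, 0], pi, A, B)   # Blue Blue Red
print("Best bin sequence:", [s + 1 for s in path], "probability:", round(p, 4))
```

On Blue Blue Red it recovers the 1 1 1 path with probability 0.0204, matching the enumeration table.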

Modeling Sequences, Redux
Discrete observation values:
–Simple, but inadequate – many observations are highly variable
Gaussian pdfs over continuous values:
–Assume normally distributed observations
–Typically sum over multiple shared Gaussians (“Gaussian mixture models”), trained together with the HMM model

Learning HMMs
Issue: Where do the probabilities come from?
Supervised / manual construction, or
Solution: learn from data
–Trains the transition (aij), emission (bj), and initial (πi) probabilities
–Typically assume the state structure is given
Unsupervised:
–Baum-Welch, aka the forward-backward algorithm
–Iteratively estimate counts of transitions/emissions
–Get estimated probabilities by forward computation, dividing probability mass over contributing paths

Manual Construction
Manually labeled data: observation sequences aligned to ground-truth state sequences.
–Compute (relative) frequencies of state transitions
–Compute frequencies of observations per state
–Compute frequencies of initial states
Bootstrapping: iterate tag, correct, re-estimate, tag (see the estimates sketched below).
Problem: labeled data is expensive, hard or impossible to obtain, and may be inadequate to fully estimate the model (sparseness problems).
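The relative frequencies described here are the standard maximum-likelihood estimates; written out in the aij/bj/πi notation used elsewhere in the deck (my rendering, not the slide's):

```latex
\hat{a}_{ij} = \frac{\mathrm{Count}(q_i \to q_j)}{\sum_{k}\mathrm{Count}(q_i \to q_k)}, \qquad
\hat{b}_{j}(o) = \frac{\mathrm{Count}(q_j \text{ emits } o)}{\mathrm{Count}(q_j)}, \qquad
\hat{\pi}_{i} = \frac{\mathrm{Count}(\text{sequences starting in } q_i)}{\#\ \text{sequences}}
```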

Unsupervised Learning
Re-estimation from unlabeled data – Baum-Welch, aka the forward-backward algorithm.
–Assume a “representative” collection of data, e.g. recorded speech, gene sequences, etc.
–Assign initial probabilities (or estimate them from a very small labeled sample)
–Compute state sequence probabilities given the data, i.e. using the forward algorithm
–Update the transition, emission, and initial probabilities

Updating Probabilities
Intuition: the observations identify state sequences; adjust the transition/emission probabilities to be closer to those consistent with the observations, increasing P(Observations | Model).
Functionally:
–For each state i, what proportion of transitions from state i go to state j?
–For each state i, what proportion of observations in state i match symbol O?
–How often is state i the initial state?

Estimating Transitions
Consider updating the transition aij:
–Compute the probability of all paths using aij
–Compute the probability of all paths through i (with and without i -> j)

Forward Probability
Where α is the forward probability, t is the time in the utterance, i and j are states in the HMM, aij is the transition probability, bj(ot) is the probability of observing ot in state j, N is the final state, T is the last time, and 1 is the start state.

Backward Probability
Where β is the backward probability, t is the time in the sequence, i and j are states in the HMM, aij is the transition probability, bj(ot) is the probability of observing ot in state j, N is the final state, and T is the last time.
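As with the forward probability, the backward formula was lost in transcription; the standard definition matching this description is:

```latex
\beta_T(i) = 1, \qquad
\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)
```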

Re-estimating
–Estimate the transitions from i to j
–Estimate the observations emitted in j
–Estimate the initial state probabilities for i
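The re-estimation formulas were images in the original deck; a standard Baum-Welch formulation consistent with the forward/backward quantities above (a reconstruction, not the slides' exact notation) is:

```latex
\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}, \qquad
\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{P(O \mid \lambda)}
```

```latex
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad
\hat{b}_j(v_k) = \frac{\sum_{t:\, o_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)}, \qquad
\hat{\pi}_i = \gamma_1(i)
```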

Speech Recognition
Goal: given an acoustic signal, identify the sequence of words that produced it.
–(Speech understanding goal: given an acoustic signal, identify the meaning intended by the speaker.)
Issues:
–Ambiguity: many possible pronunciations
–Uncertainty: what signal, and what word/sense, produced this sound sequence?

Decomposing Speech Recognition
Q1: What speech sounds were uttered?
–Human languages use phones, basic sound units: b, m, k, ax, ey, … (ARPAbet)
–Distinctions are categorical to speakers, but acoustically continuous
–Part of the knowledge of a language: build a per-language inventory
–Could we learn these?

Decomposing Speech Recognition
Q2: What words produced these sounds?
–Look up sound sequences in a dictionary
–Problem 1: homophones – two words, same sounds: too, two
–Problem 2: segmentation – no “space” between words in continuous speech: “I scream” / “ice cream”, “Wreck a nice beach” / “Recognize speech”
Q3: What meaning produced these words?
–NLP (but that’s not all!)

Signal Processing
Goal: convert impulses from the microphone into a representation that is compact and encodes features relevant for speech recognition.
Compactness, step 1:
–Sampling rate: how often we look at the data – 8 kHz, 16 kHz (44.1 kHz = CD quality)
–Quantization factor: how much precision – 8-bit, 16-bit (encoding: mu-law, linear, …)

(A Little More) Signal Processing
Compactness & feature identification:
–Capture mid-length speech phenomena: typically overlapping “frames” of 10 ms (80 samples at 8 kHz)
–Vector of features, e.g. energy at some frequency
–Vector quantization: n-feature vectors lie in an n-dimensional space; divide it into m regions (e.g. 256); all vectors in a region get the same label, e.g. C256

Speech Recognition Model
Question: given the signal, what words?
Problem: uncertainty – in the capture of sound by the microphone, in how phones produce sounds, in which words produce which phones, etc.
Solution: a probabilistic model
–P(words|signal) = P(signal|words) P(words) / P(signal)
–Idea: maximize P(signal|words) * P(words)
–P(signal|words): acoustic model; P(words): language model
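Written out, the decomposition on this slide is the standard noisy-channel formulation:

```latex
\hat{W} = \arg\max_{W} P(W \mid \text{signal})
        = \arg\max_{W} \frac{P(\text{signal} \mid W)\, P(W)}{P(\text{signal})}
        = \arg\max_{W} \underbrace{P(\text{signal} \mid W)}_{\text{acoustic model}}\ \underbrace{P(W)}_{\text{language model}}
```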

Language Model
Idea: some utterances are more probable than others.
Standard solution: an “n-gram” model, typically a trigram: P(wi|wi-1, wi-2)
–Collect training data
–Smooth with bigrams & unigrams to handle sparseness
–Take the product over the words in the utterance

Acoustic Model
P(signal|words) decomposes into words -> phones and phones -> vector quantization.
Words -> phones:
–Pronunciation dictionary lookup
–Multiple pronunciations? Use a probability distribution (dialect variation: tomato; plus coarticulation)
–Take the product along the path
[Figure: pronunciation network for “tomato”, branching between [ey] and [aa] with probability 0.5 each]

Pronunciation Example
[Figure: pronunciation network with observation probabilities]

Acoustic Model
P(signal|phones):
–Problem: phones can be pronounced differently (speaker differences, speaking rate, microphone); phones may not even appear, depending on context – the observation sequence is uncertain
Solution: Hidden Markov Models
–1) Hidden => observations are uncertain
–2) Probability of word sequences => state transition probabilities
–3) 1st-order Markov => use one prior state

Sequence Pronunciation Model

Acoustic Model
3-state phone model for [m]:
–Use a Hidden Markov Model (HMM) with states Onset, Mid, End (plus Final)
–Probability of a sequence: sum of the probabilities of the paths
Observation probabilities:
–Onset: C1: 0.5, C2: 0.2, C3: 0.3
–Mid: C3: 0.2, C4: 0.7, C5: 0.1
–End: C4: 0.1, C6: 0.5, C6: 0.4
(plus transition probabilities between states)

Learning HMMs (recap)
Issue: Where do the probabilities come from?
Solution: learn from data
–Trains the transition (aij) and emission (bj) probabilities; typically assume the structure
–Baum-Welch, aka the forward-backward algorithm: iteratively estimate counts of transitions/emissions, get estimated probabilities by forward computation, and divide probability mass over contributing paths

Forward Probability (recap)
Where α is the forward probability, t is the time in the utterance, i and j are states in the HMM, aij is the transition probability, bj(ot) is the probability of observing ot in state j, N is the final state, T is the last time, and 1 is the start state.

Backward Probability (recap)
Where β is the backward probability, t is the time in the utterance, i and j are states in the HMM, aij is the transition probability, bj(ot) is the probability of observing ot in state j, N is the final state, T is the last time, and 1 is the start state.

Re-estimating (recap)
–Estimate the transitions from i to j
–Estimate the observations emitted in j

ASR Training
Models to train:
–Language model: typically a trigram
–Observation likelihoods: B
–Transition probabilities: A
–Pronunciation lexicon: sub-phone, word
Training materials:
–Speech files with word transcriptions
–A large text corpus
–A small phonetically transcribed speech corpus

Training
Language model:
–Uses a large text corpus (e.g. 500M words) to train n-grams
Pronunciation model:
–HMM state graph, manually coded from a dictionary
–Expand to triphone contexts and sub-phone models

HMM Training
Training the observations:
–E.g. Gaussians: set uniform initial means/variances, then train based on the contents of a small (e.g. 4-hour) phonetically labeled speech set (e.g. Switchboard)
Training A & B:
–Forward-backward algorithm training

Does it work?
Yes:
–99% on isolated single digits
–95% on restricted short utterances (air travel)
–80+% on professional news broadcasts
No:
–55% on conversational English
–35% on conversational Mandarin
–?? on noisy cocktail parties

N-grams
Perspective:
–Some sequences (of words/characters) are more likely than others
–Given a sequence, we can guess the most likely next item
Used in:
–Speech recognition
–Spelling correction
–Augmentative communication
–Other NL applications

Corpus Counts
Estimate probabilities by counts in large collections of text/speech.
Issues:
–Wordforms (surface) vs. lemmas (root)
–Case? Punctuation? Disfluency?
–Types (distinct words) vs. tokens (total)

Basic N-grams
Most trivial: 1/#tokens – too simple!
Standard unigram: frequency
–# word occurrences / total corpus size, e.g. the ≈ 0.07; rabbit is far rarer
–Too simple: no context!
Conditional probabilities of word sequences

Markov Assumptions
Exact computation requires too much data, so approximate the probability given all prior words by assuming a finite history.
–Bigram: probability of a word given 1 previous word (first-order Markov)
–Trigram: probability of a word given 2 previous words
The n-gram approximation and the bigram sequence probability are sketched below.
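The approximation formulas referenced above, in standard n-gram notation (reconstructed, since the slide's equations did not survive):

```latex
P(w_1^n) = \prod_{i=1}^{n} P(w_i \mid w_1^{i-1})
\;\approx\; \prod_{i=1}^{n} P(w_i \mid w_{i-N+1}^{\,i-1}) \quad\text{(N-gram)}, \qquad
P(w_1^n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1}) \quad\text{(bigram)}
```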

Issues
Relative frequency:
–Typically compute the count of the sequence and divide by the count of its prefix (see the sketch below)
Corpus sensitivity:
–Shakespeare vs. Wall Street Journal – n-grams trained on one look very unnatural on the other
–Unigrams capture little; bigrams capture collocations; trigrams capture phrases
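A minimal sketch of the relative-frequency (count divided by prefix count) estimate for bigrams, using a toy corpus of my own choosing; real models need far more data plus the smoothing mentioned earlier.

```python
from collections import Counter

def bigram_mle(corpus):
    """Relative-frequency (MLE) bigram estimates: count(w1 w2) / count(w1)."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

# Toy corpus (illustrative only).
probs = bigram_mle(["I want Chinese food", "I want to eat"])
print(probs[("i", "want")])   # 1.0 -- every "i" in this tiny corpus is followed by "want"
```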

Evaluating n-gram models
Entropy & perplexity:
–Information-theoretic measures of the information in a grammar, or of its fit to data
–Conceptually, a lower bound on the number of bits needed to encode
Entropy H(X), where X is a random variable and p is its probability function:
–E.g. 8 equally likely things: number them as a code => 3 bits per transmission
–Alternatively, use short codes for high-probability outcomes and longer ones for low-probability outcomes; this can reduce the average
Perplexity:
–A weighted average of the number of choices (2 to the entropy)
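In symbols (standard definitions, reconstructed since the slide's formulas were images):

```latex
H(X) = -\sum_{x \in \mathcal{X}} p(x)\, \log_2 p(x), \qquad
\mathrm{Perplexity} = 2^{H}
```

For the 8-equally-likely-outcomes example, H = log2 8 = 3 bits and the perplexity is 8.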

Entropy of a Sequence
–Per-word entropy of a basic sequence; the entropy of a language requires taking the limit over sequences of infinite length
–Assume the source is stationary & ergodic

Cross-Entropy
Comparing models:
–The actual distribution is unknown, so use a simplified model to estimate it
–A closer match will have lower cross-entropy
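The cross-entropy of a model m with respect to the true distribution p (standard definition, reconstructed), together with the bound that makes it useful for comparing models:

```latex
H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{w_1^n} p(w_1^n)\, \log_2 m(w_1^n), \qquad
H(p) \le H(p, m)
```

For a stationary, ergodic source (as assumed two slides earlier), this can be estimated from a single long sample as -(1/n) log2 m(w_1^n).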

Speech Recognition as Modern AI
Draws on a wide range of AI techniques:
–Knowledge representation & manipulation; optimal search: Viterbi decoding
–Machine learning: Baum-Welch for HMMs; nearest neighbor & k-means clustering for signal identification
–Probabilistic reasoning / Bayes rule: managing uncertainty in the signal-to-phone-to-word mapping
Enables real-world applications.