Speech Processing Speech Recognition


1 Speech Processing Speech Recognition
August 19, 2005

2 Hidden Markov Models Bonnie Dorr Christof Monz
CMSC 723: Introduction to Computational Linguistics, Lecture 5, October 6, 2004

3 Hidden Markov Model (HMM)
HMMs allow you to estimate probabilities of unobserved events. Given observed data, which underlying (hidden) parameters generated the surface forms? E.g., in speech recognition, the observed data is the acoustic signal and the words are the hidden parameters.

4 HMMs and their Usage HMMs are very common in Computational Linguistics:
Speech recognition (observed: acoustic signal, hidden: words)
Handwriting recognition (observed: image, hidden: words)
Part-of-speech tagging (observed: words, hidden: part-of-speech tags)
Machine translation (observed: foreign words, hidden: words in target language)

5 ASR Lexicon: Markov Models for pronunciation

6 The Hidden Markov model

7 Formal definition of HMM
States: a set of states Q = q1, q2, …, qN
Transition probabilities: a set of probabilities A = a01, a02, …, an1, …, ann, where each aij represents P(j | i), the probability of moving from state i to state j
Observation likelihoods: a set of likelihoods B = bi(ot), the probability that state i generated observation ot
Special non-emitting initial and final states

8 Pieces of the HMM Observation likelihoods ('b'), p(o|q), represent the acoustics of each phone and are computed by the Gaussians ("Acoustic Model", or AM)
Transition probabilities represent the probability of different pronunciations (different sequences of phones)
States correspond to phones

9 Pieces of the HMM Actually, I lied when I said states correspond to phones. States usually correspond to triphones.
CHEESE (phones): ch iy z
CHEESE (triphones): #-ch+iy, ch-iy+z, iy-z+#

10 Pieces of the HMM Actually, I lied again when I said states correspond to triphones. In fact, each triphone has 3 states, for the beginning, middle, and end of the triphone.

11 A real HMM

12 Noisy Channel Model In speech recognition you observe an acoustic signal (A = a1, …, an) and you want to determine the most likely sequence of words (W = w1, …, wn): P(W | A)
Problem: A and W are too specific for reliable counts on observed data, and are very unlikely to occur in unseen data

13 Noisy Channel Model Assume that the acoustic signal (A) is already segmented with respect to word boundaries. P(W | A) could then be computed as P(W | A) = ∏i P(wi | ai)
Problem: Finding the most likely word corresponding to an acoustic representation depends on the context
E.g., /'pre-z&ns/ could mean "presents" or "presence" depending on the context

14 Noisy Channel Model Given a candidate sequence W we need to compute P(W) and combine it with P(W | A)
Applying Bayes' rule: P(W | A) = P(A | W) P(W) / P(A)
The denominator P(A) can be dropped, because it is constant for all W, giving Ŵ = argmaxW P(A | W) P(W)

15 Noisy Channel in a Picture

16 Decoding The decoder combines evidence from two sources:
The likelihood: P(A | W). Under the word-boundary segmentation assumed above, this can be approximated as ∏i P(ai | wi)
The prior: P(W)
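To make the combination concrete, here is a minimal Python sketch (mine, not from the lecture) that scores candidate word sequences by log P(A | W) + log P(W), assuming one acoustic segment per word; the acoustic and bigram log-probabilities are invented numbers, and the candidate words echo the search-space slide that follows.

from itertools import product

# Hypothetical per-segment acoustic log-likelihoods log P(a_i | w) and a bigram
# language model log P(w_i | w_{i-1}); all values are made up for illustration.
acoustic_logprob = [
    {"bought": -3.2, "boat": -3.5, "bold": -4.1},   # segment /'bot/
    {"expensive": -2.8, "excessive": -3.0},         # segment /ik-'spen-siv/
    {"presents": -2.5, "presence": -2.6},           # segment /'pre-z&ns/
]
bigram_logprob = {
    ("<s>", "bought"): -2.0, ("<s>", "boat"): -2.5,
    ("bought", "expensive"): -3.5, ("boat", "expensive"): -4.0,
    ("expensive", "presents"): -2.2, ("expensive", "presence"): -4.0,
}

def score(words, floor=-10.0):
    """log P(A | W) + log P(W), one acoustic segment per word."""
    total = sum(acoustic_logprob[i][w] for i, w in enumerate(words))
    prev = "<s>"
    for w in words:
        total += bigram_logprob.get((prev, w), floor)  # unseen bigrams get a floor
        prev = w
    return total

candidates = product(*[list(d) for d in acoustic_logprob])
best = max(candidates, key=score)
print(best, score(best))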

17 Search Space Given a word-segmented acoustic sequence, list all candidates and compute the most likely path:
'bot      ik-'spen-siv    'pre-z&ns
boat      excessive       presidents
bald      expensive       presence
bold      expressive      presents
bought    inactive        press

18 Markov Assumption The Markov assumption states that the probability of the occurrence of word wi at time t depends only on the occurrence of word wi-1 at time t-1
Chain rule: P(w1, …, wn) = ∏i P(wi | w1, …, wi-1)
Markov assumption: P(wi | w1, …, wi-1) ≈ P(wi | wi-1)
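To illustrate the bigram estimate that the Markov assumption licenses, here is a small Python sketch (toy corpus, invented data) computing P(wi | wi-1) by maximum likelihood from counts.

from collections import Counter

corpus = "the boat was expensive . the presents were expensive .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w_prev, w):
    # Maximum-likelihood estimate: count(w_prev, w) / count(w_prev)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

print(p_bigram("the", "boat"))      # 0.5
print(p_bigram("the", "presents"))  # 0.5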

19 The Trellis

20 Parameters of an HMM States: A set of states S=s1,…,sn
Transition probabilities: A = a1,1, a1,2, …, an,n. Each ai,j represents the probability of transitioning from state si to sj
Emission probabilities: a set B of functions of the form bi(ot), which is the probability of observation ot being emitted by si
Initial state distribution: π, where πi is the probability that si is a start state
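For concreteness, an illustrative container for these parameters (a sketch following the notation above, not code from the lecture), using numpy arrays with the shapes the definitions imply.

import numpy as np
from dataclasses import dataclass

@dataclass
class HMM:
    """Discrete-observation HMM. A is N x N, B is N x V (V observation symbols), pi has length N."""
    A: np.ndarray    # A[i, j] = probability of transitioning from s_i to s_j
    B: np.ndarray    # B[i, k] = b_i(v_k), probability that s_i emits symbol v_k
    pi: np.ndarray   # pi[i]   = probability that s_i is a start state

# Example: a 2-state, 3-symbol HMM with made-up parameters.
model = HMM(
    A=np.array([[0.7, 0.3], [0.4, 0.6]]),
    B=np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]),
    pi=np.array([0.6, 0.4]),
)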

21 The Three Basic HMM Problems
Problem 1 (Evaluation): Given the observation sequence O = o1, …, oT and an HMM model λ = (A, B, π), how do we compute the probability of O given the model, P(O | λ)?
Problem 2 (Decoding): Given the observation sequence O = o1, …, oT and an HMM model λ, how do we find the state sequence that best explains the observations?
Problem 3 (Learning): How do we adjust the model parameters λ to maximize P(O | λ)?

22 Problem 1: Probability of an Observation Sequence
What is P(O | λ)? The probability of an observation sequence is the sum of the probabilities of all possible state sequences in the HMM.
Naïve computation is very expensive: given T observations and N states, there are N^T possible state sequences. Even small HMMs, e.g. T=10 and N=10, contain 10 billion different paths.
The solution to this and Problem 2 is to use dynamic programming.

23 Forward Probabilities
What is the probability that, given an HMM λ, at time t the state is i and the partial observation o1 … ot has been generated? This is the forward probability αt(i) = P(o1 … ot, qt = si | λ)

24 Forward Probabilities

25 Forward Algorithm
Initialization: α1(i) = πi bi(o1), 1 ≤ i ≤ N
Induction: αt(j) = [Σi=1..N αt-1(i) aij] bj(ot), 2 ≤ t ≤ T, 1 ≤ j ≤ N
Termination: P(O | λ) = Σi=1..N αT(i)
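A minimal numpy sketch of the forward algorithm as stated above, assuming a discrete-observation HMM with parameters (A, B, π) as defined earlier; names are my own.

import numpy as np

def forward(A, B, pi, obs):
    """A: (N, N) transitions, B: (N, V) emissions, pi: (N,) initial distribution,
    obs: list of observation-symbol indices. Returns (alpha, P(O | model))."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                  # initialization
    for t in range(1, T):                         # induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()                 # termination: sum_i alpha_T(i)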

26 Forward Algorithm Complexity
In the naïve approach to solving Problem 1, it takes on the order of 2T · N^T computations
The forward algorithm takes on the order of N^2 · T computations

27 Backward Probabilities
Analogous to the forward probability, just in the other direction: what is the probability that, given an HMM λ and given that the state at time t is i, the partial observation ot+1 … oT is generated? This is the backward probability βt(i) = P(ot+1 … oT | qt = si, λ)

28 Backward Probabilities

29 Backward Algorithm
Initialization: βT(i) = 1, 1 ≤ i ≤ N
Induction: βt(i) = Σj=1..N aij bj(ot+1) βt+1(j), t = T-1, …, 1
Termination: P(O | λ) = Σi=1..N πi bi(o1) β1(i)
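A matching numpy sketch of the backward algorithm, under the same assumptions as the forward sketch above.

import numpy as np

def backward(A, B, pi, obs):
    """Returns (beta, P(O | model)); shapes as in forward()."""
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0                                     # initialization
    for t in range(T - 2, -1, -1):                        # induction
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta, (pi * B[:, obs[0]] * beta[0]).sum()      # termination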

30 Problem 2: Decoding The solution to Problem 1 (Evaluation) gives us the sum over all paths through an HMM efficiently. For Problem 2, we want to find the single path with the highest probability.
We want to find the state sequence Q = q1 … qT such that Q = argmaxQ P(Q | O, λ)

31 Viterbi Algorithm Similar to computing the forward probabilities, but instead of summing over transitions from incoming states, compute the maximum
Forward: αt(j) = [Σi=1..N αt-1(i) aij] bj(ot)
Viterbi recursion: δt(j) = [maxi=1..N δt-1(i) aij] bj(ot)

32 Viterbi Algorithm
Initialization: δ1(i) = πi bi(o1), ψ1(i) = 0
Induction: δt(j) = [maxi δt-1(i) aij] bj(ot), ψt(j) = argmaxi δt-1(i) aij
Termination: P* = maxi δT(i), qT* = argmaxi δT(i)
Read out path: qt* = ψt+1(qt+1*), t = T-1, …, 1
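A numpy sketch of Viterbi with backpointers (the ψ above), again for the discrete HMM of the earlier sketches.

import numpy as np

def viterbi(A, B, pi, obs):
    """Returns (best state path, probability of that path)."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):                             # induction
        scores = delta[t - 1][:, None] * A            # scores[i, j] = delta_{t-1}(i) a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]                  # termination
    for t in range(T - 1, 0, -1):                     # read out path via backpointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1], delta[-1].max()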

33 Problem 3: Learning Up to now we've assumed that we know the underlying model λ = (A, B, π)
Often these parameters are estimated on annotated training data, which has two drawbacks: annotation is difficult and/or expensive, and training data is different from the current data
We want to maximize the parameters with respect to the current data, i.e., we're looking for a model λ' such that λ' = argmaxλ P(O | λ)

34 Problem 3: Learning Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ* such that P(O | λ*) ≥ P(O | λ) for all λ
But it is possible to find a local maximum: given an initial model λ, we can always find a model λ' such that P(O | λ') ≥ P(O | λ)

35 Parameter Re-estimation
Use the forward-backward (or Baum-Welch) algorithm, which is a hill-climbing algorithm
Using an initial parameter instantiation, the forward-backward algorithm iteratively re-estimates the parameters and improves the probability that the given observations are generated by the new parameters

36 Parameter Re-estimation
Three sets of parameters need to be re-estimated:
Initial state distribution: πi
Transition probabilities: ai,j
Emission probabilities: bi(ot)

37 Re-estimating Transition Probabilities
What’s the probability of being in state si at time t and going to state sj, given the current model and parameters? 1/18/2019

38 Re-estimating Transition Probabilities
ξt(i, j) = αt(i) aij bj(ot+1) βt+1(j) / P(O | λ)

39 Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is: âi,j = expected number of transitions from state si to state sj / expected number of transitions from state si
Formally: âi,j = Σt=1..T-1 ξt(i, j) / Σt=1..T-1 Σj'=1..N ξt(i, j')

40 Re-estimating Transition Probabilities
Defining γt(i) = Σj=1..N ξt(i, j) as the probability of being in state si at time t, given the complete observation O, we can say:
âi,j = Σt=1..T-1 ξt(i, j) / Σt=1..T-1 γt(i)
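To connect ξ and γ to the forward and backward passes, an illustrative E-step sketch computing both from α and β (same discrete-HMM assumptions and naming as the sketches above).

import numpy as np

def xi_gamma(A, B, obs, alpha, beta):
    """xi[t, i, j] = P(q_t = s_i, q_{t+1} = s_j | O, model),
    gamma[t, i] = P(q_t = s_i | O, model)."""
    T = len(obs)
    p_obs = alpha[-1].sum()                           # P(O | model)
    xi = np.zeros((T - 1, A.shape[0], A.shape[1]))
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A
                 * B[:, obs[t + 1]][None, :] * beta[t + 1][None, :]) / p_obs
    gamma = alpha * beta / p_obs
    return xi, gamma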

41 Review of Probabilities
Forward probability αt(i): the probability of being in state si, given the partial observation o1, …, ot
Backward probability βt(i): the probability of being in state si, given the partial observation ot+1, …, oT
Transition probability ξt(i, j): the probability of going from state si to state sj, given the complete observation o1, …, oT
State probability γt(i): the probability of being in state si, given the complete observation o1, …, oT

42 Re-estimating Initial State Probabilities
Initial state distribution: πi is the probability that si is a start state
Re-estimation is easy: π̂i = expected frequency in state si at time t = 1
Formally: π̂i = γ1(i)

43 Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as: b̂i(k) = expected number of times in state si observing symbol vk / expected number of times in state si
Formally: b̂i(k) = Σt=1..T δ(ot, vk) γt(i) / Σt=1..T γt(i), where δ(ot, vk) = 1 if ot = vk, and 0 otherwise
Note that δ here is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm!

44 The Updated Model Coming from λ = (A, B, π) we get to λ' = (Â, B̂, π̂) by the following update rules:
âi,j = Σt ξt(i, j) / Σt γt(i);  b̂i(k) = Σt δ(ot, vk) γt(i) / Σt γt(i);  π̂i = γ1(i)
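An illustrative M-step sketch applying these update rules to the ξ and γ computed above (discrete observations assumed); one E-step plus one M-step is one Baum-Welch iteration.

import numpy as np

def m_step(xi, gamma, obs, n_symbols):
    """Re-estimate (A, B, pi) from xi (T-1, N, N), gamma (T, N) and obs (T,)."""
    obs = np.asarray(obs)
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]     # a_ij
    B_new = np.stack([gamma[obs == k].sum(axis=0)                # b_i(k)
                      for k in range(n_symbols)],
                     axis=1) / gamma.sum(axis=0)[:, None]
    pi_new = gamma[0]                                            # pi_i
    return A_new, B_new, pi_new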

45 Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithm
The E step: compute the forward and backward probabilities for a given model
The M step: re-estimate the model parameters

46 ASR Lexicon: Markov Models for pronunciation

47 The Hidden Markov model

48 Formal definition of HMM
States: a set of states Q = q1, q2, …, qN
Transition probabilities: a set of probabilities A = a01, a02, …, an1, …, ann, where each aij represents P(j | i), the probability of moving from state i to state j
Observation likelihoods: a set of likelihoods B = bi(ot), the probability that state i generated observation ot
Special non-emitting initial and final states

49 Pieces of the HMM Observation likelihoods ('b'), p(o|q), represent the acoustics of each phone and are computed by the Gaussians ("Acoustic Model", or AM)
Transition probabilities represent the probability of different pronunciations (different sequences of phones)
States correspond to phones

50 Pieces of the HMM Actually, I lied when I said states correspond to phones. States usually correspond to triphones.
CHEESE (phones): ch iy z
CHEESE (triphones): #-ch+iy, ch-iy+z, iy-z+#

51 Pieces of the HMM Actually, I lied again when I said states correspond to triphones. In fact, each triphone has 3 states, for the beginning, middle, and end of the triphone.

52 A real HMM

53 Cross-word triphones
Word-Internal Context-Dependent Models, 'OUR LIST': SIL AA+R AA-R L+IH L-IH+S IH-S+T S-T
Cross-Word Context-Dependent Models, 'OUR LIST': SIL-AA+R AA-R+L R-L+IH L-IH+S IH-S+T S-T+SIL

54 Summary: ASR Architecture
The Noisy Channel Model
Five easy pieces of an ASR system:
Feature Extraction: 39 "MFCC" features
Acoustic Model: Gaussians for computing p(o|q)
Lexicon/Pronunciation Model: HMM
Next step, Decoding: how to combine these to compute words from speech!

55 Acoustic Modeling Given a 39-dimensional vector corresponding to the observation of one frame oi, and given a phone q we want to detect, compute p(oi | q)
Most popular method: GMM (Gaussian mixture models)
Other methods: MLP (multi-layer perceptron)

56 Acoustic Modeling: MLP computes p(q|o)

57 Gaussian Mixture Models
Also called "fully-continuous HMMs"
P(o|q) computed by a Gaussian: p(o | q) = (1 / √(2πσ²)) exp(-(o - μ)² / (2σ²))

58 Gaussians for Acoustic Modeling
A Gaussian is parameterized by a mean and a variance
(Figure: Gaussians with different means; P(o|q) is highest at the mean and low far from the mean)

59 Training Gaussians A (single) Gaussian is characterized by a mean and a variance
Imagine that we had some training data in which each phone was labeled
We could just compute the mean and variance from the data: μ̂ = (1/N) Σi oi,  σ̂² = (1/N) Σi (oi - μ̂)²
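A tiny sketch of this idea with invented 1-dimensional frame values and phone labels; in a real system each frame would be a 39-dimensional MFCC vector.

import numpy as np

frames = np.array([1.1, 0.9, 1.3, 4.8, 5.2, 5.0])        # made-up 1-d features
labels = np.array(["iy", "iy", "iy", "aa", "aa", "aa"])  # made-up phone labels

def train_gaussian(phone):
    o = frames[labels == phone]
    mu = o.mean()                    # mu_hat = (1/N) sum_i o_i
    var = ((o - mu) ** 2).mean()     # sigma_hat^2 = (1/N) sum_i (o_i - mu_hat)^2
    return mu, var

print(train_gaussian("iy"))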

60 But we need 39 gaussians, not 1!
The observation o is really a vector of length 39
So we need a vector of Gaussians, one per dimension: p(o | q) = ∏d=1..39 (1 / √(2πσd²)) exp(-(od - μd)² / (2σd²))

61 Actually, mixture of gaussians
Each phone is modeled by a weighted sum of different Gaussians: p(o | q) = Σm cm N(o; μm, σm²)
Hence able to model complex facts about the data
(Figure: example mixture densities for Phone A and Phone B)

62 Gaussians acoustic modeling
Summary: each phone is represented by a GMM parameterized by:
M mixture weights
M mean vectors
M covariance matrices
Usually we assume the covariance matrix is diagonal, i.e., we keep a separate variance for each cepstral feature
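A sketch of evaluating log p(o | q) for one frame under a diagonal-covariance GMM with made-up parameters (M = 2 mixtures over 39-dimensional frames), using log-sum-exp for numerical stability.

import numpy as np

def gmm_log_likelihood(o, weights, means, variances):
    """o: (39,) frame; weights: (M,); means, variances: (M, 39), diagonal covariance."""
    # Per-mixture diagonal Gaussian log-density, summed over the 39 dimensions.
    log_dens = -0.5 * (np.log(2 * np.pi * variances)
                       + (o - means) ** 2 / variances).sum(axis=1)
    # log sum_m c_m N(o; mu_m, sigma_m^2), computed stably.
    a = np.log(weights) + log_dens
    return a.max() + np.log(np.exp(a - a.max()).sum())

rng = np.random.default_rng(0)
o = rng.normal(size=39)
print(gmm_log_likelihood(o, np.array([0.6, 0.4]),
                         rng.normal(size=(2, 39)), np.ones((2, 39))))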

63 Summary: ASR Architecture
The Noisy Channel Model
Five easy pieces of an ASR system:
Feature Extraction: 39 "MFCC" features
Acoustic Model: Gaussians for computing p(o|q)
Lexicon/Pronunciation Model: HMM
Next time, Decoding: how to combine these to compute words from speech!

