
Slide 1: LSA 352 Speech Recognition and Synthesis, Summer 2007. Dan Jurafsky. Lecture 6: Feature Extraction and Acoustic Modeling. IP Notice: Various slides were derived from Andrew Ng's CS 229 notes, as well as lecture notes from Chen, Picheny et al., Yun-Hsuan Sung, and Bryan Pellom. I'll try to give correct credit on each slide, but I'll probably miss some.

Slide 2: MFCC. The mel-frequency cepstral coefficient (MFCC) is the most widely used spectral representation in ASR.

Slide 3: Windowing. Why divide the speech signal into successive overlapping frames? Speech is not a stationary signal; we want information about a region small enough that the spectral information is a useful cue. Frame size: typically 10-25 ms. Frame shift (the length of time between successive frames): typically 5-10 ms.
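
A minimal Python sketch of slicing a waveform into overlapping frames (the 16 kHz sample rate, the 25 ms / 10 ms choice, and the use of NumPy are assumptions for illustration, not taken from the slide):

    import numpy as np

    def frame_signal(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
        """Split a 1-D signal into overlapping frames."""
        frame_len = int(sample_rate * frame_ms / 1000)    # e.g. 400 samples
        frame_shift = int(sample_rate * shift_ms / 1000)  # e.g. 160 samples
        n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
        return np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                         for i in range(n_frames)])       # shape: (n_frames, frame_len)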

Slide 4: Windowing. (Slide from Bryan Pellom.)

Slide 5: MFCC.

Slide 6: Discrete Fourier Transform: computing a spectrum. A 24 ms Hamming-windowed signal, and its spectrum as computed by the DFT (plus other smoothing).
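
A minimal Python sketch of this step for one frame: apply a Hamming window and take the DFT magnitude (NumPy's FFT is an assumed implementation choice):

    import numpy as np

    def frame_spectrum(frame):
        """Hamming-window a frame and return its DFT magnitude spectrum."""
        windowed = frame * np.hamming(len(frame))
        return np.abs(np.fft.rfft(windowed))   # one-sided magnitude spectrum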

Slide 7: MFCC.

Slide 8: Mel scale. Human hearing is not equally sensitive to all frequency bands; it is less sensitive at higher frequencies, roughly above 1000 Hz. That is, human perception of frequency is non-linear.

Slide 9: Mel scale. A mel is a unit of pitch, defined so that pairs of sounds that are perceptually equidistant in pitch are separated by an equal number of mels. The mel scale is approximately linear below 1 kHz and logarithmic above 1 kHz.
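
A commonly used closed form for the mel scale, matching the "linear below 1 kHz, logarithmic above" behavior (the exact constants vary across implementations and are an assumption here, not taken from the slide):

    import numpy as np

    def hz_to_mel(f_hz):
        """Map frequency in Hz to mels (one common variant of the formula)."""
        return 2595.0 * np.log10(1.0 + f_hz / 700.0)

    # hz_to_mel(1000) is about 1000 mels; hz_to_mel(8000) is about 2840 mels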

Slide 10: Mel filter bank processing. The mel filter bank is uniformly spaced below 1 kHz and follows a logarithmic scale above 1 kHz.

Slide 11: MFCC.

Slide 12: Log energy computation. Compute the logarithm of the squared magnitude of the output of the mel filter bank.

Slide 13: Log energy computation. Why log energy? The logarithm compresses the dynamic range of values, and human response to signal level is logarithmic: humans are less sensitive to slight differences in amplitude at high amplitudes than at low amplitudes. It also makes frequency estimates less sensitive to slight variations in the input (power variation due to the speaker's mouth moving closer to the mike). Phase information is not helpful in speech.
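
A minimal sketch of this step, assuming mel_outputs holds the filter bank outputs for one frame (the small flooring constant is an implementation choice to avoid log(0)):

    import numpy as np

    def log_mel_energies(mel_outputs):
        """Log of the squared magnitude of each mel filter output."""
        power = np.abs(mel_outputs) ** 2
        return np.log(power + 1e-10)   # floor avoids log(0) on silent frames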

Slide 14: MFCC.

Slide 15: MFCC.

Slide 16: Dynamic cepstral coefficients. The cepstral coefficients do not capture energy, so we add an energy feature. Also, we know that the speech signal is not constant (slope of formants, change from stop burst to release), so we want to add the changes in the features (the slopes). We call these delta features; we also add double-delta (acceleration) features.

Slide 17: Delta and double-delta. Derivatives are taken in order to obtain temporal information.
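
A minimal sketch of delta features as a central difference over time; real systems typically use a regression over several surrounding frames, so this two-neighbor version is an illustrative simplification:

    import numpy as np

    def delta_features(feats):
        """First-order time differences; feats has shape (n_frames, n_dims)."""
        padded = np.pad(feats, ((1, 1), (0, 0)), mode="edge")
        return (padded[2:] - padded[:-2]) / 2.0   # central difference per frame

    # double-deltas ("acceleration"): apply delta_features to the deltas themselves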

Slide 18: Typical MFCC features.
Window size: 25 ms
Window shift: 10 ms
Pre-emphasis coefficient: 0.97
Features: 12 MFCCs (mel-frequency cepstral coefficients), 1 energy feature, 12 delta MFCCs, 12 double-delta MFCCs, 1 delta energy feature, 1 double-delta energy feature.
Total: 39-dimensional feature vector.
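
One way to assemble such a 39-dimensional vector, sketched with librosa (the library choice, its default mel filter bank, and using the 0th cepstral coefficient as the energy-like term are assumptions; the slide describes the features, not a toolkit):

    import numpy as np
    import librosa

    def mfcc_39(wav_path):
        """13 cepstra (12 MFCCs + an energy-like C0) plus deltas and double-deltas."""
        y, sr = librosa.load(wav_path, sr=16000)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))
        delta = librosa.feature.delta(mfcc)
        delta2 = librosa.feature.delta(mfcc, order=2)
        return np.vstack([mfcc, delta, delta2])   # shape: (39, n_frames)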

Slide 19: Why is MFCC so popular? It is efficient to compute, incorporates a perceptual mel frequency scale, and separates the source and filter. The IDFT (DCT) decorrelates the features, which improves the diagonal-covariance assumption in HMM modeling. An alternative is PLP.

Slide 20: ...and now what? What we have is what we observe: a matrix of observations O = [O1, O2, ..., OT].

Slide 21.

Slide 22: Multivariate Gaussians. Defined by a vector of means μ and a covariance matrix Σ; we can also use M mixtures of Gaussians.

Slide 23: Gaussian intuitions: off-diagonal. As we increase the off-diagonal entries, there is more correlation between the value of x and the value of y. (Text and figures from Andrew Ng's lecture notes for CS229.)

Slide 24: Diagonal covariance. The diagonal contains the variance of each dimension, σ_ii², so we consider the variance of each acoustic feature (dimension) separately.
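
A minimal sketch of evaluating a diagonal-covariance Gaussian mixture for one observation vector (the parameter names and the log-domain evaluation are implementation choices, not from the slides):

    import numpy as np

    def log_gmm_likelihood(o, weights, means, variances):
        """o: (D,) observation; weights: (M,); means, variances: (M, D) per mixture."""
        # log N(o; mu_m, diag(var_m)) for each mixture component m
        log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
        log_exp = -0.5 * np.sum((o - means) ** 2 / variances, axis=1)
        log_components = np.log(weights) + log_norm + log_exp
        # log of sum_m w_m * N(o; mu_m, var_m), computed stably
        top = np.max(log_components)
        return top + np.log(np.sum(np.exp(log_components - top)))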

Slide 25: Gaussian intuitions: size of Σ. With μ = [0 0] in each case, and Σ = I, Σ = 0.6I, Σ = 2I: as Σ becomes larger, the Gaussian becomes more spread out; as Σ becomes smaller, the Gaussian becomes more compressed. (Text and figures from Andrew Ng's lecture notes for CS229.)

Slide 26: Observation "probability". How "well" an observation "fits" into a given state; for example, observation 3 in state 2. (Text and figures from Andrew Ng's lecture notes for CS229.)

Slide 27: Gaussian PDFs. A Gaussian is a probability density function; probability is the area under the curve, and to make it a probability we constrain that area to equal 1. But we will be using "point estimates", the value of the Gaussian at a point. Technically these are not probabilities, since a pdf gives a probability over an interval and needs to be multiplied by dx. This is OK, though, since the same factor is omitted from all Gaussians; we will use argmax, so the result is still correct.

Slide 28: Gaussian as a probability density function.

Slide 29: Gaussian as a probability density function: point estimate.

Slide 30: Using a Gaussian as an acoustic likelihood estimator. Example: suppose our observation were a single real-valued feature (instead of a 33-dimensional vector). Then, if we had learned a Gaussian over the distribution of values of this feature, we could compute the likelihood of any given observation o_t as follows:
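
A minimal sketch of that univariate Gaussian likelihood for an observation o_t, given a learned mean and variance (names are illustrative):

    import math

    def gaussian_likelihood(o_t, mu, var):
        """Univariate Gaussian density evaluated at o_t."""
        return math.exp(-((o_t - mu) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)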

Slide 31: ...what about transition probabilities?

Slide 32: So, how do we train an HMM? First decide the topology: number of states, transitions, etc. HMM = λ.

Slide 33: In our HTK example (HMM prototypes). The file name MUST be equal to the linguistic unit: si, no, and sil. We will use 12 states (14 in HTK: 12 plus 2 non-emitting start/end states), two Gaussians (mixture components) per state, and dimension 33.

Slide 34: In our HTK example (HMM prototypes). So, for example, the file named si will be:

~o <VecSize> 33
~h "si"
<BeginHMM>
<NumStates> 14
<State> 2
  <NumMixes> 2
  <Stream> 1
    <Mixture> 1 0.6
      <Mean> 33
        0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
        0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
        0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
      <Variance> 33
        1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
        1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
        1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
    <Mixture> 2 0.4
      <Mean> 33
        0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
        0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
        0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
      <Variance> 33
        1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
        1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
        1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
<State> 3
  <NumMixes> 2
  <Stream> 1
    ......................................

Slide 35: In our HTK example (HMM prototypes), continued. The prototype ends with the 14x14 transition matrix: the non-emitting entry state 1 moves to state 2 with probability 1.0, each emitting state stays in place with probability 0.6 or moves to the next state with probability 0.4, and the final row (the non-emitting exit state 14) is all zeros:

0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.6 0.4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
...
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.6 0.4
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

Slide 36: ...but how do we train an HMM? Training means estimating the transition and observation probabilities (weights, means, and variances) of the HMM λ for a particular linguistic unit.

Slide 37: ...but how do we train an HMM? We have some training data for each linguistic unit.

Slide 38: ...but how do we train an HMM? Example for a (single) Gaussian characterized by a mean and a variance.

Slide 39: ...but how do we train an HMM?

Slide 40: Uniform segmentation is used by HInit. (Output: HMMs.)

Slide 41: ...what is the "hidden" sequence of states? Given: an HMM λ and a sequence of states S.

Slide 42: ...what is the "hidden" sequence of states? Given λ and a particular state sequence S, the probability can be easily obtained.

Slide 43: ...what is the "hidden" sequence of states? ...and by summing over all possible state sequences S.

Slide 44: ...what is the "hidden" sequence of states? ...and by summing over all possible state sequences S (Baum-Welch).
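
Summing over all state sequences explicitly would be exponential in the number of frames; a minimal sketch of the forward recursion, which computes the same total likelihood efficiently (reading the slide this way is an interpretation, and the variable names are illustrative):

    import numpy as np

    def forward_prob(A, B, pi):
        """Total likelihood P(O | lambda), summed over all state sequences.

        A:  (N, N) transition probabilities a[i, j]
        B:  (T, N) state likelihoods b_j(o_t) for each frame t
        pi: (N,)   initial state probabilities
        """
        T, _ = B.shape
        alpha = pi * B[0]                  # alpha_1(j)
        for t in range(1, T):
            alpha = (alpha @ A) * B[t]     # sum_i alpha_{t-1}(i) a_ij, times b_j(o_t)
        return alpha.sum()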

Slide 45: ...what is the "hidden" sequence of states?

Slide 46: ...so with the optimum sequence of states on the training data, the transition and observation probabilities (weights, means, and variances) can be re-trained, repeating until convergence.
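
A minimal sketch of the Viterbi recursion, one standard way to obtain such an optimum state sequence (using Viterbi alignment for this step is an assumption; variable names are illustrative):

    import numpy as np

    def viterbi_path(A, B, pi):
        """Most likely state sequence for transitions A (N,N), likelihoods B (T,N), initial pi (N,)."""
        T, N = B.shape
        with np.errstate(divide="ignore"):      # zero probabilities become -inf
            logA, logB = np.log(A), np.log(B)
            delta = np.log(pi) + logB[0]        # best log-score ending in each state at t=0
        backptr = np.zeros((T, N), dtype=int)
        for t in range(1, T):
            scores = delta[:, None] + logA      # scores[i, j]: best path ending in i, then i -> j
            backptr[t] = np.argmax(scores, axis=0)
            delta = scores[backptr[t], np.arange(N)] + logB[t]
        path = [int(np.argmax(delta))]          # best final state
        for t in range(T - 1, 0, -1):           # trace back through the stored pointers
            path.append(int(backptr[t][path[-1]]))
        return path[::-1]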

Slide 47: HMM training using HRest. (Output: HMMs.)

Slide 48: ...training is usually done using Baum-Welch. Simplifying (perhaps too much), the re-estimation formulas weight observations by their probability of occupying a particular state (the occupation probability), then normalize. (See the HTKBook.)
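
A minimal sketch of that idea for the Gaussian means alone: each frame contributes to a state in proportion to its occupation probability, and the weighted sum is normalized (the full Baum-Welch update also re-estimates variances, mixture weights, and transitions; this fragment is illustrative):

    import numpy as np

    def reestimate_means(O, gamma):
        """O: (T, D) observations; gamma: (T, N) state occupation probabilities gamma_t(j)."""
        weighted_sums = gamma.T @ O                # (N, D): sum_t gamma_t(j) * o_t
        occupancy = gamma.sum(axis=0)[:, None]     # (N, 1): sum_t gamma_t(j)
        return weighted_sums / occupancy           # weighted by occupation, then normalized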

Slide 49: ...why not train all the HMMs together? In our example, in each file we have something like a global HMM.

Slide 50: HMM training using HERest. (Output: HMMs.)

Slide 51: Young et al. state tying.

Slide 52: Decision tree for clustering triphones for tying.

Slide 53: ...and now what? For our example, a two-word recognition problem: each word is represented by an HMM (silence too, but we set that aside). HMM for "si" = λ_si; HMM for "no" = λ_no. Given the observation matrix O = [O1, O2, ..., OT], recognition should be based on the posterior probabilities P(λ_si | O) and P(λ_no | O).

Slide 54: ...and now what? Given the observation matrix O = [O1, O2, ..., OT], recognition should be based on the posteriors P(λ_si | O) and P(λ_no | O), but what we can evaluate is the likelihoods P(O | λ_si) and P(O | λ_no).

Slide 55: Basis: Bayes' theorem. Let's call the word W = λ. Bayes' theorem gives P(W|O) = P(O|W) P(W) / P(O), where P(W) is the a priori probability of pronouncing W (the language model), P(O|W) is the likelihood of generating O given W (the acoustic model), and P(O) is the probability of O. P(W|O) is what we need for the decision; P(O|W) is what we can evaluate.
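
A minimal sketch of the resulting decision rule for the two-word task, in the log domain; since P(O) is the same for every candidate word, it can be dropped from the comparison (the function and dictionary names are illustrative stand-ins for the HMM likelihoods and the prior):

    import math

    def recognize(log_likelihoods, log_priors):
        """Pick argmax over words of P(O|W) * P(W); P(O) cancels out."""
        return max(log_likelihoods, key=lambda w: log_likelihoods[w] + log_priors[w])

    # e.g. recognize({"si": -1520.3, "no": -1534.8},
    #                {"si": math.log(0.5), "no": math.log(0.5)})  -> "si"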

Slide 56: Language models. Deterministic: a grammar. Stochastic: N-grams.

Slide 57: Statistical language modeling. Goal: compute the probability of a sentence or sequence of words: P(W) = P(w1, w2, w3, w4, w5, ..., wn). Related task: the probability of an upcoming word, P(w5 | w1, w2, w3, w4). A model that computes either of these, P(W) or P(wn | w1, w2, ..., wn-1), is called a language model.

Slide 58: How to compute P(W). How do we compute this joint probability, e.g. P(its, water, is, so, transparent, that)? Intuition: let's rely on the chain rule of probability.

Slide 59: Reminder: the chain rule. Recall the definition of conditional probability: P(B|A) = P(A,B) / P(A). Rewriting: P(A,B) = P(A) P(B|A). With more variables: P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C). The chain rule in general: P(x1, x2, x3, ..., xn) = P(x1) P(x2|x1) P(x3|x1,x2) ... P(xn|x1,...,xn-1).

Slide 60: The chain rule applied to compute the joint probability of the words in a sentence. P("its water is so transparent") = P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so).

Slide 61: How to estimate these probabilities. Could we just count and divide? No! There are too many possible sentences; we'll never see enough data to estimate these.

Slide 62: Markov assumption. Simplifying assumption: P(the | its water is so transparent that) ≈ P(the | that). Or maybe: P(the | its water is so transparent that) ≈ P(the | transparent that). (Andrei Markov.)

Slide 63: Markov assumption. In other words, we approximate each component in the product: P(wi | w1, ..., wi-1) ≈ P(wi | wi-k, ..., wi-1).

Slide 64: Simplest case: the unigram model. Some automatically generated sentences from a unigram model: "fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass"; "thrift, did, eighty, said, hard, 'm, july, bullish"; "that, or, limited, the".

Slide 65: Condition on the previous word: the bigram model. Some automatically generated sentences from a bigram model: "texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen"; "outside, new, car, parking, lot, of, the, agreement, reached"; "this, would, be, a, record, november".

Slide 66: N-gram models. We can extend to trigrams, 4-grams, 5-grams. In general this is an insufficient model of language, because language has long-distance dependencies: "The computer which I had just put into the machine room on the fifth floor crashed." But we can often get away with N-gram models.

Slide 67: Estimating bigram probabilities. The maximum likelihood estimate: P(wi | wi-1) = count(wi-1, wi) / count(wi-1).

Slide 68: An example. Training corpus: "I am Sam"; "Sam I am"; "I do not like green eggs and ham".
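
A minimal Python sketch of computing those maximum likelihood bigram estimates from the three-sentence corpus above (the <s> and </s> sentence-boundary markers are a common convention and an assumption here):

    from collections import Counter

    corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]

    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words[:-1])                    # history counts
        bigrams.update(zip(words[:-1], words[1:]))

    def p_bigram(w, prev):
        """MLE: P(w | prev) = count(prev, w) / count(prev)."""
        return bigrams[(prev, w)] / unigrams[prev]

    print(p_bigram("I", "<s>"))    # 2/3: two of the three sentences start with "I"
    print(p_bigram("Sam", "am"))   # 1/2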

Slide 69: The recognition engine.

