1
CS 60050: Natural Language Processing. Speech Recognition and Synthesis - I. Presented by: Pratyush Banerjee, Dept. of Computer Science and Engg., IIT Kharagpur. Spring 2006
2
2 Outline for Today: Speech Recognition Architectural Overview; Hidden Markov Models in general; Forward; Viterbi Decoding; HMMs for speech: structure
3
3 LVCSR: Large Vocabulary Continuous Speech Recognition. ~20,000-64,000 words. Speaker-independent (vs. speaker-dependent). Continuous speech (vs. isolated-word).
4
4 LVCSR Design Intuition: Build a statistical model of the speech-to-words process. Collect lots and lots of speech, and transcribe all the words. Train the model on the labeled speech. Paradigm: Supervised Machine Learning + Search.
5
5 Speech Recognition Architecture
6
6 The Noisy Channel Model Search through space of all possible sentences. Pick the one that is most probable given the waveform.
7
7 The Noisy Channel Model (II): What is the most likely sentence out of all sentences in the language L given some acoustic input O? Treat the acoustic input O as a sequence of individual observations O = o_1, o_2, o_3, ..., o_t. Define a sentence as a sequence of words: W = w_1, w_2, w_3, ..., w_n.
8
8 Noisy Channel Model (III): Probabilistic implication: pick the highest-probability sentence W: Ŵ = argmax_{W in L} P(W|O). We can use Bayes' rule to rewrite this: Ŵ = argmax_{W in L} P(O|W) P(W) / P(O). Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax: Ŵ = argmax_{W in L} P(O|W) P(W).
9
9 Noisy channel model: Ŵ = argmax_W P(O|W) P(W), where P(O|W) is the likelihood (acoustic model) and P(W) is the prior (language model).
10
10 The noisy channel model Ignoring the denominator leaves us with two factors: P(Source) and P(Signal|Source)
11
11 Speech Architecture meets Noisy Channel
12
12 Architecture: Five easy pieces: Feature extraction; Acoustic Modeling; Language Modeling; HMMs, Lexicons, and Pronunciation; Decoding
13
13 Feature Extraction: Digitize Speech; Extract Frames
14
14 Digitizing Speech
15
15 Digitizing Speech: Sampling: measuring the amplitude of the signal at time t. Microphone ("wideband"): 16,000 Hz (samples/sec); telephone: 8,000 Hz (samples/sec). Why? We need at least 2 samples per cycle, so the maximum measurable frequency is half the sampling rate. Human speech is below 10,000 Hz, so at most 20 kHz is needed; telephone speech is filtered at 4 kHz, so 8 kHz is enough.
16
16 Digitizing Speech (II): Quantization: representing the real value of each amplitude as an integer, either 8-bit (-128 to 127) or 16-bit (-32768 to 32767). Formats: 16-bit PCM; 8-bit mu-law (log compression). Byte order: LSB first (Intel) vs. MSB first (Sun, Apple). Headers: raw (no header); Microsoft wav; Sun .au (40-byte header).
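To make the quantization step concrete, here is a minimal sketch (my own illustration, not from the slides) of 16-bit linear PCM quantization and mu-law log compression in numpy; mu = 255 is the value conventionally used for 8-bit telephone speech.

```python
import numpy as np

MU = 255.0  # companding parameter conventionally used for 8-bit mu-law telephone speech

def pcm16(x):
    """Quantize samples scaled to [-1, 1] into 16-bit signed integers (linear PCM)."""
    return np.clip(np.round(x * 32767), -32768, 32767).astype(np.int16)

def mu_law(x, mu=MU):
    """Log-compress samples in [-1, 1]; small amplitudes get relatively finer resolution."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

samples = np.array([-0.5, 0.0, 0.01, 0.9])
print(pcm16(samples))    # 16-bit integer codes
print(mu_law(samples))   # compressed values, still in [-1, 1], ready for 8-bit quantization
```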
17
17 Frame Extraction: a frame (25 ms wide) is extracted every 10 ms, giving a sequence of overlapping frames a_1, a_2, a_3, ...
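A minimal sketch of that windowing step (illustrative, not the slides' own code), assuming a 16 kHz sampling rate:

```python
import numpy as np

def extract_frames(signal, sample_rate=16000, width_ms=25, shift_ms=10):
    """Slice a digitized signal into overlapping frames: 25 ms wide, one every 10 ms."""
    width = int(sample_rate * width_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)   # 160 samples at 16 kHz
    starts = range(0, len(signal) - width + 1, shift)
    return np.array([signal[s:s + width] for s in starts])

one_second = np.random.randn(16000)              # stand-in for one second of speech
print(extract_frames(one_second).shape)          # (98, 400): roughly 100 frames of 400 samples
```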
18
18 MFCC (Mel Frequency Cepstral Coefficients): Do an FFT to get spectral information (like the spectrogram/spectrum we saw earlier). Apply Mel scaling: linear below 1 kHz, log above, with equal numbers of samples above and below 1 kHz; this models the human ear, which is more sensitive at lower frequencies. Then apply the Discrete Cosine Transformation.
19
19 Final Feature Vector: 39 features per 10 ms frame: 12 MFCC features, 12 delta MFCC features, 12 delta-delta MFCC features, 1 (log) frame energy, 1 delta (log) frame energy, 1 delta-delta (log) frame energy. So each frame is represented by a 39-dimensional vector. Back to Architecture Slide
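A hedged sketch of how such a 39-dimensional vector might be assembled, assuming the librosa library is available; using c0 as a stand-in for the log frame energy is my simplification, and "utterance.wav" is a hypothetical input file.

```python
import numpy as np
import librosa  # assumed available; any MFCC implementation would do

y, sr = librosa.load("utterance.wav", sr=16000)           # hypothetical input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)    # 25 ms window, 10 ms shift
c0, cepstra = mfcc[:1], mfcc[1:]       # c0 used here as a stand-in for log frame energy
delta  = librosa.feature.delta(mfcc)            # first-order differences
delta2 = librosa.feature.delta(mfcc, order=2)   # second-order differences
features = np.vstack([cepstra, delta[1:], delta2[1:],    # 12 + 12 + 12 cepstral features
                      c0, delta[:1], delta2[:1]])        # + 1 + 1 + 1 energy terms
print(features.shape)   # (39, num_frames)
```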
20
20 Acoustic Modeling: models the phones (CI or CD, i.e. context-independent or context-dependent) or words using HMMs. This module takes as input the feature vectors extracted from the speech signal in the previous phase and outputs a sequence of phones corresponding to the input utterance.
21
21 Acoustic Modeling - A bit of history: DARPA program (1971-76): input separated into phonemes using heuristics; strings of phonemes replaced with word candidates; sequences of words scored by heuristics; lots of hand-written rules. IBM research on ASR (1972-84): the idea of the HMM; the idea of eliminating hard decisions about phones, using frame-based, soft decisions instead; the idea of capturing all language information by simple bigram/trigram sequences rather than hand-constructed grammars.
22
22 HMMs for speech
23
23 But phones aren’t homogeneous
24
24 So we’ll need to break phones into sub-phones
25
25 Now a word looks like this:
26
26 Acoustic Modeling - Techniques: Input feature vectors are real-valued; output phones are discrete (the phone alphabet has a finite number of symbols). How do we map from one to the other? Vector Quantization + Baum-Welch, or Gaussian pdfs + Baum-Welch (univariate, multivariate, or mixture models). Use Viterbi search to find the output sequence of phones.
27
27 Vector Quantization - Idea: Make MFCC vectors look like symbols that we can count, by building a mapping function that maps each input vector into one of a small number of symbols; then compute probabilities just by counting. Not used for ASR any more; too simple. But it is useful to consider as a starting point.
28
28 Vector Quantization - Method: Create a training set of feature vectors. Cluster them into a small number of classes. Represent each class by a discrete symbol. For each class v_k, we can compute the probability that it is generated by a given HMM state using Baum-Welch.
29
29 Vector Quantization: We'll define a codebook, which lists a prototype vector, or codeword, for each symbol. If we had 256 classes ('8-bit VQ'), the codebook would have 256 prototype vectors. Given an incoming feature vector, we compare it to each of the 256 prototype vectors, pick whichever one is closest (by some distance metric), and replace the input vector by the index of this prototype vector.
30
30 Vector Quantization
31
31 Vector Quantization - Summary: To compute p(o_t|q_j): compute the distance between the feature vector o_t and each codeword (prototype vector) in a preclustered codebook; choose the codeword v_k of the vector that is closest to o_t; then look up the likelihood of v_k given HMM state j in the B matrix: b_j(o_t) = b_j(v_k) s.t. v_k is the codeword of the closest prototype to o_t. The B matrix is trained using Baum-Welch.
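A minimal sketch of that lookup, with a randomly generated codebook and B matrix standing in for ones trained on real data; the 8-bit codebook size and 39-dimensional vectors follow the slides, everything else is illustrative.

```python
import numpy as np

def quantize(o_t, codebook):
    """Return the index v_k of the codeword closest to o_t (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(codebook - o_t, axis=1)))

rng = np.random.default_rng(0)
codebook = rng.standard_normal((256, 39))   # 256 prototype vectors ("8-bit VQ")
B = rng.random((10, 256))                   # discrete output probabilities b_j(v_k)
B /= B.sum(axis=1, keepdims=True)           # each state's distribution sums to 1

o_t = rng.standard_normal(39)               # one incoming 39-dimensional feature vector
v_k = quantize(o_t, codebook)
b_j_ot = B[3, v_k]                          # b_j(o_t) = b_j(v_k) for (hypothetical) state j = 3
```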
32
32 Directly Modeling Continuous Observations: VQ is insufficient for real ASR. Instead, assume the possible values of the observation feature vector o_t are normally distributed, and represent the observation likelihood function b_j(o_t) as a Gaussian with mean μ_j and variance σ_j².
33
33 But we're not there yet: a single Gaussian may do a bad job of modeling the distribution in any dimension. Solution: mixtures of Gaussians. Back to Architecture Slide
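A sketch of a Gaussian-mixture observation likelihood b_j(o_t), assuming scipy is available; the two-component mixture and its parameters are made up for illustration, since in a real recognizer they come out of Baum-Welch training.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_likelihood(o_t, weights, means, covs):
    """b_j(o_t) for one HMM state modeled as a weighted mixture of Gaussians."""
    return sum(w * multivariate_normal.pdf(o_t, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

D = 39                                  # dimensionality of the feature vector
weights = [0.6, 0.4]                    # mixture weights sum to 1
means = [np.zeros(D), np.ones(D)]
covs = [np.eye(D), 2.0 * np.eye(D)]     # full covariance matrices (often diagonal in practice)

o_t = np.zeros(D)
print(gmm_likelihood(o_t, weights, means, covs))
```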
34
34 Language Modeling: The noisy channel model expects P(W), the probability of the sentence. The model that computes P(W) is called the language model. A better term for this would be "The Grammar", but "language model" or LM is standard.
35
35 Computing P(W): How do we compute the joint probability P("the", "other", "day", "I", "was", "walking", "along", "and", "saw", "a", "lizard")? Intuition: let's rely on the Chain Rule of Probability.
36
36 The Chain Rule: Recall the definition of conditional probability: P(A|B) = P(A,B)/P(B). Rewriting: P(A,B) = P(A|B) P(B). In general, P(x_1, x_2, x_3, ..., x_n) = P(x_1) P(x_2|x_1) P(x_3|x_1,x_2) ... P(x_n|x_1...x_{n-1}).
37
37 The Chain Rule Applied to joint probability of words in sentence P(“the big red dog was”)= P(the)*P(big|the)*P(red|the big)*P(dog|the big red)*P(was|the big red dog)
38
38 Unfortunately: Chomsky's dictum: "Language is creative." We'll never be able to get enough data to compute the statistics for those long prefixes: P(lizard|the,other,day,I,was,walking,along,and,saw,a)
39
39 Markov Assumption: Make the simplifying assumption P(lizard|the,other,day,I,was,walking,along,and,saw,a) = P(lizard|a), or maybe P(lizard|the,other,day,I,was,walking,along,and,saw,a) = P(lizard|saw,a)
40
40 Markov Assumption: So for each component in the product, replace it with the approximation (assuming a prefix of N): P(w_i | w_1 ... w_{i-1}) ≈ P(w_i | w_{i-N+1} ... w_{i-1}). Bigram version: P(w_i | w_1 ... w_{i-1}) ≈ P(w_i | w_{i-1}).
41
41 Estimating bigram probabilities: The Maximum Likelihood Estimate: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).
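A minimal sketch of that estimate from counts, using two of the Berkeley Restaurant Project sentences quoted on the next slide as a toy corpus; the <s> and </s> sentence-boundary tokens follow standard practice.

```python
from collections import Counter

def bigram_mle(sentences):
    """MLE bigram probabilities: P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(words[:-1])                 # histories (everything but </s>)
        bigrams.update(zip(words[:-1], words[1:]))  # consecutive word pairs
    return {pair: c / unigrams[pair[0]] for pair, c in bigrams.items()}

probs = bigram_mle(["tell me about chez panisse",
                    "can you tell me about any good cantonese restaurants close by"])
print(probs[("tell", "me")])   # 1.0 in this tiny two-sentence corpus
```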
42
42 Examples: Berkeley Restaurant Project sentences: can you tell me about any good cantonese restaurants close by; mid priced thai food is what i'm looking for; tell me about chez panisse; can you give me a listing of the kinds of food that are available; i'm looking for a good place to eat breakfast; when is caffe venezia open during the day
43
43 Raw bigram counts Out of 9222 sentences
44
44 Raw bigram probabilities Normalize by unigrams: Result:
45
45 Bigram estimates of sentence probabilities: P(<s> I want english food </s>) = P(i|<s>) x P(want|i) x P(english|want) x P(food|english) x P(</s>|food) = .000031
46
46 What kinds of knowledge? P(english|want) = .0011; P(chinese|want) = .0065; P(to|want) = .66; P(eat|to) = .28; P(food|to) = 0; P(want|spend) = 0; P(i|<s>) = .25
47
47 Shakespeare as corpus: N = 884,647 tokens, V = 29,066. Shakespeare produced 300,000 bigram types out of V^2 = 844 million possible bigrams: so 99.96% of the possible bigrams were never seen (have zero entries in the table). Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare.
48
48 Lesson 1: the perils of overfitting N-grams only work well for word prediction if the test corpus looks like the training corpus In real life, it often doesn’t We need to train robust models, adapt to test set, etc
49
49 Lesson 2: zeros or not? Zipf's Law: a small number of events occur with high frequency; a large number of events occur with low frequency. You can quickly collect statistics on the high-frequency events, but you might have to wait an arbitrarily long time to get valid statistics on low-frequency events. Result: our estimates are sparse! We have no counts at all for the vast bulk of things we want to estimate. Some of the zeros in the table are really zeros, but others are simply low-frequency events you haven't seen yet. After all, ANYTHING CAN HAPPEN! How to address this? Answer: estimate the likelihood of unseen N-grams!
50
50 Smoothing is like Robin Hood
51
51 Smoothing Techniques: Add-one or Laplace smoothing. Add-one smoothing is not used for N-grams, as we have much better methods. Despite its flaws it is still used to smooth other probabilistic models in NLP, especially for pilot studies in domains where the number of zeros isn't so huge.
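A small sketch of add-one smoothing for bigrams; the counts below are illustrative stand-ins in the style of the Berkeley Restaurant Project tables, not values taken from the slides.

```python
def add_one_bigram(bigram_counts, unigram_counts, vocab_size):
    """Add-one (Laplace) estimate: P*(w_i | w_{i-1}) = (C(w_{i-1} w_i) + 1) / (C(w_{i-1}) + V)."""
    def prob(w1, w2):
        return (bigram_counts.get((w1, w2), 0) + 1) / (unigram_counts.get(w1, 0) + vocab_size)
    return prob

p = add_one_bigram({("want", "to"): 608}, {"want": 927}, vocab_size=1446)
print(p("want", "to"))      # the large observed count is discounted a little
print(p("want", "spend"))   # the zero count gets a small non-zero probability
```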
52
52 Better discounting algorithms: The intuition used by many smoothing algorithms (Good-Turing, Kneser-Ney, Witten-Bell) is to use the count of things we've seen once to help estimate the count of things we've never seen.
53
53 Backoff and Interpolation: Another really useful source of knowledge: if we are estimating the trigram p(z|x,y) but c(xyz) is zero, use info from the bigram p(z|y), or even the unigram p(z). How do we combine the trigram/bigram/unigram info?
54
54 Backoff versus interpolation Backoff: use trigram if you have it, otherwise bigram, otherwise unigram Interpolation: mix all three
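A sketch of the interpolation option (not the slides' own code); the lambda weights here are placeholders and would normally be tuned on held-out data.

```python
def interpolated_trigram(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """P_hat(z | x, y) = l3*P(z|x,y) + l2*P(z|y) + l1*P(z), with l1 + l2 + l3 = 1."""
    l3, l2, l1 = lambdas
    def prob(z, x, y):
        return l3 * p_tri(z, x, y) + l2 * p_bi(z, y) + l1 * p_uni(z)
    return prob

# Toy component models standing in for real trigram/bigram/unigram estimates:
p = interpolated_trigram(lambda z, x, y: 0.0, lambda z, y: 0.1, lambda z: 0.01)
print(p("lizard", "saw", "a"))   # non-zero even though the trigram estimate is 0
```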
55
55 OOV words: Out Of Vocabulary (OOV) words. We don't use Good-Turing smoothing for these, because GT assumes we know the number of unseen events. Instead: create an unknown word token <UNK>. Training of probabilities: create a fixed lexicon L of size V; at the text normalization phase, any training word not in L is changed to <UNK>; now we train its probabilities like a normal word. At decoding time, if the input is text, use the <UNK> probabilities for any word not in training.
56
56 Practical Issues: We do everything in log space, to avoid underflow (also, adding is faster than multiplying).
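For example, with illustrative component probabilities, we sum logs rather than multiply raw probabilities:

```python
import math

component_probs = [0.25, 0.0011, 0.5, 0.68]          # illustrative bigram probabilities
log_prob = sum(math.log(p) for p in component_probs)
print(log_prob)             # compare candidate sentences directly in log space
print(math.exp(log_prob))   # exponentiate only if the raw probability is actually needed
```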
57
57 Language Modeling Toolkits: SRILM: http://www.speech.sri.com/projects/srilm/download.html ; CMU-Cambridge LM Toolkit: http://htk.eng.cam.ac.uk/links/asr_tool.shtml
58
58 Evaluating N-gram models: The best evaluation for an N-gram model: put model A in a speech recognizer, run recognition, and get the WER for A; put model B in the speech recognizer and get the WER for B; compare the WERs for A and B. But this is really time-consuming; it can take days to run an experiment.
59
59 Evaluating N-gram models: So, as a temporary solution that lets us run experiments, we often evaluate N-grams with a (poor) approximation called perplexity. But perplexity is a poor approximation unless the test data looks just like the training data, so it is generally only useful in pilot experiments (and generally not sufficient to publish).
60
60 Perplexity: Perplexity is technically "the exp of the cross-entropy of the model on the test data". If we have a model M = P(W), the perplexity of a test set W = w_1 w_2 ... w_N is PP(W) = P(w_1 w_2 ... w_N)^(-1/N).
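A minimal sketch of that definition (my own illustration): perplexity as the inverse probability of the test set, normalized by the number of words.

```python
import math

def perplexity(log_probs):
    """PP(W) = P(w_1 ... w_N)^(-1/N) = exp(-(1/N) * sum_i log P(w_i | history))."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that assigns every test word probability 1/100 has perplexity 100.
print(perplexity([math.log(1 / 100)] * 50))   # 100.0 (up to floating point)
```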
61
61 Perplexity: example: Minimizing perplexity = maximizing the probability of the test set according to the model. Example: perplexity of N-grams with V = 20K, training set = 38M words, test set = 1.5M words.
62
62 Advanced LM stuff: Current best smoothing algorithm: Kneser-Ney smoothing. Other stuff: variable-length n-grams; class-based n-grams (clustering, hand-built classes); cache LMs; topic-based LMs; sentence mixture models; skipping LMs; parser-based LMs. Back to Architecture Slide
63
63 Hidden Markov Models: a set of states Q = q_1, q_2, ..., q_N; the state at time t is q_t. A transition probability matrix A = {a_ij}. An output probability matrix B = {b_i(k)}. A special initial probability vector π. Constraints: Σ_j a_ij = 1 for each i; Σ_k b_i(k) = 1 for each i; Σ_i π_i = 1.
64
64 Assumptions: Markov assumption: P(q_t | q_1 ... q_{t-1}) = P(q_t | q_{t-1}). Output-independence assumption: P(o_t | o_1 ... o_{t-1}, q_1 ... q_t) = P(o_t | q_t).
65
65 HMM for Dow Jones
66
66 The Three Basic Problems for HMMs: Problem 1 (Evaluation): given the observation sequence O = (o_1 o_2 ... o_T) and an HMM model λ = (A, B, π), how do we efficiently compute P(O|λ), the probability of the observation sequence given the model? Problem 2 (Decoding): given the observation sequence O = (o_1 o_2 ... o_T) and an HMM model λ = (A, B, π), how do we choose a corresponding state sequence Q = (q_1 q_2 ... q_T) that is optimal in some sense (i.e., best explains the observations)? Problem 3 (Learning): how do we adjust the model parameters λ = (A, B, π) to maximize P(O|λ)?
67
67 The Evaluation Problem: Given an observation sequence O and an HMM λ, compute P(O|λ). Why is this hard? We have to sum over all possible sequences of states (the slide shows a trellis of states q0, q1, q2 over observations o1 ... oT): P(o1 o2 o3 | q0 q0 q0) + P(o1 o2 o3 | q0 q0 q1) + P(o1 o2 o3 | q0 q1 q2) + P(o1 o2 o3 | q0 q1 q0) + ...
68
68 Computing the observation likelihood P(O|λ): Why can't we do an explicit sum over all paths? Because it's intractable: O(N^T). What we do instead: the Forward Algorithm, which is O(N^2 T).
69
69 The Forward Algorithm
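Since the algorithm itself appears here only as a figure, the following is a minimal numpy sketch of the forward recursion; the three-state transition, output, and initial probabilities are made-up toy numbers, not the Dow Jones values from the earlier slide.

```python
import numpy as np

# Toy HMM parameters (made up for illustration): N = 3 states, M = 3 output symbols.
A  = np.array([[0.6, 0.2, 0.2],      # transition probabilities a_ij
               [0.5, 0.3, 0.2],
               [0.4, 0.1, 0.5]])
B  = np.array([[0.7, 0.1, 0.2],      # output probabilities b_i(k)
               [0.1, 0.6, 0.3],
               [0.3, 0.3, 0.4]])
pi = np.array([0.5, 0.2, 0.3])       # initial state probabilities

def forward(A, B, pi, obs):
    """P(O | lambda) in O(N^2 T) time instead of the O(N^T) explicit sum over paths."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]   # induction: sum over previous states
    return alpha[-1].sum()                             # termination

print(forward(A, B, pi, [0, 2, 1, 1]))   # P(O | lambda) for a short observation sequence
```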
70
70 The Decoding Problem: Given observations O = (o_1 o_2 ... o_T) and an HMM λ = (A, B, π), how do we choose the best state sequence Q = (q_1, q_2, ..., q_T)? The forward algorithm computes P(O|W). We could find the best W by running the forward algorithm for each W in L and picking the W maximizing P(O|W), but we can't do this, since the number of sentences is O(W^T). Instead: Viterbi decoding: dynamic programming, a slight modification of the forward algorithm. A* decoding: search the space of all possible sentences using the forward algorithm as a subroutine.
71
71 The Viterbi Algorithm
72
72 The Viterbi Algorithm
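Again, since the slides show the algorithm only as figures, here is a minimal numpy sketch of Viterbi decoding: the same recursion as the forward algorithm, with max and backpointers in place of the sum. It can be run with the toy A, B, pi matrices from the forward sketch above, e.g. viterbi(A, B, pi, [0, 2, 1, 1]).

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Return the most likely state sequence for obs and its (joint) probability."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))            # best path score ending in state j at time t
    psi = np.zeros((T, N), dtype=int)   # backpointer: best previous state
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A        # scores[i, j] = delta[t-1, i] * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]              # best final state
    for t in range(T - 1, 0, -1):                 # follow backpointers to the start
        path.append(int(psi[t, path[-1]]))
    return list(reversed(path)), float(delta[-1].max())
```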
73
73 The Learning Problem: Baum-Welch: Baum-Welch = the Forward-Backward Algorithm (Baum 1972). It is a special case of the EM (Expectation-Maximization) algorithm (Dempster, Laird, Rubin). The algorithm lets us train the transition probabilities A = {a_ij} and the emission probabilities B = {b_i(o_t)} of the HMM.
74
74 Summary: Baum-Welch Algorithm: 1) Initialize λ = (A, B, π). 2) Compute the forward and backward probabilities α and β and the expected counts γ and ξ. 3) Estimate a new λ' = (A, B, π). 4) Replace λ with λ'. 5) If not converged, go to 2. Back to Acoustic Modeling
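A compact sketch of those five steps for a discrete-output HMM and a single observation sequence; this is my own unscaled illustration, so a production implementation would work in log space or with scaling to avoid underflow and would test convergence of P(O|λ) rather than running a fixed number of iterations.

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """Unscaled forward (alpha) and backward (beta) probabilities."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta

def baum_welch(obs, N, M, n_iter=20, seed=0):
    """Re-estimate lambda = (A, B, pi) from one symbol sequence obs with values in 0..M-1."""
    obs = np.asarray(obs)
    rng = np.random.default_rng(seed)
    A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)   # 1) initialize lambda
    B = rng.random((N, M)); B /= B.sum(axis=1, keepdims=True)
    pi = np.full(N, 1.0 / N)
    T = len(obs)
    for _ in range(n_iter):
        alpha, beta = forward_backward(A, B, pi, obs)           # 2) compute alpha, beta
        likelihood = alpha[-1].sum()                            #    P(O | lambda)
        gamma = alpha * beta / likelihood                       #    P(q_t = i | O, lambda)
        xi = np.zeros((T - 1, N, N))                            #    P(q_t = i, q_{t+1} = j | O)
        for t in range(T - 1):
            xi[t] = (alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]) / likelihood
        pi = gamma[0]                                           # 3) estimate new lambda'
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(M):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]                         # 4) replace lambda with lambda'
    return A, B, pi                                             # 5) stop after n_iter iterations

A, B, pi = baum_welch([0, 1, 1, 2, 0, 1, 2, 2, 1, 0], N=2, M=3)
```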