1
CS 60050: Natural Language Processing. Speech Recognition and Synthesis - I. Presented by: Pratyush Banerjee, Dept. of Computer Science and Engg., IIT Kharagpur. Spring 2006
2
2 Outline for Today: Speech Recognition Architectural Overview; Hidden Markov Models in general; Forward; Viterbi Decoding; HMMs for speech: structure
3
3 LVCSR: Large Vocabulary Continuous Speech Recognition. ~20,000-64,000 words. Speaker-independent (vs. speaker-dependent). Continuous speech (vs. isolated-word).
4
4 LVCSR Design Intuition: Build a statistical model of the speech-to-words process. Collect lots and lots of speech, and transcribe all the words. Train the model on the labeled speech. Paradigm: Supervised Machine Learning + Search.
5
5 Speech Recognition Architecture
6
6 The Noisy Channel Model Search through space of all possible sentences. Pick the one that is most probable given the waveform.
7
7 The Noisy Channel Model (II): What is the most likely sentence out of all sentences in the language L given some acoustic input O? Treat the acoustic input O as a sequence of individual observations O = o_1, o_2, o_3, ..., o_t. Define a sentence as a sequence of words: W = w_1, w_2, w_3, ..., w_n.
8
8 Noisy Channel Model (III): Probabilistic implication: pick the highest-probability sentence W: Ŵ = argmax_{W in L} P(W|O). We can use Bayes' rule to rewrite this: Ŵ = argmax_{W in L} P(O|W) P(W) / P(O). Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax: Ŵ = argmax_{W in L} P(O|W) P(W).
9
9 Noisy channel model: Ŵ = argmax_W P(O|W) P(W), where P(O|W) is the likelihood (acoustic model) and P(W) is the prior (language model).
10
10 The noisy channel model Ignoring the denominator leaves us with two factors: P(Source) and P(Signal|Source)
11
11 Speech Architecture meets Noisy Channel
12
12 Architecture: Five easy pieces: Feature extraction; Acoustic Modeling; Language Modeling; HMMs, Lexicons, and Pronunciation; Decoding
13
13 Feature Extraction: Digitize Speech; Extract Frames
14
14 Digitizing Speech
15
15 Digitizing Speech: Sampling: measuring the amplitude of the signal at time t. Microphone ("wideband"): 16,000 Hz (samples/sec); telephone: 8,000 Hz (samples/sec). Why? We need at least 2 samples per cycle, so the maximum measurable frequency is half the sampling rate. Human speech is below 10,000 Hz, so at most 20 kHz is needed; telephone speech is filtered at 4 kHz, so 8 kHz is enough.
16
16 Digitizing Speech (II): Quantization: representing the real value of each amplitude as an integer, either 8-bit (-128 to 127) or 16-bit (-32768 to 32767). Formats: 16-bit PCM; 8-bit mu-law (log compression). Byte order: LSB first (Intel) vs. MSB first (Sun, Apple). Headers: raw (no header); Microsoft wav; Sun .au (40-byte header).
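To make the quantization step concrete, here is a minimal sketch (my own illustration, not from the slides) of 16-bit linear PCM quantization and mu-law log compression in numpy; mu = 255 is the value conventionally used for 8-bit telephone speech.

```python
import numpy as np

MU = 255.0  # companding parameter conventionally used for 8-bit mu-law telephone speech

def pcm16(x):
    """Quantize samples scaled to [-1, 1] into 16-bit signed integers (linear PCM)."""
    return np.clip(np.round(x * 32767), -32768, 32767).astype(np.int16)

def mu_law(x, mu=MU):
    """Log-compress samples in [-1, 1]; small amplitudes get relatively finer resolution."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

samples = np.array([-0.5, 0.0, 0.01, 0.9])
print(pcm16(samples))    # 16-bit integer codes
print(mu_law(samples))   # compressed values, still in [-1, 1], ready for 8-bit quantization
```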
17
17 Frame Extraction: a frame (25 ms wide) is extracted every 10 ms, giving a sequence of overlapping frames a_1, a_2, a_3, ...
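A minimal sketch of that windowing step (illustrative, not the slides' own code), assuming a 16 kHz sampling rate:

```python
import numpy as np

def extract_frames(signal, sample_rate=16000, width_ms=25, shift_ms=10):
    """Slice a digitized signal into overlapping frames: 25 ms wide, one every 10 ms."""
    width = int(sample_rate * width_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)   # 160 samples at 16 kHz
    starts = range(0, len(signal) - width + 1, shift)
    return np.array([signal[s:s + width] for s in starts])

one_second = np.random.randn(16000)              # stand-in for one second of speech
print(extract_frames(one_second).shape)          # (98, 400): roughly 100 frames of 400 samples
```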
18
18 MFCC (Mel Frequency Cepstral Coefficients): Do an FFT to get spectral information (like the spectrogram/spectrum we saw earlier). Apply Mel scaling: linear below 1 kHz, log above, with equal numbers of samples above and below 1 kHz; this models the human ear, which is more sensitive at lower frequencies. Then apply the Discrete Cosine Transformation.
19
19 Final Feature Vector: 39 features per 10 ms frame: 12 MFCC features, 12 delta MFCC features, 12 delta-delta MFCC features, 1 (log) frame energy, 1 delta (log) frame energy, 1 delta-delta (log) frame energy. So each frame is represented by a 39-dimensional vector. Back to Architecture Slide
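A hedged sketch of how such a 39-dimensional vector might be assembled, assuming the librosa library is available; using c0 as a stand-in for the log frame energy is my simplification, and "utterance.wav" is a hypothetical input file.

```python
import numpy as np
import librosa  # assumed available; any MFCC implementation would do

y, sr = librosa.load("utterance.wav", sr=16000)           # hypothetical input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)    # 25 ms window, 10 ms shift
c0, cepstra = mfcc[:1], mfcc[1:]       # c0 used here as a stand-in for log frame energy
delta  = librosa.feature.delta(mfcc)            # first-order differences
delta2 = librosa.feature.delta(mfcc, order=2)   # second-order differences
features = np.vstack([cepstra, delta[1:], delta2[1:],    # 12 + 12 + 12 cepstral features
                      c0, delta[:1], delta2[:1]])        # + 1 + 1 + 1 energy terms
print(features.shape)   # (39, num_frames)
```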
20
20 Acoustic Modeling: models the phones (CI or CD, i.e. context-independent or context-dependent) or words using HMMs. This module takes as input the feature vectors extracted from the speech signal in the previous phase and outputs a sequence of phones corresponding to the input utterance.
21
21 Acoustic Modeling - A bit of history: DARPA program (1971-76): input separated into phonemes using heuristics; strings of phonemes replaced with word candidates; sequences of words scored by heuristics; lots of hand-written rules. IBM research on ASR (1972-84): the idea of the HMM; the idea of eliminating hard decisions about phones, using frame-based, soft decisions instead; the idea of capturing all language information by simple bigram/trigram sequences rather than hand-constructed grammars.
22
22 HMMs for speech
23
23 But phones aren’t homogeneous
24
24 So we’ll need to break phones into sub-phones
25
25 Now a word looks like this:
26
26 Acoustic Modeling - Techniques: Input feature vectors are real-valued; output phones are discrete (the phone alphabet has a finite number of symbols). How do we map from one to the other? Vector Quantization + Baum-Welch, or Gaussian pdfs + Baum-Welch (univariate, multivariate, or mixture models). Use Viterbi search to find the output sequence of phones.
27
27 Vector Quantization - Idea: Make MFCC vectors look like symbols that we can count, by building a mapping function that maps each input vector into one of a small number of symbols; then compute probabilities just by counting. Not used for ASR any more; too simple. But it is useful to consider as a starting point.
28
28 Vector Quantization - Method: Create a training set of feature vectors. Cluster them into a small number of classes. Represent each class by a discrete symbol. For each class v_k, we can compute the probability that it is generated by a given HMM state using Baum-Welch.
29
29 Vector Quantization: We'll define a codebook, which lists a prototype vector, or codeword, for each symbol. If we had 256 classes ('8-bit VQ'), the codebook would have 256 prototype vectors. Given an incoming feature vector, we compare it to each of the 256 prototype vectors, pick whichever one is closest (by some distance metric), and replace the input vector by the index of this prototype vector.
30
30 Vector Quantization
31
31 Vector Quantization - Summary: To compute p(o_t|q_j): compute the distance between the feature vector o_t and each codeword (prototype vector) in a preclustered codebook; choose the codeword v_k of the vector that is closest to o_t; then look up the likelihood of v_k given HMM state j in the B matrix: b_j(o_t) = b_j(v_k) s.t. v_k is the codeword of the closest prototype to o_t. The B matrix is trained using Baum-Welch.
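A minimal sketch of that lookup, with a randomly generated codebook and B matrix standing in for ones trained on real data; the 8-bit codebook size and 39-dimensional vectors follow the slides, everything else is illustrative.

```python
import numpy as np

def quantize(o_t, codebook):
    """Return the index v_k of the codeword closest to o_t (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(codebook - o_t, axis=1)))

rng = np.random.default_rng(0)
codebook = rng.standard_normal((256, 39))   # 256 prototype vectors ("8-bit VQ")
B = rng.random((10, 256))                   # discrete output probabilities b_j(v_k)
B /= B.sum(axis=1, keepdims=True)           # each state's distribution sums to 1

o_t = rng.standard_normal(39)               # one incoming 39-dimensional feature vector
v_k = quantize(o_t, codebook)
b_j_ot = B[3, v_k]                          # b_j(o_t) = b_j(v_k) for (hypothetical) state j = 3
```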
32
32 Directly Modeling Continuous Observations: VQ is insufficient for real ASR. Instead, assume the possible values of the observation feature vector o_t are normally distributed, and represent the observation likelihood function b_j(o_t) as a Gaussian with mean μ_j and variance σ_j².
33
33 But we're not there yet: a single Gaussian may do a bad job of modeling the distribution in any dimension. Solution: mixtures of Gaussians. Back to Architecture Slide
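A sketch of a Gaussian-mixture observation likelihood b_j(o_t), assuming scipy is available; the two-component mixture and its parameters are made up for illustration, since in a real recognizer they come out of Baum-Welch training.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_likelihood(o_t, weights, means, covs):
    """b_j(o_t) for one HMM state modeled as a weighted mixture of Gaussians."""
    return sum(w * multivariate_normal.pdf(o_t, mean=m, cov=c)
               for w, m, c in zip(weights, means, covs))

D = 39                                  # dimensionality of the feature vector
weights = [0.6, 0.4]                    # mixture weights sum to 1
means = [np.zeros(D), np.ones(D)]
covs = [np.eye(D), 2.0 * np.eye(D)]     # full covariance matrices (often diagonal in practice)

o_t = np.zeros(D)
print(gmm_likelihood(o_t, weights, means, covs))
```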
34
34 Language Modeling: The noisy channel model expects P(W), the probability of the sentence. The model that computes P(W) is called the language model. A better term for this would be "The Grammar", but "language model" or LM is standard.
35
35 Computing P(W): How do we compute the joint probability P("the", "other", "day", "I", "was", "walking", "along", "and", "saw", "a", "lizard")? Intuition: let's rely on the Chain Rule of Probability.
36
36 The Chain Rule: Recall the definition of conditional probability: P(A|B) = P(A,B)/P(B). Rewriting: P(A,B) = P(A|B) P(B). In general, P(x_1, x_2, x_3, ..., x_n) = P(x_1) P(x_2|x_1) P(x_3|x_1,x_2) ... P(x_n|x_1...x_{n-1}).
37
37 The Chain Rule Applied to joint probability of words in sentence P(“the big red dog was”)= P(the)*P(big|the)*P(red|the big)*P(dog|the big red)*P(was|the big red dog)
38
38 Unfortunately: Chomsky's dictum: "Language is creative." We'll never be able to get enough data to compute the statistics for those long prefixes: P(lizard|the,other,day,I,was,walking,along,and,saw,a)
39
39 Markov Assumption: Make the simplifying assumption P(lizard|the,other,day,I,was,walking,along,and,saw,a) = P(lizard|a), or maybe P(lizard|the,other,day,I,was,walking,along,and,saw,a) = P(lizard|saw,a)
40
40 Markov Assumption: So for each component in the product, replace it with the approximation (assuming a prefix of N): P(w_i | w_1 ... w_{i-1}) ≈ P(w_i | w_{i-N+1} ... w_{i-1}). Bigram version: P(w_i | w_1 ... w_{i-1}) ≈ P(w_i | w_{i-1}).
41
41 Estimating bigram probabilities: The Maximum Likelihood Estimate: P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).
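A minimal sketch of that estimate from counts, using two of the Berkeley Restaurant Project sentences quoted on the next slide as a toy corpus; the <s> and </s> sentence-boundary tokens follow standard practice.

```python
from collections import Counter

def bigram_mle(sentences):
    """MLE bigram probabilities: P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.lower().split() + ["</s>"]
        unigrams.update(words[:-1])                 # histories (everything but </s>)
        bigrams.update(zip(words[:-1], words[1:]))  # consecutive word pairs
    return {pair: c / unigrams[pair[0]] for pair, c in bigrams.items()}

probs = bigram_mle(["tell me about chez panisse",
                    "can you tell me about any good cantonese restaurants close by"])
print(probs[("tell", "me")])   # 1.0 in this tiny two-sentence corpus
```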
42
42 Examples: Berkeley Restaurant Project sentences: can you tell me about any good cantonese restaurants close by; mid priced thai food is what i'm looking for; tell me about chez panisse; can you give me a listing of the kinds of food that are available; i'm looking for a good place to eat breakfast; when is caffe venezia open during the day
43
43 Raw bigram counts Out of 9222 sentences
44
44 Raw bigram probabilities Normalize by unigrams: Result:
45
45 Bigram estimates of sentence probabilities: P(<s> I want english food </s>) = P(i|<s>) x P(want|i) x P(english|want) x P(food|english) x P(</s>|food) = .000031
46
46 What kinds of knowledge? P(english|want) = .0011; P(chinese|want) = .0065; P(to|want) = .66; P(eat|to) = .28; P(food|to) = 0; P(want|spend) = 0; P(i|<s>) = .25
47
47 Shakespeare as corpus: N = 884,647 tokens, V = 29,066. Shakespeare produced 300,000 bigram types out of V^2 = 844 million possible bigrams: so 99.96% of the possible bigrams were never seen (have zero entries in the table). Quadrigrams are worse: what's coming out looks like Shakespeare because it is Shakespeare.
48
48 Lesson 1: the perils of overfitting N-grams only work well for word prediction if the test corpus looks like the training corpus In real life, it often doesn’t We need to train robust models, adapt to test set, etc
49
49 Lesson 2: zeros or not? Zipf's Law: a small number of events occur with high frequency; a large number of events occur with low frequency. You can quickly collect statistics on the high-frequency events, but you might have to wait an arbitrarily long time to get valid statistics on low-frequency events. Result: our estimates are sparse! We have no counts at all for the vast bulk of things we want to estimate. Some of the zeros in the table are really zeros, but others are simply low-frequency events you haven't seen yet. After all, ANYTHING CAN HAPPEN! How to address this? Answer: estimate the likelihood of unseen N-grams!
50
50 Smoothing is like Robin Hood
51
51 Smoothing Techniques: Add-one or Laplace smoothing. Add-one smoothing is not used for N-grams, as we have much better methods. Despite its flaws it is still used to smooth other probabilistic models in NLP, especially for pilot studies in domains where the number of zeros isn't so huge.
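A small sketch of add-one smoothing for bigrams; the counts below are illustrative stand-ins in the style of the Berkeley Restaurant Project tables, not values taken from the slides.

```python
def add_one_bigram(bigram_counts, unigram_counts, vocab_size):
    """Add-one (Laplace) estimate: P*(w_i | w_{i-1}) = (C(w_{i-1} w_i) + 1) / (C(w_{i-1}) + V)."""
    def prob(w1, w2):
        return (bigram_counts.get((w1, w2), 0) + 1) / (unigram_counts.get(w1, 0) + vocab_size)
    return prob

p = add_one_bigram({("want", "to"): 608}, {"want": 927}, vocab_size=1446)
print(p("want", "to"))      # the large observed count is discounted a little
print(p("want", "spend"))   # the zero count gets a small non-zero probability
```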
52
52 Better discounting algorithms: The intuition used by many smoothing algorithms (Good-Turing, Kneser-Ney, Witten-Bell) is to use the count of things we've seen once to help estimate the count of things we've never seen.
53
53 Backoff and Interpolation: Another really useful source of knowledge: if we are estimating the trigram p(z|x,y) but c(xyz) is zero, use info from the bigram p(z|y), or even the unigram p(z). How do we combine the trigram/bigram/unigram info?
54
54 Backoff versus interpolation Backoff: use trigram if you have it, otherwise bigram, otherwise unigram Interpolation: mix all three
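A sketch of the interpolation option (not the slides' own code); the lambda weights here are placeholders and would normally be tuned on held-out data.

```python
def interpolated_trigram(p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """P_hat(z | x, y) = l3*P(z|x,y) + l2*P(z|y) + l1*P(z), with l1 + l2 + l3 = 1."""
    l3, l2, l1 = lambdas
    def prob(z, x, y):
        return l3 * p_tri(z, x, y) + l2 * p_bi(z, y) + l1 * p_uni(z)
    return prob

# Toy component models standing in for real trigram/bigram/unigram estimates:
p = interpolated_trigram(lambda z, x, y: 0.0, lambda z, y: 0.1, lambda z: 0.01)
print(p("lizard", "saw", "a"))   # non-zero even though the trigram estimate is 0
```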
55
55 OOV words: Out Of Vocabulary (OOV) words. We don't use Good-Turing smoothing for these, because GT assumes we know the number of unseen events. Instead: create an unknown word token <UNK>. Training of probabilities: create a fixed lexicon L of size V; at the text normalization phase, any training word not in L is changed to <UNK>; now we train its probabilities like a normal word. At decoding time, if the input is text, use the <UNK> probabilities for any word not in training.
56
56 Practical Issues: We do everything in log space, to avoid underflow (also, adding is faster than multiplying).
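For example, with illustrative component probabilities, we sum logs rather than multiply raw probabilities:

```python
import math

component_probs = [0.25, 0.0011, 0.5, 0.68]          # illustrative bigram probabilities
log_prob = sum(math.log(p) for p in component_probs)
print(log_prob)             # compare candidate sentences directly in log space
print(math.exp(log_prob))   # exponentiate only if the raw probability is actually needed
```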
57
57 Language Modeling Toolkits: SRILM: http://www.speech.sri.com/projects/srilm/download.html ; CMU-Cambridge LM Toolkit: http://htk.eng.cam.ac.uk/links/asr_tool.shtml
58
58 Evaluating N-gram models: The best evaluation for an N-gram model: put model A in a speech recognizer, run recognition, and get the WER for A; put model B in the speech recognizer and get the WER for B; compare the WERs for A and B. But this is really time-consuming; it can take days to run an experiment.
59
59 Evaluating N-gram models: So, as a temporary solution that lets us run experiments, we often evaluate N-grams with a (poor) approximation called perplexity. But perplexity is a poor approximation unless the test data looks just like the training data, so it is generally only useful in pilot experiments (and generally not sufficient to publish).
60
60 Perplexity: Perplexity is technically "the exp of the cross-entropy of the model on the test data". If we have a model M = P(W), the perplexity of a test set W = w_1 w_2 ... w_N is PP(W) = P(w_1 w_2 ... w_N)^(-1/N).
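A minimal sketch of that definition (my own illustration): perplexity as the inverse probability of the test set, normalized by the number of words.

```python
import math

def perplexity(log_probs):
    """PP(W) = P(w_1 ... w_N)^(-1/N) = exp(-(1/N) * sum_i log P(w_i | history))."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that assigns every test word probability 1/100 has perplexity 100.
print(perplexity([math.log(1 / 100)] * 50))   # 100.0 (up to floating point)
```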
61
61 Perplexity: example: Minimizing perplexity = maximizing the probability of the test set according to the model. Example: perplexity of N-grams with V = 20K, training set = 38M words, test set = 1.5M words.
62
62 Advanced LM stuff: Current best smoothing algorithm: Kneser-Ney smoothing. Other stuff: variable-length n-grams; class-based n-grams (clustering, hand-built classes); cache LMs; topic-based LMs; sentence mixture models; skipping LMs; parser-based LMs. Back to Architecture Slide
63
63 Hidden Markov Models: a set of states Q = q_1, q_2, ..., q_N; the state at time t is q_t. A transition probability matrix A = {a_ij}. An output probability matrix B = {b_i(k)}. A special initial probability vector π. Constraints: Σ_j a_ij = 1 for each i; Σ_k b_i(k) = 1 for each i; Σ_i π_i = 1.
64
64 Assumptions: Markov assumption: P(q_t | q_1 ... q_{t-1}) = P(q_t | q_{t-1}). Output-independence assumption: P(o_t | o_1 ... o_{t-1}, q_1 ... q_t) = P(o_t | q_t).
65
65 HMM for Dow Jones
66
66 The Three Basic Problems for HMMs: Problem 1 (Evaluation): given the observation sequence O = (o_1 o_2 ... o_T) and an HMM model λ = (A, B, π), how do we efficiently compute P(O|λ), the probability of the observation sequence given the model? Problem 2 (Decoding): given the observation sequence O = (o_1 o_2 ... o_T) and an HMM model λ = (A, B, π), how do we choose a corresponding state sequence Q = (q_1 q_2 ... q_T) that is optimal in some sense (i.e., best explains the observations)? Problem 3 (Learning): how do we adjust the model parameters λ = (A, B, π) to maximize P(O|λ)?
67
67 The Evaluation Problem: Given an observation sequence O and an HMM λ, compute P(O|λ). Why is this hard? We have to sum over all possible sequences of states (the slide shows a trellis of states q0, q1, q2 over observations o1 ... oT): P(o1 o2 o3 | q0 q0 q0) + P(o1 o2 o3 | q0 q0 q1) + P(o1 o2 o3 | q0 q1 q2) + P(o1 o2 o3 | q0 q1 q0) + ...
68
68 Computing the observation likelihood P(O|λ): Why can't we do an explicit sum over all paths? Because it's intractable: O(N^T). What we do instead: the Forward Algorithm, which is O(N^2 T).
69
69 The Forward Algorithm
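Since the algorithm itself appears here only as a figure, the following is a minimal numpy sketch of the forward recursion; the three-state transition, output, and initial probabilities are made-up toy numbers, not the Dow Jones values from the earlier slide.

```python
import numpy as np

# Toy HMM parameters (made up for illustration): N = 3 states, M = 3 output symbols.
A  = np.array([[0.6, 0.2, 0.2],      # transition probabilities a_ij
               [0.5, 0.3, 0.2],
               [0.4, 0.1, 0.5]])
B  = np.array([[0.7, 0.1, 0.2],      # output probabilities b_i(k)
               [0.1, 0.6, 0.3],
               [0.3, 0.3, 0.4]])
pi = np.array([0.5, 0.2, 0.3])       # initial state probabilities

def forward(A, B, pi, obs):
    """P(O | lambda) in O(N^2 T) time instead of the O(N^T) explicit sum over paths."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                       # initialization
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]   # induction: sum over previous states
    return alpha[-1].sum()                             # termination

print(forward(A, B, pi, [0, 2, 1, 1]))   # P(O | lambda) for a short observation sequence
```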
70
70 The Decoding Problem: Given observations O = (o_1 o_2 ... o_T) and an HMM λ = (A, B, π), how do we choose the best state sequence Q = (q_1, q_2, ..., q_T)? The forward algorithm computes P(O|W). We could find the best W by running the forward algorithm for each W in L and picking the W maximizing P(O|W), but we can't do this, since the number of sentences is O(W^T). Instead: Viterbi decoding: dynamic programming, a slight modification of the forward algorithm. A* decoding: search the space of all possible sentences using the forward algorithm as a subroutine.
71
71 The Viterbi Algorithm
72
72 The Viterbi Algorithm
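Again, since the slides show the algorithm only as figures, here is a minimal numpy sketch of Viterbi decoding: the same recursion as the forward algorithm, with max and backpointers in place of the sum. It can be run with the toy A, B, pi matrices from the forward sketch above, e.g. viterbi(A, B, pi, [0, 2, 1, 1]).

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Return the most likely state sequence for obs and its (joint) probability."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))            # best path score ending in state j at time t
    psi = np.zeros((T, N), dtype=int)   # backpointer: best previous state
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A        # scores[i, j] = delta[t-1, i] * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]              # best final state
    for t in range(T - 1, 0, -1):                 # follow backpointers to the start
        path.append(int(psi[t, path[-1]]))
    return list(reversed(path)), float(delta[-1].max())
```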
73
73 The Learning Problem: Baum-Welch: Baum-Welch = the Forward-Backward Algorithm (Baum 1972). It is a special case of the EM (Expectation-Maximization) algorithm (Dempster, Laird, Rubin). The algorithm lets us train the transition probabilities A = {a_ij} and the emission probabilities B = {b_i(o_t)} of the HMM.
74
74 Summary: Baum-Welch Algorithm: 1) Initialize λ = (A, B, π). 2) Compute the forward and backward probabilities α and β and the expected counts γ and ξ. 3) Estimate a new λ' = (A, B, π). 4) Replace λ with λ'. 5) If not converged, go to 2. Back to Acoustic Modeling
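A compact sketch of those five steps for a discrete-output HMM and a single observation sequence; this is my own unscaled illustration, so a production implementation would work in log space or with scaling to avoid underflow and would test convergence of P(O|λ) rather than running a fixed number of iterations.

```python
import numpy as np

def forward_backward(A, B, pi, obs):
    """Unscaled forward (alpha) and backward (beta) probabilities."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N)); beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return alpha, beta

def baum_welch(obs, N, M, n_iter=20, seed=0):
    """Re-estimate lambda = (A, B, pi) from one symbol sequence obs with values in 0..M-1."""
    obs = np.asarray(obs)
    rng = np.random.default_rng(seed)
    A = rng.random((N, N)); A /= A.sum(axis=1, keepdims=True)   # 1) initialize lambda
    B = rng.random((N, M)); B /= B.sum(axis=1, keepdims=True)
    pi = np.full(N, 1.0 / N)
    T = len(obs)
    for _ in range(n_iter):
        alpha, beta = forward_backward(A, B, pi, obs)           # 2) compute alpha, beta
        likelihood = alpha[-1].sum()                            #    P(O | lambda)
        gamma = alpha * beta / likelihood                       #    P(q_t = i | O, lambda)
        xi = np.zeros((T - 1, N, N))                            #    P(q_t = i, q_{t+1} = j | O)
        for t in range(T - 1):
            xi[t] = (alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]) / likelihood
        pi = gamma[0]                                           # 3) estimate new lambda'
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        for k in range(M):
            B[:, k] = gamma[obs == k].sum(axis=0)
        B /= gamma.sum(axis=0)[:, None]                         # 4) replace lambda with lambda'
    return A, B, pi                                             # 5) stop after n_iter iterations

A, B, pi = baum_welch([0, 1, 1, 2, 0, 1, 2, 2, 1, 0], N=2, M=3)
```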