Speech Processing Speech Recognition


1 Speech Processing Speech Recognition
August 12, 2005 11/17/2018

2 Speech Recognition
Applications of Speech Recognition (ASR):
Dictation
Telephone-based information (directions, air travel, banking, etc.)
Hands-free control (in the car)
Speaker identification
Language identification
Second-language ('L2') learning (accent reduction)
Audio archive searching

3 LVCSR
Large Vocabulary Continuous Speech Recognition
~20,000-64,000 words
Speaker-independent (vs. speaker-dependent)
Continuous speech (vs. isolated-word)

4 LVCSR Design Intuition
Build a statistical model of the speech-to-words process
Collect lots and lots of speech, and transcribe all the words
Train the model on the labeled speech
Paradigm: supervised machine learning + search

5 Speech Recognition Architecture
Speech waveform -> spectral feature vectors -> phone likelihoods P(o|q) -> words
1. Feature extraction (signal processing)
2. Acoustic model: phone likelihood estimation (Gaussians or neural networks)
3. HMM lexicon
4. Language model (N-gram grammar)
5. Decoder (Viterbi or stack decoder)

6 The Noisy Channel Model
Search through the space of all possible sentences.
Pick the one that is most probable given the waveform.

7 The Noisy Channel Model (II)
What is the most likely sentence out of all sentences in the language L, given some acoustic input O?
Treat the acoustic input O as a sequence of individual observations: O = o1, o2, o3, ..., ot
Define a sentence as a sequence of words: W = w1, w2, w3, ..., wn

8 Noisy Channel Model (III)
Probabilistic implication: pick the W with the highest probability:
    Ŵ = argmax_{W in L} P(W|O)
We can use Bayes' rule to rewrite this:
    Ŵ = argmax_{W in L} P(O|W) P(W) / P(O)
Since the denominator P(O) is the same for each candidate sentence W, we can ignore it for the argmax:
    Ŵ = argmax_{W in L} P(O|W) P(W)
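In log space the product P(O|W) P(W) becomes a sum, and the argmax is just a comparison of summed scores. A minimal sketch of the selection step, using toy candidate sentences and made-up log scores (all names and numbers here are illustrative, not from the slides):

```python
def best_sentence(candidates, acoustic_ll, lm_ll):
    """Pick argmax_W P(O|W) * P(W); in log space the product becomes a sum."""
    return max(candidates, key=lambda w: acoustic_ll[w] + lm_ll[w])

# Toy candidates with made-up log scores, purely illustrative.
candidates = ["recognize speech", "wreck a nice beach"]
acoustic_ll = {"recognize speech": -12.0, "wreck a nice beach": -11.5}
lm_ll = {"recognize speech": -3.0, "wreck a nice beach": -7.0}

# The slightly worse acoustic score is outweighed by the much better prior.
print(best_sentence(candidates, acoustic_ll, lm_ll))  # recognize speech
```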

9 A quick derivation of Bayes' Rule
Conditionals: P(A|B) = P(A, B) / P(B)
Rearranging: P(A, B) = P(A|B) P(B)
And also: P(A, B) = P(B|A) P(A)

10 Bayes (II)
We know P(A|B) P(B) = P(A, B) = P(B|A) P(A)
So rearranging things: P(A|B) = P(B|A) P(A) / P(B)

11 Noisy channel model
Ŵ = argmax_{W in L} P(O|W) P(W), where P(O|W) is the likelihood (the acoustic model) and P(W) is the prior (the language model)

12 The noisy channel model
Ignoring the denominator leaves us with two factors: P(Source) and P(Signal|Source)

13 Speech Architecture meets Noisy Channel

14 Five easy pieces
Feature extraction
Acoustic modeling
HMMs, lexicons, and pronunciation
Decoding
Language modeling

15 Feature Extraction
Digitize speech
Extract frames

16 Digitizing Speech

17 Digitizing Speech (A-D)
Sampling: measuring the amplitude of the signal at time t
16,000 Hz (samples/sec): microphone ("wideband")
8,000 Hz (samples/sec): telephone
Why? We need at least 2 samples per cycle, so the maximum measurable frequency is half the sampling rate (the Nyquist limit)
Human speech < 10,000 Hz, so we need at most a 20 kHz sampling rate
Telephone speech is filtered at 4 kHz, so 8 kHz is enough

18 Digitizing Speech (II)
Quantization: representing the real value of each amplitude as an integer
8-bit (-128 to 127) or 16-bit (-32768 to 32767)
Formats:
16-bit PCM
8-bit mu-law (log compression)
Byte order: LSB-first (Intel) vs. MSB-first (Sun, Apple)
Headers:
Raw (no header)
Microsoft .wav
Sun .au (40-byte header)
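The two quantization schemes above can be sketched in a few lines. The rounding details and the companding constant mu = 255 are the usual North American telephony convention, assumed here rather than stated on the slide:

```python
import numpy as np

def to_pcm16(x):
    """Quantize samples in [-1.0, 1.0] to 16-bit integers (-32768 to 32767)."""
    return np.clip(np.round(x * 32767), -32768, 32767).astype(np.int16)

def mu_law_encode(x, mu=255.0):
    """8-bit mu-law: log-compress the amplitude, then map [-1, 1] to 0..255."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.round((y + 1.0) / 2.0 * 255.0).astype(np.uint8)

samples = np.array([0.0, 0.5, -1.0, 1.0])
print(to_pcm16(samples))       # 0, 16384, -32767, 32767
print(mu_law_encode(samples))  # small inputs get proportionally more codes
```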

19 Frame Extraction
A frame (25 ms wide) is extracted every 10 ms
[Figure showing overlapping frames; from Simon Arnfield]
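The 25 ms / 10 ms framing scheme can be sketched as follows (the function name and defaults are mine, chosen to match the numbers on the slide):

```python
import numpy as np

def extract_frames(signal, sample_rate=16000, width_ms=25, shift_ms=10):
    """Slice a waveform into overlapping frames: 25 ms wide, one every 10 ms."""
    width = int(sample_rate * width_ms / 1000)   # 400 samples at 16 kHz
    shift = int(sample_rate * shift_ms / 1000)   # 160 samples at 16 kHz
    n = 1 + max(0, (len(signal) - width) // shift)
    return np.stack([signal[i * shift : i * shift + width] for i in range(n)])

one_second = np.zeros(16000)        # 1 s of silence at 16 kHz
frames = extract_frames(one_second)
print(frames.shape)                 # (98, 400): ~100 frames/s, 400 samples each
```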

20 MFCC (Mel Frequency Cepstral Coefficients)
Do an FFT to get spectral information (like the spectrogram/spectrum we saw earlier)
Apply Mel scaling: linear below 1 kHz, logarithmic above, with equal numbers of samples above and below 1 kHz
Models the human ear: more sensitivity at lower frequencies
Plus a Discrete Cosine Transformation
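The Mel scaling step is commonly implemented with the formula below (linear below roughly 1 kHz, logarithmic above); the constants 2595 and 700 are the standard convention, assumed here rather than given on the slide:

```python
import numpy as np

def hz_to_mel(f):
    """Mel scale: roughly linear below 1 kHz, logarithmic above."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Mel-spaced frequencies between 0 and 8 kHz (for 16 kHz audio): uniform
# spacing in mels means denser spacing in Hz at low frequencies.
centers = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(8000.0), 12))
```

With these constants, 1000 Hz maps to almost exactly 1000 mels, which is the calibration point of the scale.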

21 Final Feature Vector
39 features per 10 ms frame:
12 MFCC features
12 delta MFCC features
12 delta-delta MFCC features
1 (log) frame energy
1 delta (log) frame energy
1 delta-delta (log) frame energy
So each frame is represented by a 39-dimensional vector
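The delta and delta-delta features can be sketched as simple frame-to-frame differences. Real systems typically use a regression over several neighboring frames; this is the simplest variant, and the function name is mine:

```python
import numpy as np

def add_deltas(features):
    """Append delta and delta-delta features to per-frame base features.
    features: (n_frames, 13) = 12 MFCCs + 1 log energy -> (n_frames, 39)."""
    # Prepending the first row keeps the output the same length as the input.
    delta = np.diff(features, axis=0, prepend=features[:1])
    delta2 = np.diff(delta, axis=0, prepend=delta[:1])
    return np.hstack([features, delta, delta2])

feats = np.random.randn(100, 13)   # 100 frames of 12 MFCCs + 1 log energy
print(add_deltas(feats).shape)     # (100, 39)
```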

22 Where we are
Given: a sequence of acoustic feature vectors, one every 10 ms
Goal: output a string of words
We'll spend 6 lectures on how to do this
Rest of today: Markov models; hidden Markov models in the abstract; the forward algorithm; the Viterbi algorithm; the start of HMMs for speech

23 Acoustic Modeling
Given a 39-dimensional vector oi corresponding to the observation of one frame, and a phone q we want to detect, compute p(oi|q)
Most popular method: GMM (Gaussian mixture models)
Other methods: MLP (multi-layer perceptron)

24 Acoustic Modeling: MLP computes p(q|o)

25 Gaussian Mixture Models
Also called "fully continuous HMMs"
P(o|q) is computed by a Gaussian:
    p(o|q) = (1 / sqrt(2 pi sigma^2)) exp(-(o - mu)^2 / (2 sigma^2))

26 Gaussians for Acoustic Modeling
A Gaussian is parameterized by a mean and a variance; different phones have different means
P(o|q) is highest for observations o at the mean, and low for observations very far from the mean

27 Training Gaussians
A (single) Gaussian is characterized by a mean and a variance
Imagine that we had some training data in which each phone was labeled
We could just compute the mean and variance from the data:
    mu_q = (1/N) sum_i o_i        sigma_q^2 = (1/N) sum_i (o_i - mu_q)^2
(summing over the N observations labeled with phone q)
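A sketch of this training step for a single univariate Gaussian: the maximum-likelihood mean and variance over the labeled observations, plus the density p(o|q) itself (function names are mine):

```python
import math

def train_gaussian(samples):
    """ML estimates: mean and variance of the observations labeled with one phone."""
    mu = sum(samples) / len(samples)
    var = sum((x - mu) ** 2 for x in samples) / len(samples)
    return mu, var

def gaussian_pdf(o, mu, var):
    """p(o|q) for a single univariate Gaussian with mean mu and variance var."""
    return math.exp(-((o - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

# Three toy observations labeled with the same phone:
mu, var = train_gaussian([1.0, 2.0, 3.0])   # mu = 2.0, var = 2/3
# The likelihood peaks at the mean and falls off away from it:
assert gaussian_pdf(2.0, mu, var) > gaussian_pdf(3.5, mu, var)
```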

28 But we need 39 Gaussians, not 1!
The observation o is really a vector of length 39
So we need a vector of Gaussians (a mean and a variance per dimension)

29 Actually, a mixture of Gaussians
Each phone is modeled by a weighted sum of different Gaussians
Hence able to model complex facts about the data (e.g., differently shaped distributions for Phone A and Phone B)

30 Gaussian acoustic modeling
Summary: each phone is represented by a GMM parameterized by:
M mixture weights
M mean vectors
M covariance matrices
Usually we assume the covariance matrix is diagonal, i.e., we just keep a separate variance for each cepstral feature
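Evaluating log p(o|q) under such a diagonal-covariance GMM can be sketched as below. The array shapes and the log-sum-exp formulation are my additions for numerical stability, not something stated on the slide:

```python
import numpy as np

def gmm_log_likelihood(o, weights, means, variances):
    """log p(o|q) under a diagonal-covariance GMM.
    o: (D,) observation; weights: (M,); means, variances: (M, D)."""
    # Per-component log N(o; mu_m, diag(var_m)), using only per-dim variances.
    log_comp = -0.5 * (np.sum(np.log(2 * np.pi * variances), axis=1)
                       + np.sum((o - means) ** 2 / variances, axis=1))
    # Weighted log-sum-exp over the M mixture components.
    return np.logaddexp.reduce(np.log(weights) + log_comp)

# Toy 2-component model in 3 dimensions:
w = np.array([0.6, 0.4])
mu = np.zeros((2, 3)); mu[1] += 2.0   # second component centered at (2, 2, 2)
var = np.ones((2, 3))
ll = gmm_log_likelihood(np.zeros(3), w, mu, var)
```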

31 ASR Lexicon: Markov Models for pronunciation

32 The Hidden Markov Model

33 Formal definition of an HMM
States: a set of states Q = q1, q2, ..., qN
Transition probabilities: a set of probabilities A = a01, a02, ..., an1, ..., ann, where each aij represents P(j|i)
Observation likelihoods: a set of likelihoods B = bi(ot), the probability that state i generated observation ot
Special non-emitting initial and final states
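These pieces can be written out directly. For simplicity, this sketch replaces the special non-emitting initial state with an initial distribution pi, uses discrete observation symbols, and includes the forward algorithm (listed earlier as one of today's topics) to show how A and B combine into a total likelihood P(O); all names and numbers are illustrative:

```python
def forward(pi, A, b, obs):
    """P(O | HMM) via the forward algorithm.
    pi[i]: P(start in state i); A[i][j]: P(j|i);
    b[i][o]: P(observation o | state i); obs: list of observation indices."""
    N = len(pi)
    # Initialization: probability of starting in i and emitting the first symbol.
    alpha = [pi[i] * b[i][obs[0]] for i in range(N)]
    # Induction: sum over all predecessor states, then emit the next symbol.
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * b[j][o]
                 for j in range(N)]
    return sum(alpha)  # total probability over all state sequences

# Two-state toy HMM with two discrete observation symbols (0 and 1):
pi = [0.8, 0.2]
A = [[0.7, 0.3], [0.4, 0.6]]
b = [[0.9, 0.1], [0.2, 0.8]]
p = forward(pi, A, b, [0, 1])
print(p)  # matches the brute-force sum over all 4 state sequences: 0.244
```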

34 Pieces of the HMM
Observation likelihoods ('b'), p(o|q), represent the acoustics of each phone and are computed by the Gaussians (the "Acoustic Model", or AM)
Transition probabilities represent the probability of different pronunciations (different sequences of phones)
States correspond to phones

35 Pieces of the HMM
Actually, I lied when I said states correspond to phones: states usually correspond to triphones
CHEESE (phones): ch iy z
CHEESE (triphones): #-ch+iy, ch-iy+z, iy-z+#
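The phone-to-triphone expansion in the CHEESE example can be sketched as follows (the function name is mine; '#' marks word-boundary context as on the slide):

```python
def to_triphones(phones):
    """Expand a phone sequence into context-dependent triphones,
    using '#' for word-boundary context (as in the CHEESE example)."""
    padded = ["#"] + phones + ["#"]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

print(to_triphones(["ch", "iy", "z"]))
# ['#-ch+iy', 'ch-iy+z', 'iy-z+#']
```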

36 Pieces of the HMM
Actually, I lied again when I said states correspond to triphones
In fact, each triphone has 3 states, for the beginning, middle, and end of the triphone

37 A real HMM

38 Cross-word triphones
Word-internal context-dependent models for 'OUR LIST': SIL AA+R AA-R L+IH L-IH+S IH-S+T S-T
Cross-word context-dependent models: SIL-AA+R AA-R+L R-L+IH L-IH+S IH-S+T S-T+SIL

39 Summary
ASR architecture: the noisy channel model
Five easy pieces of an ASR system:
Feature extraction: 39 "MFCC" features
Acoustic model: Gaussians for computing p(o|q)
Lexicon/pronunciation model: HMM
Next time: decoding, i.e., how to combine these to compute words from speech!

40 Perceptual properties
Pitch: the perceptual correlate of frequency
Loudness: the perceptual correlate of power, which is related to the square of the amplitude


