Speech Processing Speech Recognition

Presentation transcript:

Speech Processing Speech Recognition August 12, 2005 11/17/2018

Speech Recognition
Applications of speech recognition (ASR):
- Dictation
- Telephone-based information (directions, air travel, banking, etc.)
- Hands-free (in car)
- Speaker identification
- Language identification
- Second language ('L2') (accent reduction)
- Audio archive searching

LVCSR
Large Vocabulary Continuous Speech Recognition:
- ~20,000-64,000 words
- Speaker-independent (vs. speaker-dependent)
- Continuous speech (vs. isolated-word)

LVCSR Design Intuition
- Build a statistical model of the speech-to-words process.
- Collect lots and lots of speech, and transcribe all the words.
- Train the model on the labeled speech.
- Paradigm: supervised machine learning + search.

Speech Recognition Architecture
Data flow: speech waveform -> spectral feature vectors -> phone likelihoods P(o|q) -> words
1. Feature extraction (signal processing)
2. Acoustic model: phone likelihood estimation (Gaussians or neural networks)
3. HMM lexicon
4. Language model (N-gram grammar)
5. Decoder (Viterbi or stack decoder)

The Noisy Channel Model
Search through the space of all possible sentences. Pick the one that is most probable given the waveform.

The Noisy Channel Model (II)
What is the most likely sentence out of all sentences in the language L given some acoustic input O?
- Treat the acoustic input O as a sequence of individual observations: O = o1, o2, o3, …, ot
- Define a sentence as a sequence of words: W = w1, w2, w3, …, wn

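Noisy Channel Model (III)
Probabilistic implication: pick the highest-probability W:
$$\hat{W} = \arg\max_{W \in L} P(W \mid O)$$
We can use Bayes' rule to rewrite this:
$$\hat{W} = \arg\max_{W \in L} \frac{P(O \mid W)\, P(W)}{P(O)}$$
Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:
$$\hat{W} = \arg\max_{W \in L} P(O \mid W)\, P(W)$$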

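A quick derivation of Bayes' Rule
Conditionals:
$$P(A \mid B) = \frac{P(A, B)}{P(B)} \qquad P(B \mid A) = \frac{P(A, B)}{P(A)}$$
Rearranging:
$$P(A \mid B)\, P(B) = P(A, B)$$
And also:
$$P(B \mid A)\, P(A) = P(A, B)$$

Bayes (II)
We know:
$$P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$$
So rearranging things:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$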

Noisy channel model
$$\hat{W} = \arg\max_{W \in L} \underbrace{P(O \mid W)}_{\text{likelihood}}\; \underbrace{P(W)}_{\text{prior}}$$

The noisy channel model
Ignoring the denominator leaves us with two factors: P(Source) and P(Signal|Source).
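As a rough illustration of the argmax over candidate sentences, the sketch below scores each candidate by log P(O|W) + log P(W) and keeps the best one. The candidate sentences, the probabilities, and the function name `best_sentence` are all made up for illustration; a real recognizer gets these scores from its acoustic and language models and searches a far larger space.

```python
import math

def best_sentence(candidates):
    """Return the word string W that maximizes log P(O|W) + log P(W)."""
    best = max(candidates, key=lambda c: c["log_likelihood"] + c["log_prior"])
    return best["words"]

# Hypothetical scores for a single utterance (numbers invented for illustration).
candidates = [
    {"words": "recognize speech",
     "log_likelihood": math.log(0.0010),   # P(O|W), from the acoustic model
     "log_prior": math.log(0.020)},        # P(W), from the language model
    {"words": "wreck a nice beach",
     "log_likelihood": math.log(0.0012),
     "log_prior": math.log(0.001)},
]

print(best_sentence(candidates))  # recognize speech
```

The second candidate fits the acoustics slightly better here, but the prior outweighs that difference, which is the point of combining the two factors.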

Speech Architecture meets Noisy Channel

Five easy pieces
- Feature extraction
- Acoustic modeling
- HMMs, lexicons, and pronunciation
- Decoding
- Language modeling

Feature Extraction
- Digitize speech
- Extract frames

Digitizing Speech

Digitizing Speech (A-D)
Sampling: measuring the amplitude of the signal at time t.
- 16,000 Hz (samples/sec): microphone (“wideband”)
- 8,000 Hz (samples/sec): telephone
Why?
- We need at least 2 samples per cycle, so the maximum measurable frequency is half the sampling rate.
- Human speech is < 10,000 Hz, so we need at most 20K.
- Telephone speech is filtered at 4K, so 8K is enough.
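The "half the sampling rate" rule can be checked with a couple of one-line helpers. This is only a sketch of the arithmetic; the function names are hypothetical.

```python
def nyquist_frequency(sample_rate_hz):
    """Highest frequency that can be measured at a given sampling rate."""
    return sample_rate_hz / 2.0

def min_sample_rate(max_frequency_hz):
    """Lowest sampling rate that gives 2 samples per cycle at max_frequency_hz."""
    return 2.0 * max_frequency_hz

print(nyquist_frequency(16000))  # 8000.0 (wideband microphone speech)
print(nyquist_frequency(8000))   # 4000.0 (telephone speech)
print(min_sample_rate(10000))    # 20000.0
```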

Digitizing Speech (II)
Quantization: representing the real value of each amplitude as an integer, 8-bit (-128 to 127) or 16-bit (-32768 to 32767).
Formats:
- 16-bit PCM
- 8-bit mu-law (log compression)
- Byte order: LSB (Intel) vs. MSB (Sun, Apple)
Headers:
- Raw (no header)
- Microsoft wav
- Sun .au
Figure label: 40 byte header
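A minimal sketch of the two quantization styles, assuming amplitudes have already been scaled to the range [-1.0, 1.0]. The function names are hypothetical, and the mu-law part uses the textbook continuous companding curve with mu = 255; it is not the exact bit layout of any telephony codec.

```python
import math

def quantize_16bit(x):
    """Linear PCM: map an amplitude in [-1.0, 1.0] to a signed 16-bit integer."""
    x = max(-1.0, min(1.0, x))          # clip out-of-range input
    return int(round(x * 32767))

def mu_law_compress(x, mu=255):
    """Log compression: small amplitudes get a larger share of the output range."""
    x = max(-1.0, min(1.0, x))
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def quantize_8bit_mu_law(x):
    """Compress with mu-law, then quantize the result to a signed 8-bit integer."""
    return int(round(mu_law_compress(x) * 127))

for amplitude in (0.01, 0.1, 1.0):
    print(amplitude, quantize_16bit(amplitude), quantize_8bit_mu_law(amplitude))
```

Quiet samples keep more resolution after mu-law compression than they would with a linear 8-bit scale, which is why 8 bits are workable for telephone speech.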

Frame Extraction
A frame (25 ms wide) is extracted every 10 ms, giving frames a1, a2, a3, ...
Figure from Simon Arnfield
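The framing step can be sketched as a sliding window over the sample list. The 16,000 Hz rate is an assumption taken from the wideband figure earlier in the slides, the function name is hypothetical, and windowing functions (such as Hamming) are left out.

```python
def extract_frames(samples, sample_rate_hz=16000, frame_ms=25, shift_ms=10):
    """Cut a list of samples into overlapping frames (25 ms wide, every 10 ms)."""
    frame_len = sample_rate_hz * frame_ms // 1000   # 400 samples at 16 kHz
    shift = sample_rate_hz * shift_ms // 1000       # 160 samples at 16 kHz
    frames = []
    start = 0
    while start + frame_len <= len(samples):        # drop a trailing partial frame
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames

one_second = [0.0] * 16000            # one second of silence as dummy input
frames = extract_frames(one_second)
print(len(frames), len(frames[0]))    # 98 400
```

Consecutive frames share 15 ms of signal, so each sample contributes to more than one frame.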

MFCC (Mel Frequency Cepstral Coefficients)
- Do an FFT to get spectral information, like the spectrogram/spectrum we saw earlier.
- Apply Mel scaling: linear below 1 kHz, log above, with equal samples above and below 1 kHz. This models the human ear, which is more sensitive at lower frequencies.
- Plus a Discrete Cosine Transformation.
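To make the Mel scaling step concrete, here is a sketch using one commonly used closed-form approximation, mel = 2595 * log10(1 + f / 700). That formula is an assumption on my part; the slides do not say which Mel formula they use. The helper names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Approximate Mel value for a frequency in Hz (roughly linear below 1 kHz)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_spaced_centers(low_hz, high_hz, n_filters):
    """Filter center frequencies equally spaced on the Mel scale."""
    lo, hi = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]

centers = mel_spaced_centers(0.0, 8000.0, 10)
print([round(c) for c in centers])
```

The printed centers crowd together at low frequencies and spread out at high ones, which is how the filterbank gives more resolution where the ear is more sensitive.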

Final Feature Vector
39 features per 10 ms frame:
- 12 MFCC features
- 12 Delta MFCC features
- 12 Delta-Delta MFCC features
- 1 (log) frame energy
- 1 Delta (log) frame energy
- 1 Delta-Delta (log) frame energy
So each frame is represented by a 39D vector.
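The 39 dimensions can be assembled from 13 static values per frame (12 MFCCs plus log energy) as sketched below. The delta here is a simple central difference with the edge frames repeated, which is an assumption made for brevity; front ends often use a regression over a wider window instead. Function names are hypothetical.

```python
def deltas(frames):
    """Central-difference deltas over time, repeating the first and last frame."""
    n = len(frames)
    out = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return out

def add_dynamic_features(static):
    """static: list of 13-value frames -> list of 39-value frames."""
    d = deltas(static)       # 13 delta features
    dd = deltas(d)           # 13 delta-delta features
    return [s + x + y for s, x, y in zip(static, d, dd)]

# Dummy input: 5 frames whose 13 static values all equal the frame index.
static = [[float(t)] * 13 for t in range(5)]
full = add_dynamic_features(static)
print(len(full), len(full[0]))  # 5 39
```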

Where we are
Given: a sequence of acoustic feature vectors, one every 10 ms.
Goal: output a string of words.
We’ll spend 6 lectures on how to do this.
Rest of today:
- Markov Models
- Hidden Markov Models in the abstract
- Forward Algorithm
- Viterbi Algorithm
- Start of HMMs for speech

Acoustic Modeling
Given a 39D vector corresponding to the observation of one frame, oi, and given a phone q we want to detect, compute p(oi|q).
- Most popular method: GMM (Gaussian mixture models)
- Other methods: MLP (multi-layer perceptron)

Acoustic Modeling: MLP computes p(q|o)

Gaussian Mixture Models
Also called “fully-continuous HMMs”. P(o|q) is computed by a Gaussian:
$$p(o \mid q) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(o - \mu)^2}{2\sigma^2}\right)$$

Gaussians for Acoustic Modeling
A Gaussian is parameterized by a mean and a variance.
[Figure: Gaussians with different means, plotting P(o|q) against o. P(o|q) is highest at the mean and low very far from the mean.]

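Training Gaussians
A (single) Gaussian is characterized by a mean and a variance. Imagine that we had some training data in which each phone was labeled. We could just compute the mean and variance from the data:
$$\mu = \frac{1}{T} \sum_{t=1}^{T} o_t \qquad \sigma^2 = \frac{1}{T} \sum_{t=1}^{T} (o_t - \mu)^2$$
where the sums run over the T frames labeled with the phone.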
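Computing the mean and variance from labeled frames, and then scoring a new observation, might look like the sketch below. It handles a single feature dimension, the training values are invented, and the function names are hypothetical.

```python
import math

def fit_gaussian(values):
    """Maximum-likelihood mean and variance of the frames labeled with one phone."""
    t = len(values)
    mean = sum(values) / t
    var = sum((v - mean) ** 2 for v in values) / t
    return mean, var

def gaussian_pdf(o, mean, var):
    """Density p(o|q) of a one-dimensional Gaussian with the given mean and variance."""
    return math.exp(-((o - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Hypothetical values of one feature for all frames labeled with one phone.
frames_for_phone = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, var = fit_gaussian(frames_for_phone)
print(mean, var)  # 3.0 2.0

# The density is highest at the mean and falls off far from it.
print(gaussian_pdf(3.0, mean, var) > gaussian_pdf(10.0, mean, var))  # True
```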

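But we need 39 Gaussians, not 1!
The observation o is really a vector of length 39, so we need a vector of Gaussians:
$$p(\vec{o} \mid q) = \prod_{d=1}^{39} \frac{1}{\sigma_d\sqrt{2\pi}} \exp\left(-\frac{(o_d - \mu_d)^2}{2\sigma_d^2}\right)$$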

Actually, a mixture of Gaussians
Each phone is modeled by a sum of different Gaussians, and is hence able to model complex facts about the data.
[Figure: mixture densities for Phone A and Phone B]

Gaussians acoustic modeling
Summary: each phone is represented by a GMM parameterized by:
- M mixture weights
- M mean vectors
- M covariance matrices
We usually assume the covariance matrix is diagonal, i.e. we just keep a separate variance for each cepstral feature.
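A diagonal-covariance GMM score for one frame can be sketched as follows. It works in the log domain with a log-sum-exp over the M components, which is a common numerical choice rather than something the slides specify. The function names and the toy parameters are hypothetical.

```python
import math

def log_diag_gaussian(o, means, variances):
    """Log density of a Gaussian with diagonal covariance (one variance per feature)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(o, means, variances)
    )

def gmm_log_likelihood(o, weights, means, variances):
    """log p(o|q) for a mixture of M diagonal Gaussians."""
    logs = [
        math.log(w) + log_diag_gaussian(o, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)  # log-sum-exp keeps tiny probabilities from underflowing
    return top + math.log(sum(math.exp(value - top) for value in logs))

# Toy 2-component mixture over 2-dimensional observations (made-up parameters).
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
print(gmm_log_likelihood([0.1, -0.2], weights, means, variances))
```

With 39 features and a diagonal covariance, each component stores 39 means and 39 variances instead of a full 39 x 39 matrix.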

ASR Lexicon: Markov Models for pronunciation

The Hidden Markov model

Formal definition of HMM
- States: a set of states Q = q1, q2, …, qN
- Transition probabilities: a set of probabilities A = a01, a02, …, an1, …, ann. Each aij represents P(j|i), the probability of moving from state i to state j.
- Observation likelihoods: a set of likelihoods B = bi(ot), the probability that state i generated observation ot
- Special non-emitting initial and final states
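One way to hold the pieces of this definition in code is sketched below. The class layout, the `END` marker, and the probabilities are all assumptions made for illustration. The observation likelihoods bi(ot) are left out because they would come from the acoustic model.

```python
from dataclasses import dataclass

END = "<end>"  # hypothetical name for the non-emitting final state

@dataclass
class HMM:
    states: list   # emitting states q1..qN
    start: dict    # start[j] = a_0j, leaving the non-emitting initial state
    trans: dict    # trans[i][j] = a_ij = P(j | i); j may be END

    def is_valid(self, tol=1e-9):
        """Check that every row of transition probabilities sums to 1."""
        rows = [self.start] + [self.trans[s] for s in self.states]
        return all(abs(sum(row.values()) - 1.0) < tol for row in rows)

# Left-to-right toy model over the phones of CHEESE (made-up probabilities).
cheese = HMM(
    states=["ch", "iy", "z"],
    start={"ch": 1.0},
    trans={
        "ch": {"ch": 0.6, "iy": 0.4},   # self-loop lets a phone last several frames
        "iy": {"iy": 0.7, "z": 0.3},
        "z": {"z": 0.5, END: 0.5},
    },
)
print(cheese.is_valid())  # True
```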

Pieces of the HMM
- Observation likelihoods (‘b’), p(o|q), represent the acoustics of each phone and are computed by the Gaussians (“Acoustic Model”, or AM).
- Transition probabilities represent the probability of different pronunciations (different sequences of phones).
- States correspond to phones.

Pieces of the HMM
Actually, I lied when I said states correspond to phones. States usually correspond to triphones.
- CHEESE (phones): ch iy z
- CHEESE (triphones): #-ch+iy, ch-iy+z, iy-z+#
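The phone-to-triphone expansion in the CHEESE example follows a mechanical left-context, phone, right-context pattern, sketched below. The function name is hypothetical, and `#` marks the word boundary as in the example.

```python
def to_triphones(phones, boundary="#"):
    """Expand a phone sequence into left-phone+right triphone labels."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

print(to_triphones(["ch", "iy", "z"]))
# ['#-ch+iy', 'ch-iy+z', 'iy-z+#']
```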

Pieces of the HMM
Actually, I lied again when I said states correspond to triphones. In fact, each triphone has 3 states, for the beginning, middle, and end of the triphone.

A real HMM

Cross-word triphones
Example: ‘OUR LIST’
- Word-internal context-dependent models: SIL AA+R AA-R L+IH L-IH+S IH-S+T S-T
- Cross-word context-dependent models: SIL-AA+R AA-R+L R-L+IH L-IH+S IH-S+T S-T+SIL
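The difference between the two labelings can be sketched as below: word-internal models only look at neighbors inside the same word, while cross-word models also use the neighboring word (or silence) as context. The function names are hypothetical, and the sketch lists only the phone models, so the standalone SIL model that the word-internal list starts with is not generated.

```python
def label(left, phone, right):
    """Build a context-dependent label, omitting any missing context."""
    name = phone
    if left is not None:
        name = f"{left}-{name}"
    if right is not None:
        name = f"{name}+{right}"
    return name

def word_internal(words):
    """Context stops at word boundaries."""
    out = []
    for phones in words:
        for i, p in enumerate(phones):
            left = phones[i - 1] if i > 0 else None
            right = phones[i + 1] if i + 1 < len(phones) else None
            out.append(label(left, p, right))
    return out

def cross_word(words, silence="SIL"):
    """Context crosses word boundaries; silence surrounds the utterance."""
    flat = [silence] + [p for phones in words for p in phones] + [silence]
    return [label(flat[i - 1], flat[i], flat[i + 1]) for i in range(1, len(flat) - 1)]

our_list = [["AA", "R"], ["L", "IH", "S", "T"]]   # OUR LIST
print(word_internal(our_list))
print(cross_word(our_list))
```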

Summary
- ASR architecture and the Noisy Channel Model
- Five easy pieces of an ASR system:
  - Feature extraction: 39 “MFCC” features
  - Acoustic model: Gaussians for computing p(o|q)
  - Lexicon/pronunciation model: HMM
- Next time: decoding, how to combine these to compute words from speech!

Perceptual properties
- Pitch: perceptual correlate of frequency
- Loudness: perceptual correlate of power, which is related to the square of the amplitude
