Speech and Language Processing
Chapter 9 of SLP: Automatic Speech Recognition (I)

Outline for ASR
- ASR Architecture
- The Noisy Channel Model
- Five easy pieces of an ASR system:
  - Language Model
  - Lexicon/Pronunciation Model (HMM)
  - Feature Extraction
  - Acoustic Model
  - Decoder
- Training
- Evaluation

Applications of Speech Recognition (ASR)
- Dictation
- Telephone-based information (directions, air travel, banking, etc.)
- Hands-free (in car)
- Speaker identification
- Language identification
- Second language ('L2') learning (accent reduction)
- Audio archive searching

LVCSR: Large Vocabulary Continuous Speech Recognition
- ~20,000-64,000 word vocabulary
- Speaker-independent (vs. speaker-dependent)
- Continuous speech (vs. isolated-word)

Current error rates
Ballpark numbers; exact rates depend very much on the specific corpus.

Task                      Vocabulary   Error rate (%)
Digits                    11           0.5
WSJ read speech           5K           3
WSJ read speech           20K          3
Broadcast news            64,000+      10
Conversational telephone  64,000+      20

HSR versus ASR

Task               Vocab   ASR    Human SR
Continuous digits  11      0.5    0.009
WSJ 1995 clean     5K      3      0.9
WSJ 1995 w/noise   5K      9      1.1
SWBD 2004          65K     20     4

Conclusions:
- Machines are about 5 times worse than humans
- The gap increases with noisy speech
- These numbers are rough; take them with a grain of salt

Why is conversational speech harder?
- A piece of an utterance without context
- The same utterance with more context

LVCSR Design Intuition
- Build a statistical model of the speech-to-words process
- Collect lots and lots of speech, and transcribe all the words
- Train the model on the labeled speech
- Paradigm: supervised machine learning + search

Speech Recognition Architecture

The Noisy Channel Model
- Search through the space of all possible sentences
- Pick the one that is most probable given the waveform

The Noisy Channel Model (II)
- What is the most likely sentence out of all sentences in the language L, given some acoustic input O?
- Treat the acoustic input O as a sequence of individual observations: O = o_1, o_2, o_3, ..., o_t
- Define a sentence as a sequence of words: W = w_1, w_2, w_3, ..., w_n

The Noisy Channel Model (III)
- Probabilistic implication: pick the sentence W with the highest probability:

  $$\hat{W} = \operatorname*{argmax}_{W \in L} P(W \mid O)$$

- We can use Bayes' rule to rewrite this:

  $$\hat{W} = \operatorname*{argmax}_{W \in L} \frac{P(O \mid W)\,P(W)}{P(O)}$$

- Since the denominator is the same for each candidate sentence W, we can ignore it for the argmax:

  $$\hat{W} = \operatorname*{argmax}_{W \in L} P(O \mid W)\,P(W)$$

The noisy channel model

$$\hat{W} = \operatorname*{argmax}_{W \in L} \overbrace{P(O \mid W)}^{\text{likelihood}}\;\overbrace{P(W)}^{\text{prior}}$$

The noisy channel model
- Ignoring the denominator leaves us with two factors: P(Source) and P(Signal|Source)

Speech Architecture meets Noisy Channel

Architecture: Five easy pieces (only 3-4 for today)
- HMMs, lexicons, and pronunciation
- Feature extraction
- Acoustic modeling
- Decoding
- Language modeling (seen this already)

Lexicon
- A list of words, each one with a pronunciation in terms of phones
- We get these from an on-line pronunciation dictionary
- CMU dictionary: 127K words, http://www.speech.cs.cmu.edu/cgi-bin/cmudict
- We'll represent the lexicon as an HMM

HMMs for speech: the word "six"
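
To make the figure concrete, here is a minimal Python sketch (not from the slides) of a left-to-right HMM word model for "six" over the phones [s ih k s]; the state names and transition probabilities are made-up placeholders, not trained values.

```python
# Minimal sketch: a left-to-right HMM for the word "six", one state per
# phone. Each state has a self-loop (a phone can span many frames) and a
# transition to the next state. Probabilities are illustrative only.

states = ["s", "ih", "k", "s2"]          # final "s" renamed to keep keys unique

transitions = {
    "s":   {"s": 0.7,  "ih": 0.3},       # stay in the phone, or advance
    "ih":  {"ih": 0.7, "k": 0.3},
    "k":   {"k": 0.7,  "s2": 0.3},
    "s2":  {"s2": 0.7, "END": 0.3},      # "END" marks leaving the word
}

def successors(state):
    """Return the possible next states and their probabilities."""
    return transitions[state]

print(successors("ih"))                  # {'ih': 0.7, 'k': 0.3}
```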

Phones are not homogeneous!

Each phone has 3 subphones (beginning, middle, and end)

Resulting HMM word model for "six", with subphones

HMM for the digit recognition task

Detecting Phones
- Two stages:
  - Feature extraction: basically a slice of a spectrogram
  - Phone classification: using a GMM classifier

Discrete Representation of Signal
- Represent the continuous signal in discrete form

(Thanks to Bryan Pellom for this slide)

Digitizing the signal (A-D)
- Sampling: measuring the amplitude of the signal at time t
  - 16,000 Hz (samples/sec) for microphone ("wideband") speech
  - 8,000 Hz (samples/sec) for telephone speech
- Why these rates? We need at least 2 samples per cycle, so the maximum measurable frequency is half the sampling rate
  - Human speech is < 10,000 Hz, so we would need at most a 20 kHz sampling rate
  - Telephone speech is filtered at 4 kHz, so 8 kHz is enough
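
A small illustrative sketch of the 2-samples-per-cycle point, using numpy; the tone frequency and rates are assumed example values, not from the slides.

```python
# Minimal sketch: sampling a tone. A 3 kHz tone is fine at 16 kHz
# (>= 2 samples per cycle) but aliases at a 4 kHz sampling rate.
import numpy as np

def sample_tone(freq_hz, sample_rate_hz, duration_s=0.01):
    """Return a pure tone sampled at the given rate."""
    t = np.arange(0, duration_s, 1.0 / sample_rate_hz)
    return np.sin(2 * np.pi * freq_hz * t)

wideband = sample_tone(3000, 16000)   # ~5.3 samples per cycle: OK
too_slow = sample_tone(3000, 4000)    # < 2 samples per cycle: aliased
print(len(wideband), len(too_slow))   # 160 40
```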

Digitizing Speech (II)
- Quantization: representing the real value of each amplitude as an integer
  - 8-bit (-128 to 127) or 16-bit (-32768 to 32767)
- Formats:
  - 16-bit PCM
  - 8-bit mu-law (log compression)
- Byte order: LSB-first (Intel) vs. MSB-first (Sun, Apple)
- Headers:
  - Raw (no header)
  - Microsoft .wav
  - Sun .au (40-byte header)
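
A minimal sketch of both steps, assuming numpy; the mu = 255 companding constant is the one used in North American telephony, and the test values are made up.

```python
# Minimal sketch: quantizing floats in [-1, 1] to 16-bit PCM, and
# mu-law companding (log compression) of the same values.
import numpy as np

def to_pcm16(x):
    """Linear quantization of floats in [-1, 1] to int16."""
    return np.clip(np.round(x * 32767), -32768, 32767).astype(np.int16)

def mu_law(x, mu=255):
    """Log-compress floats in [-1, 1]; result is still in [-1, 1]."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

x = np.array([-1.0, -0.1, 0.0, 0.1, 1.0])
print(to_pcm16(x))          # [-32767  -3277      0   3277  32767]
print(mu_law(x).round(3))   # small amplitudes get more of the code range
```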

Discrete Representation of Signals
- Byte swapping: little-endian vs. big-endian
- Some audio formats have headers; headers contain meta-information such as sampling rate and recording conditions
- A "raw" file means no header; examples of formats with headers: Microsoft .wav, NIST SPHERE
- A nice sound manipulation tool: sox (change sampling rates, convert speech formats)

MFCC: Mel-Frequency Cepstral Coefficients

Pre-Emphasis
- Pre-emphasis: boosting the energy in the high frequencies
- Q: Why do this?
- A: The spectrum for voiced segments has more energy at lower frequencies than at higher frequencies; this is called spectral tilt
  - Spectral tilt is caused by the nature of the glottal pulse
- Boosting the high-frequency energy gives more information to the acoustic model and improves phone recognition performance
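
A minimal sketch of the pre-emphasis filter y[n] = x[n] - a*x[n-1], assuming numpy and the common coefficient a = 0.97.

```python
# Minimal sketch: first-order pre-emphasis. Differencing adjacent
# samples boosts high frequencies, counteracting spectral tilt.
import numpy as np

def pre_emphasize(signal, coeff=0.97):
    """Apply y[n] = x[n] - coeff * x[n-1]; first sample passes through."""
    return np.append(signal[0], signal[1:] - coeff * signal[:-1])
```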

Example of pre-emphasis
[Figure: a spectral slice from the vowel [aa], before and after pre-emphasis]

MFCC process: windowing

Windowing
- Why divide the speech signal into successive overlapping frames? Speech is not a stationary signal; we want information about a region small enough that its spectral content is a useful cue
- Frames:
  - Frame size: typically 10-25 ms
  - Frame shift: the length of time between successive frames, typically 5-10 ms
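
A minimal framing sketch, assuming numpy, a 16 kHz sampling rate, 25 ms frames, and a 10 ms shift.

```python
# Minimal sketch: slicing a signal into overlapping frames. At 16 kHz,
# a 25 ms frame is 400 samples and a 10 ms shift is 160 samples.
import numpy as np

def frame_signal(signal, frame_len, frame_shift):
    """Return an array of overlapping frames (last partial frame dropped)."""
    n_frames = 1 + (len(signal) - frame_len) // frame_shift
    return np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                     for i in range(n_frames)])

frames = frame_signal(np.zeros(16000), frame_len=400, frame_shift=160)
print(frames.shape)   # (98, 400): roughly 100 frames per second of audio
```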

Common window shapes
- Rectangular window: $w[n] = 1$ for $0 \le n \le L-1$, else $0$
- Hamming window: $w[n] = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{L-1}\right)$ for $0 \le n \le L-1$, else $0$
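
A minimal sketch of applying a Hamming window to every frame before the DFT, assuming numpy; the frame array is a random stand-in for real framed audio.

```python
# Minimal sketch: tapering each frame with a Hamming window to avoid
# the spectral leakage a rectangular window's abrupt edges would cause.
import numpy as np

frame_len = 400                           # 25 ms at 16 kHz
window = np.hamming(frame_len)            # 0.54 - 0.46*cos(2*pi*n/(L-1))
frames = np.random.randn(98, frame_len)   # stand-in for real framed audio
windowed = frames * window                # broadcasts across all frames
```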

Discrete Fourier Transform
- Input: windowed signal x[n]...x[m]
- Output: for each of N discrete frequency bands, a complex number X[k] representing the magnitude and phase of that frequency component in the original signal
- Standard algorithm for computing the DFT: the Fast Fourier Transform (FFT), with complexity N·log(N)
- In general, choose N = 512 or 1024
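
A minimal sketch of computing the magnitude spectrum of one windowed frame, assuming numpy and N = 512.

```python
# Minimal sketch: FFT of one windowed frame, zero-padded to N = 512.
import numpy as np

N = 512                                          # DFT size
frame = np.hamming(400) * np.random.randn(400)   # stand-in windowed frame
spectrum = np.fft.rfft(frame, n=N)               # complex X[k], k = 0..N/2
magnitude = np.abs(spectrum)                     # energy per frequency band
print(magnitude.shape)                           # (257,)
```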

Discrete Fourier Transform: computing a spectrum
[Figure: a 25 ms Hamming-windowed signal from [iy], and its spectrum as computed by the DFT (plus other smoothing)]

Mel-scale
- Human hearing is not equally sensitive to all frequency bands: it is less sensitive at higher frequencies, roughly above 1000 Hz
- I.e., human perception of frequency is non-linear

Mel-scale
- A mel is a unit of pitch
- Pairs of sounds that are perceptually equidistant in pitch are separated by an equal number of mels
- The mel scale is approximately linear below 1 kHz and logarithmic above 1 kHz
- Definition:

  $$\text{mel}(f) = 2595 \log_{10}\!\left(1 + \frac{f}{700}\right)$$

Mel Filter Bank Processing
- Filters uniformly spaced below 1 kHz, logarithmically spaced above 1 kHz

Log energy computation
- Take the log of the square magnitude of the output of the mel filterbank
- Why log?
  - The logarithm compresses the dynamic range of values
  - Human response to signal level is logarithmic: we are less sensitive to slight differences in amplitude at high amplitudes than at low amplitudes
  - It makes frequency estimates less sensitive to slight variations in the input (e.g., power variation due to the speaker's mouth moving closer to the microphone)
- Why square? Phase information is not helpful in speech
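
A minimal sketch of the filterbank-plus-log step, assuming numpy; the parameter choices (26 triangular filters, N = 512, 16 kHz) are common defaults, not values mandated by the slides.

```python
# Minimal sketch: triangular mel filters applied to a power spectrum,
# followed by the log, yielding one log energy per mel band.
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, sr=16000):
    """Triangular filters with centers evenly spaced on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fbank

power = np.abs(np.fft.rfft(np.random.randn(400), n=512)) ** 2  # stand-in
log_mel = np.log(mel_filterbank() @ power + 1e-10)             # 26 log energies
print(log_mel.shape)                                           # (26,)
```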

The Cepstrum
- One way to think about this: separating the source and the filter
- The speech waveform is created by a glottal source waveform passing through a vocal tract which, because of its shape, has a particular filtering characteristic
- Articulatory facts:
  - The vocal cord vibrations create harmonics
  - The mouth is an amplifier
  - Depending on the shape of the oral cavity, some harmonics are amplified more than others

Vocal Fold Vibration
- UCLA Phonetics Lab demo

George Miller figure

We care about the filter, not the source
- Most characteristics of the source (F0, details of the glottal pulse) don't matter for phone detection
- What we care about is the filter: the exact positions of the articulators in the oral tract
- So we want a way to separate these, and to use only the filter function

The Cepstrum
- The spectrum of the log of the spectrum
- [Figures: the spectrum; the log spectrum; the spectrum of the log spectrum]

Thinking about the Cepstrum
(Pictures from John Coleman, 2005)

Mel frequency cepstrum
- The cepstrum requires Fourier analysis
- But we're going from frequency space back to time, so we actually apply the inverse DFT
- Detail for signal processing gurus: since the log power spectrum is real and symmetric, the inverse DFT reduces to a Discrete Cosine Transform (DCT)
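
A minimal sketch of this step, assuming scipy: the DCT-II of the log mel energies yields the cepstral coefficients, of which we keep the first 12.

```python
# Minimal sketch: DCT of the log mel energies gives the cepstrum; the
# first 12 coefficients capture the coarse spectral envelope (the
# filter), while later ones carry fine detail such as F0.
import numpy as np
from scipy.fftpack import dct

log_mel = np.random.randn(26)                    # stand-in filterbank output
cepstrum = dct(log_mel, type=2, norm='ortho')    # DCT-II over the 26 bands
mfcc = cepstrum[:12]                             # keep 12 cepstral coefficients
```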

Another advantage of the cepstrum
- The DCT produces highly uncorrelated features
- We'll see when we get to acoustic modeling that these are much easier to model than the spectrum: they can be modeled simply by linear combinations of Gaussian density functions with diagonal covariance matrices
- In general we use just the first 12 cepstral coefficients (we don't want the later ones, which contain e.g. the F0 spike)

Dynamic cepstral coefficients
- The cepstral coefficients do not capture energy, so we add an energy feature
- Also, the speech signal is not constant (slope of the formants, change from stop burst to release), so we add the changes in the features (the slopes); we call these delta features
- We also add double-delta (acceleration) features
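
A minimal delta sketch, assuming numpy; real systems typically fit a regression over a window of +/-2 frames rather than this simple two-point slope.

```python
# Minimal sketch: deltas as the frame-to-frame slope of each
# coefficient; double-deltas (acceleration) are deltas of deltas.
import numpy as np

def deltas(features):
    """Per-frame slope of each feature column, same shape as input."""
    padded = np.pad(features, ((1, 1), (0, 0)), mode='edge')
    return (padded[2:] - padded[:-2]) / 2.0

mfcc = np.random.randn(98, 13)   # stand-in: 13 coefficients per frame
d = deltas(mfcc)                 # delta features
dd = deltas(d)                   # double-delta features
```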

Typical MFCC features
- Window size: 25 ms
- Window shift: 10 ms
- Pre-emphasis coefficient: 0.97
- Per frame:
  - 12 MFCC (mel frequency cepstral coefficients)
  - 1 energy feature
  - 12 delta MFCC features
  - 12 double-delta MFCC features
  - 1 delta energy feature
  - 1 double-delta energy feature
- Total: 39-dimensional features
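
For reference, a hedged sketch of assembling the 39-dimensional vector with the librosa library; librosa's defaults differ in detail from the recipe above, and "utterance.wav" is a hypothetical file name.

```python
# Minimal sketch: 13 base coefficients (librosa's c0 standing in for the
# energy feature) plus deltas and double-deltas = 39 dimensions.
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)      # hypothetical file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
d = librosa.feature.delta(mfcc)
dd = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, d, dd])                  # shape: (39, n_frames)
```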

Why is MFCC so popular?
- Efficient to compute
- Incorporates a perceptual mel frequency scale
- Separates the source and the filter
- The IDFT (DCT) decorrelates the features, which improves the diagonal-covariance assumption in HMM modeling
- An alternative: PLP (Perceptual Linear Prediction)

Next Time: Acoustic Modeling (= phone detection)
- Given a 39-dimensional vector corresponding to the observation of one frame o_i, and a phone q we want to detect, compute p(o_i | q)
- Most popular method: GMMs (Gaussian mixture models)
- Other methods: neural nets, CRFs, SVMs, etc.
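
A minimal sketch of the GMM approach, assuming scikit-learn; the training frames here are random stand-ins, and 8 mixture components is an arbitrary illustrative choice.

```python
# Minimal sketch: a diagonal-covariance GMM as the acoustic likelihood
# p(o | q) for one phone q, fit on frames labeled with that phone.
import numpy as np
from sklearn.mixture import GaussianMixture

frames_of_q = np.random.randn(500, 39)     # stand-in: 39-dim training frames
gmm_q = GaussianMixture(n_components=8, covariance_type='diag')
gmm_q.fit(frames_of_q)

o_i = np.random.randn(1, 39)               # one observation frame
log_p = gmm_q.score_samples(o_i)[0]        # log p(o_i | q)
```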

Summary
- ASR Architecture
- The Noisy Channel Model
- Five easy pieces of an ASR system:
  - Language Model
  - Lexicon/Pronunciation Model (HMM)
  - Feature Extraction
  - Acoustic Model
  - Decoder
- Training
- Evaluation