CS 224S / LINGUIST 285 Spoken Language Processing


CS 224S / LINGUIST 285 Spoken Language Processing Dan Jurafsky Stanford University Spring 2014 Lecture 6: Feature Extraction

Outline Feature extraction How to compute MFCCs Dealing with variation Adaptation MLLR MAP Lombard speech Foreign accent Pronunciation variation

Discrete Representation of Signal Representing a continuous signal in discrete form. Image from Bryan Pellom

Sampling Measuring the amplitude of a signal at time t The sample rate needs to give at least two samples for each cycle: one for the positive and one for the negative half of each cycle More than two samples per cycle is fine; fewer than two samples will cause frequencies to be missed So the maximum frequency that can be measured is half the sampling rate This maximum frequency for a given sampling rate is called the Nyquist frequency
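A minimal sketch of the Nyquist limit, assuming NumPy (tone frequency, duration, and sample rate are illustrative choices, not values from the slides): a 6 kHz tone sampled at 8 kHz aliases, because 6 kHz is above the Nyquist frequency of 8000/2 = 4000 Hz.

```python
import numpy as np

fs = 8000                            # sampling rate (Hz)
t = np.arange(0, 0.1, 1 / fs)        # 100 ms of sample times
x = np.sin(2 * np.pi * 6000 * t)     # 6 kHz tone, above the Nyquist frequency

spectrum = np.abs(np.fft.rfft(x))
freqs = np.fft.rfftfreq(len(x), 1 / fs)
print("apparent peak:", freqs[np.argmax(spectrum)], "Hz")  # ~2000 Hz, not 6000
```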

Sampling Original signal in red: if we measure only at the green dots, we will see a lower-frequency wave and miss the correct higher-frequency one!

Sampling In practice we use the following sample rates: 16,000 Hz (samples/sec) for microphones ("wideband"); 8,000 Hz (samples/sec) for telephone speech Why? We need at least 2 samples per cycle, so the maximum measurable frequency is half the sampling rate Human speech is < 10 kHz, so we need at most 20K samples/sec; telephone speech is filtered at 4 kHz, so 8K is enough.

Digitizing Speech (II) Quantization Representing real value of each amplitude as integer 8-bit (-128 to 127) or 16-bit (-32768 to 32767) Formats: 16 bit PCM 8 bit mu-law; log compression LSB (Intel) vs. MSB (Sun, Apple) Headers: Raw (no header) Microsoft wav Sun .au 40 byte header

WAV format

Discrete Representation of Signal Byte swapping: little-endian vs. big-endian Some audio formats have headers; headers contain meta-information such as the sampling rate and recording conditions A raw file is one with no header Examples of formats with headers: Microsoft wav, NIST SPHERE Nice sound manipulation tool: SoX http://sox.sourceforge.net/ (change sampling rate, convert speech formats)
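A minimal sketch of reading the meta-information in a Microsoft wav header, using only Python's standard-library wave module ("speech.wav" is a hypothetical filename):

```python
import wave

with wave.open("speech.wav", "rb") as w:
    print("sample rate :", w.getframerate(), "Hz")
    print("channels    :", w.getnchannels())
    print("sample width:", w.getsampwidth() * 8, "bits")
    print("num samples :", w.getnframes())
```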

MFCC Mel-Frequency Cepstral Coefficient (MFCC) Most widely used spectral representation in ASR

Pre-Emphasis Pre-emphasis: boosting the energy in the high frequencies Q: Why do this? A: The spectrum for voiced segments has more energy at lower frequencies than higher frequencies. This is called spectral tilt Spectral tilt is caused by the nature of the glottal pulse Boosting high-frequency energy gives more info to the Acoustic Model Improves phone recognition performance
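A minimal sketch of pre-emphasis, assuming NumPy: the first-order filter y[n] = x[n] - a*x[n-1], with the commonly used coefficient a = 0.97 (the value given later on the "Typical MFCC features" slide).

```python
import numpy as np

def pre_emphasize(signal, a=0.97):
    # keep the first sample, then subtract a scaled copy of the previous sample
    return np.append(signal[0], signal[1:] - a * signal[:-1])
```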

George Miller figure

Example of pre-emphasis Spectral slice from the vowel [aa] before and after pre-emphasis

MFCC

Windowing Image from Bryan Pellom

Windowing Why divide speech signal into successive overlapping frames? Speech is not a stationary signal; we want information about a small enough region that the spectral information is a useful cue. Frames Frame size: typically, 10-25ms Frame shift: the length of time between successive frames, typically, 5-10ms

Common window shapes Rectangular window: Hamming window
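A minimal sketch of framing plus Hamming windowing, assuming NumPy and a 16 kHz signal; the 400-sample frame (25 ms) and 160-sample shift (10 ms) match the typical values quoted later, and the function name is illustrative.

```python
import numpy as np

def frame_and_window(signal, frame_len=400, frame_shift=160):
    # slice the signal into overlapping frames and apply a Hamming window to each
    n_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * frame_shift : i * frame_shift + frame_len]
                       for i in range(n_frames)])
    return frames * window     # broadcasting applies the window to every frame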

Window in time domain

MFCC

Discrete Fourier Transform Input: Windowed signal x[n]…x[m] Output: For each of N discrete frequency bands, a complex number X[k] representing the magnitude and phase of that frequency component in the original signal Discrete Fourier Transform (DFT) Standard algorithm for computing the DFT: the Fast Fourier Transform (FFT), with complexity N log N In general, choose N = 512 or 1024
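A minimal sketch, assuming NumPy: the magnitude spectrum of one windowed frame via the FFT, zero-padded to N = 512 as suggested above.

```python
import numpy as np

def magnitude_spectrum(frame, n_fft=512):
    spec = np.fft.rfft(frame, n=n_fft)   # complex X[k], k = 0 .. n_fft/2
    return np.abs(spec)                  # magnitude; np.angle(spec) would give phase
```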

Discrete Fourier Transform computing a spectrum A 25 ms Hamming-windowed signal from [iy] And its spectrum as computed by DFT (plus other smoothing)

MFCC

Mel-scale Human hearing is not equally sensitive to all frequency bands Less sensitive at higher frequencies, roughly > 1000 Hz I.e. human perception of frequency is non-linear:

Mel-scale A mel is a unit of pitch Pairs of sounds perceptually equidistant in pitch are separated by an equal number of mels Mel-scale is approximately linear below 1 kHz and logarithmic above 1 kHz
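A minimal sketch of one common form of the Hz-to-mel mapping, mel(f) = 2595 log10(1 + f/700); other variants of the formula exist in the literature.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(1000))   # ~1000 mel: the scale is roughly linear up to 1 kHz
```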

Mel Filter Bank Processing Filters are roughly uniformly spaced below 1 kHz and logarithmically spaced above 1 kHz

Mel-filter Bank Processing Apply the bank of Mel-scaled filters to the spectrum Each filter output is the sum of its filtered spectral components
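A minimal sketch, assuming the librosa library is available (librosa is not part of the slides): build a bank of 40 triangular mel-scaled filters and apply it to a frame's power spectrum with a dot product, which sums each filter's weighted spectral components.

```python
import numpy as np
import librosa

sr, n_fft, n_mels = 16000, 512, 40
mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # shape (40, 257)

def mel_energies(power_spectrum):      # power_spectrum: shape (257,)
    return mel_fb @ power_spectrum     # shape (40,): one energy per filter
```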

MFCC

Log energy computation Compute the logarithm of the squared magnitude of each Mel filter bank output

Log energy computation Why log energy? Logarithm compresses dynamic range of values Human response to signal level is logarithmic humans less sensitive to slight differences in amplitude at high amplitudes than low amplitudes Makes frequency estimates less sensitive to slight variations in input (power variation due to speaker’s mouth moving closer to mike) Phase information not helpful in speech

MFCC

The Cepstrum One way to think about this Separating the source and filter Speech waveform is created by A glottal source waveform Passes through a vocal tract which because of its shape has a particular filtering characteristic Remember articulatory facts from lecture 2: The vocal cord vibrations create harmonics The mouth is an amplifier Depending on shape of oral cavity, some harmonics are amplified more than others

Vocal Fold Vibration UCLA Phonetics Lab Demo

George Miller figure

We care about the filter not the source Most characteristics of the source F0 Details of glottal pulse Don’t matter for phone detection What we care about is the filter The exact position of the articulators in the oral tract So we want a way to separate these And use only the filter function

The Cepstrum The spectrum of the log of the spectrum (Figure: the spectrum, the log spectrum, and the spectrum of the log spectrum)

Thinking about the Cepstrum

Mel Frequency cepstrum The cepstrum requires Fourier analysis But we’re going from frequency space back to time So we actually apply inverse DFT Details for signal processing gurus: Since the log power spectrum is real and symmetric, inverse DFT reduces to a Discrete Cosine Transform (DCT)
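A minimal sketch of this DCT step, assuming SciPy and 40 log mel-filterbank energies per frame; conventions differ on whether c0 is kept or replaced by a separate energy term, and here c1..c12 are returned as on the slides.

```python
import numpy as np
from scipy.fftpack import dct

def cepstral_coefficients(log_mel_energies, n_ceps=12):
    c = dct(log_mel_energies, type=2, norm='ortho')  # DCT-II over the filter outputs
    return c[1:n_ceps + 1]                           # keep c1..c12, drop c0 here
```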

Another advantage of the Cepstrum DCT produces highly uncorrelated features If we use only the diagonal covariance matrix for our Gaussian mixture models, we can only handle uncorrelated features. In general we’ll just use the first 12 cepstral coefficients (we don’t want the later ones which have e.g. the F0 spike)

MFCC

Dynamic Cepstral Coefficient The cepstral coefficients do not capture energy So we add an energy feature

“Delta” features The speech signal is not constant (e.g. the slope of formants, the change from stop burst to release) So in addition to the cepstral features we need to model changes in the cepstral features over time: “delta” features and “double delta” (acceleration) features

Delta and double-delta Derivative: in order to obtain temporal information
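A minimal sketch of a common regression-style delta computation (the formulation used, for example, in HTK-like front ends), d_t = Σ_{n=1..N} n (c_{t+n} − c_{t−n}) / (2 Σ n²), with N = 2; applying it twice gives the double-delta features.

```python
import numpy as np

def delta(features, N=2):
    # features: (n_frames, n_dims); pad the edges by repeating the first/last frame
    padded = np.pad(features, ((N, N), (0, 0)), mode='edge')
    denom = 2 * sum(n * n for n in range(1, N + 1))
    return np.stack([
        sum(n * (padded[t + N + n] - padded[t + N - n]) for n in range(1, N + 1)) / denom
        for t in range(features.shape[0])
    ])
```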

Typical MFCC features Window size: 25ms Window shift: 10ms Pre-emphasis coefficient: 0.97 MFCC: 12 MFCC (mel frequency cepstral coefficients) 1 energy feature 12 delta MFCC features 12 double-delta MFCC features 1 delta energy feature 1 double-delta energy feature Total 39-dimensional features
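A minimal end-to-end sketch, assuming the librosa library (not part of the slides) and a hypothetical file "speech.wav": 13 MFCCs per frame, treating c0 as a rough energy term (a common approximation to the separate energy feature above), plus deltas and double-deltas for 39 dimensions. The exact values will differ from HTK-style MFCCs because librosa's defaults differ.

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)   # 25 ms window, 10 ms shift
d1 = librosa.feature.delta(mfcc)
d2 = librosa.feature.delta(mfcc, order=2)
features = np.vstack([mfcc, d1, d2])                     # shape (39, n_frames)
```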

Why is MFCC so popular? Efficient to compute Incorporates a perceptual Mel frequency scale Separates the source and filter IDFT(DCT) decorrelates the features Necessary for diagonal assumption in HMM modeling There are alternatives like PLP

Feature extraction for DNNs Mel-scaled log energy For DNN (neural net) acoustic models instead of Gaussians We don’t need the features to be decorrelated So we use mel-scaled log-energy spectral features instead of MFCCs Just run the same feature extraction but skip the discrete cosine transform.
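A minimal sketch of mel-scaled log-energy ("filterbank") features, again assuming librosa and a hypothetical "speech.wav": the same front end as for MFCCs but with the DCT step skipped.

```python
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40,
                                          n_fft=400, hop_length=160)
log_mel = np.log(mel_spec + 1e-10)    # (40, n_frames) log mel energies, no DCT
```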

Acoustic modeling of variation Variation due to speaker differences Speaker adaptation MLLR MAP Splitting acoustic models by gender Speaker adaptation approaches also solve Variation due to environment Lombard speech Foreign accent Acoustic and pronunciation adaptation to accent Variation due to genre differences Pronunciation modeling

Acoustic Model Adaptation Shift the means and variances of Gaussians to better match the input feature distribution Maximum Likelihood Linear Regression (MLLR) Maximum A Posteriori (MAP) Adaptation For both speaker adaptation and environment adaptation Widely used!

Maximum Likelihood Linear Regression (MLLR) Leggetter, C.J. and P. Woodland. 1995. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Computer Speech and Language 9:2, 171-185. Given: a trained AM a small “adaptation” dataset from a new speaker Learn new values for the Gaussian mean vectors Not by just training on the new data (too small) But by learning a linear transform which moves the means.

Maximum Likelihood Linear Regression (MLLR) Estimates a linear transform matrix (W) and bias vector (b) to transform the HMM model means: μ̂ = Wμ + b The transform is estimated to maximize the likelihood of the adaptation data Slide from Bryan Pellom

MLLR New equation for output likelihood

MLLR Q: Why is estimating a linear transform from adaptation data different from just training on the data? A: Even from a very small amount of data we can learn a single transform for all triphones! So only a small number of parameters. A2: If we have enough data, we can learn more transforms (but still far fewer than the number of triphones). One per phone (~50) is often done.
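A minimal sketch, assuming NumPy, of applying one shared MLLR transform to every Gaussian mean in the model; the point is that only W and b are estimated from the small adaptation set, not the means themselves (the function and array names are illustrative).

```python
import numpy as np

def adapt_means(means, W, b):
    # means: (n_gaussians, dim); W: (dim, dim); b: (dim,)
    return means @ W.T + b    # mu_hat = W @ mu + b, applied to every Gaussian at once
```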

MLLR: Learning Given a small labeled adaptation set (a couple of sentences) and a trained AM: Do forward-backward alignment on the adaptation set to compute the state occupation probabilities γj(t). W can then be computed by solving a system of simultaneous equations involving γj(t).

MLLR performance on baby task (RM) (Leggetter and Woodland 1995) Only 3 sentences! 11 seconds of speech!

MLLR doesn’t need a supervised adaptation set!

Slide from Bryan Pellom

Slide from Bryan Pellom after Huang et al

Summary MLLR: works on small amounts of adaptation data MAP (Maximum A Posteriori) adaptation: works well on large adaptation sets Acoustic adaptation techniques are quite successful at dealing with speaker variability, if we can get 10 seconds with the speaker.

Sources of Variability: Environment Noise at source Car engine, windows open Fridge/computer fans Noise in channel Poor microphone Poor channel in general (cellphone) Reverberation Lots of research on noise-robustness Spectral subtraction for additive noise Cepstral Mean Normalization Microphone arrays

What is additive noise? Sound pressure for two non-coherent sources: p_s: speech source; p_n: noise source; p: mixture of the speech and noise sources Slide from Kalle Palomäki, Ulpu Remes, Mikko Kurimo

What is additive noise? (Figure: examples at SNR = -6 dB, -18 dB, and -2 dB) Slide from Kalle Palomäki, Ulpu Remes, Mikko Kurimo

Additive Noise: Spectral Subtraction Find some silence in the signal, record it and compute the spectrum of the noise. Subtract this spectrum from the rest of the signal Hope that the noise is constant. There are weird artifacts of the subtraction that have to be cleaned up
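A minimal sketch of magnitude-domain spectral subtraction, assuming NumPy and that the first few frames of the short-time spectrum contain only noise; real systems add smoothing and over-subtraction rules to tame the artifacts mentioned above.

```python
import numpy as np

def spectral_subtraction(stft, n_noise_frames=10, floor=0.01):
    # stft: complex STFT of the noisy signal, shape (n_freq_bins, n_frames)
    mag = np.abs(stft)
    phase = np.angle(stft)
    # estimate the noise spectrum from frames assumed to contain only noise
    noise_mag = mag[:, :n_noise_frames].mean(axis=1, keepdims=True)
    # subtract, then floor so magnitudes never go negative
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    return clean_mag * np.exp(1j * phase)   # resynthesize with the noisy phase
```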

Additive Noise: Parallel Model Combination Best but impossible: train models with exactly the same noisy speech as the test set Instead: collect noise in the test environment and generate a noise model Combine the noise model and the clean-speech models at run time; the combination is performed on the model parameters in the cepstral domain Noise and signal are additive in the linear domain, so transform the parameters from the cepstral to the linear spectral domain (via C⁻¹ and exp) for combination, then map back (via log and C) to obtain a noisy-speech HMM Slide from Li Lin Shan 李琳山

Cepstral Mean Normalization Microphone, room acoustics, etc.: treat these as channel distortion, a linear filter h[n] convolved with the signal: y[n] = x[n] ∗ h[n] In the frequency domain: Y(k) = X(k)H(k) In the log frequency domain: log Y(k) = log X(k) + log H(k) H is constant for a given sentence, so by subtracting the mean of each cepstral coefficient over the sentence, we eliminate this constant filter
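A minimal sketch of cepstral mean normalization, assuming NumPy: subtract the per-utterance mean of each cepstral coefficient, which removes a constant (convolutional) channel filter in the log/cepstral domain.

```python
import numpy as np

def cepstral_mean_normalize(cepstra):
    # cepstra: (n_frames, n_ceps) for one utterance
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```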

Sources of Variability: Genre/Style/Task Read versus conversational speech Lombard speech Domain (Booking restaurants, dictation, or meeting summarization)

One simple example: The Lombard effect Changes in speech production in the presence of background noise Increase in: Amplitude Pitch Formant frequencies Result: intelligibility (to humans) increases

Lombard Speech Me talking over silence Me talking over Ray Charles: longer, louder, higher

Analysis of Speech Features under LE Fundamental Frequency Slides from John Hansen

Analysis of Speech Features under LE Formant Locations Slides from John Hansen

One solution to Lombard speech MLLR

Sources of Variability: Speaker Gender Dialect/Foreign Accent Individual Differences Physical differences Language differences (“idiolect”)

VTLN Speakers overlap in their phones Vowel from different speakers:

VTLN Vocal Tract Length Normalization Remember we said the vocal tract was a tube of length L If you scale the tube by a factor k, the new length is L’ = kL, and the formants are scaled by 1/k In decoding, try various values of k: warp the frequency axis linearly during the FFT computation so as to fit some “canonical” speaker, then compute MFCCs as usual Slide adapted from Chen, Picheny, Eide
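A minimal sketch of a purely linear VTLN warp of one magnitude spectrum by interpolation, assuming NumPy; the warped spectrum at frequency f takes the value of the original spectrum at α·f. Real systems typically use a piecewise-linear warp inside the filterbank and search over α per speaker; the function and parameter names here are illustrative.

```python
import numpy as np

def vtln_warp(spectrum, freqs, alpha):
    # spectrum, freqs: shape (n_bins,); np.interp clamps queries past the last bin
    return np.interp(alpha * freqs, freqs, spectrum)
```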

Acoustic Adaptation to Foreign Accent Train on accented data (if you have enough) Otherwise, combine MLLR and MAP MLLR (move the means toward native speakers) MAP (mix accented and native speakers with weights)

Variation due to task/genre Probably largest remaining source of error in current ASR I.e., is an unsolved problem Maybe one of you will solve it!

Variation due to the conversational genre Weintraub, Taussig, Hunicke-Smith, Snodgrass. 1996. Effect of Speaking Style on LVCSR Performance. SRI collected a spontaneous conversational speech corpus, in two parts: 1. A spontaneous Switchboard-style conversation on an assigned topic 2. A reading session in which participants read transcripts of their own conversations in two styles: as if they were dictating to a computer, and as if they were having a conversation

How do the 3 genres affect WER? WER on exactly the same words:
Speaking Style            Word Error
Read Dictation            28.8%
Read Conversational       37.6%
Spontaneous Conversation  52.6%
Conclusion: it’s not the words, it’s something about the pronunciation of spontaneous speech

Conversational pronunciations! Switchboard corpus I was like, “Itʼs just a stupid bug!” ax z l ay k ih s jh ah s t ey s t uw p ih b ah g HMMs built from pronunciation dictionary Actual phones don’t match dictionary sequence! I was: ax z not ay w ah z It’s: ih s not ih t s

Testing the hypothesis that pronunciation is the problem Saraclar, M, H. Nock, and S. Khudanpur. 2000. Pronunciation modeling by sharing Gaussian densities across phonetic models. Computer Speech and Language 14:137-160. “Cheating experiment” or “Oracle experiment” What is error rate if oracle gave perfect knowledge? What if you knew what pronunciation to use? Extracted the actual pronunciation of each word in Switchboard test set from phone recognizer Use that pronunciation in dictionary Baseline SWBD system WER 47% Oracle Pronunciation Dictionary 27%

Solutions We’ve tried many things: multiple pronunciations per word, decision trees to decide how a word should be pronounced. Nothing has worked well. Conclusions so far: triphones do well at minor phonetic variation; the problem is massive deletions of phones

Pronunciation modeling in current recognizers Use a single pronunciation for each word How to choose this pronunciation? Generate many pronunciations, do forced alignment on the training set, and merge similar pronunciations For each word in the dictionary: if it occurs in training, pick its most likely pronunciation; else learn mappings from seen pronunciations and apply these to unseen pronunciations

Outline Feature extraction MFCC Dealing with variation Adaptation MLLR MAP Lombard speech Foreign accent Pronunciation variation