Speech Recognition Chapter 3
Speech Front-Ends
Linear Prediction Analysis
Linear-Prediction Based Processing
Cepstral Analysis
Auditory Signal Processing
Linear Prediction Analysis
Introduction
Linear Prediction Model
Linear Prediction Coefficients Computation
Linear Prediction for Automatic Speech Recognition
Linear Prediction in Speech Processing
How Good is the LP Model
Signal Processing Front End: converts the speech waveform s(k) into some type of parametric representation O = o(1) o(2) ... o(T). Two common choices are a filterbank front end and a linear prediction front end (linear prediction coefficients).
Introduction: Over short intervals, linear prediction provides a good model of the speech signal. It is mathematically precise and simple, easy to implement in software or hardware, and works well for recognition applications. It also has applications in formant and pitch estimation, speech coding and synthesis.
Linear Prediction Model. Basic idea: each speech sample is approximated as a linear combination of the previous M samples, $s(n) \approx \sum_{k=1}^{M} a_k\, s(n-k)$; the $a_k$ are called LP (Linear Prediction) coefficients. By including the excitation signal, we obtain $s(n) = \sum_{k=1}^{M} a_k\, s(n-k) + G\,u(n)$, where $u(n)$ is the normalised excitation and $G$ is the gain of the excitation.
In the z-domain (Sec. 1.1.4, p. 15, Deller) this gives $S(z) = \sum_{k=1}^{M} a_k z^{-k} S(z) + G\,U(z)$, leading to the all-pole transfer function $H(z) = \dfrac{S(z)}{G\,U(z)} = \dfrac{1}{1 - \sum_{k=1}^{M} a_k z^{-k}}$ (Fig. 3.27).
The LP model retains the spectral magnitude, but it is minimum phase (Sec. 1.1.7, Deller). In practice, however, phase is not very important for speech perception. Observation: H(z) also absorbs the glottal filter G(z) and the lip radiation R(z).
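As a concrete illustration of the model above (not from the original slides), the following Python sketch synthesises a toy all-pole signal and predicts each sample from the previous M samples; the coefficient values and the crude pulse-train excitation are arbitrary illustrative choices.

```python
import numpy as np

# Toy "speech-like" signal: a 2-pole resonance driven by a crude pulse train.
n_samples = 400
s = np.zeros(n_samples)
s[::80] = 1.0                               # excitation pulses every 80 samples
for n in range(2, n_samples):               # all-pole synthesis: s(n) = a1*s(n-1) + a2*s(n-2) + G*u(n)
    s[n] += 1.3 * s[n - 1] - 0.7 * s[n - 2]

# Linear prediction with the same coefficients: the residual is just the excitation.
a = np.array([1.3, -0.7])                   # a_1, a_2  (M = 2)
M = len(a)
pred = np.zeros_like(s)
for n in range(M, n_samples):
    pred[n] = np.dot(a, s[n - M:n][::-1])   # sum_k a_k * s(n-k)
error = s - pred
print("residual energy / signal energy:", np.sum(error[M:]**2) / np.sum(s[M:]**2))
```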
Linear Prediction Coefficients Computation: Introduction, Methodologies.
Linear Prediction Coefficients Computation. The LP coefficients are obtained by minimising the mean squared prediction error, which leads to the following system of linear equations (Sec. 3.3.2, proof): $\sum_{k=1}^{M} a_k\, \phi(i,k) = \phi(i,0), \quad i = 1, \dots, M$, where $\phi(i,k)$ is a short-term correlation of the signal whose exact form depends on the method used.
Methodologies: the autocorrelation method and the covariance method (the latter is not commonly used in speech recognition).
Autocorrelation Method. Assumption: each frame is treated independently, i.e. the signal is zero outside the frame (Fig. 3.29). Solution (Juang, Sec. 3.3.3, pp. 105-106): $\sum_{k=1}^{M} a_k\, R(|i-k|) = R(i), \quad i = 1, \dots, M \quad (2)$, where $R(i) = \sum_{n} s(n)\,s(n+i)$ is the short-term autocorrelation and M is the number of LPC parameters. These equations are known as the Yule-Walker equations.
Using matrix notation: $\mathbf{R}\,\mathbf{a} = \mathbf{r}$, where $\mathbf{R}$ is the $M \times M$ matrix with entries $R(|i-k|)$, $\mathbf{a} = [a_1, \dots, a_M]^T$ and $\mathbf{r} = [R(1), \dots, R(M)]^T$.
Features of $\mathbf{R}$: it is symmetric, and all the elements along each diagonal are equal; it is a Toeplitz matrix.
A matrix with this structure is known as a Toeplitz matrix, and a linear system with a Toeplitz matrix can be solved very efficiently. Examples (Figs. 3.32 and 3.33). Example (Fig. 3.34). Example (Fig. 3.35). Example (Fig. 3.36).
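A minimal sketch of the autocorrelation method (not from the slides): it forms the short-term autocorrelation of a windowed frame and solves the Toeplitz system of equation (2) with scipy's solve_toeplitz, which exploits the Toeplitz structure via a Levinson-type recursion. The frame content and the order M = 8 are arbitrary example values.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_autocorrelation(frame, M=8):
    """Solve the Yule-Walker equations R a = r for one windowed frame.

    R is the symmetric Toeplitz matrix built from [R(0), ..., R(M-1)]
    and r = [R(1), ..., R(M)]."""
    R = np.array([np.dot(frame[:len(frame) - i], frame[i:]) for i in range(M + 1)])
    a = solve_toeplitz(R[:M], R[1:M + 1])   # exploits the Toeplitz structure
    return a, R

# Example with an arbitrary windowed test frame
rng = np.random.default_rng(0)
frame = np.hamming(240) * rng.standard_normal(240)
a, R = lpc_autocorrelation(frame, M=8)
print("LP coefficients:", a)
```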
Linear Prediction for Automatic Speech Recognition (front-end block diagram): preemphasis (flattens the spectrum) → frame blocking → windowing (to minimise signal discontinuity) → autocorrelation analysis (equation (2), usually M = 8) → LPC analysis (Durbin algorithm) → conversion to cepstral coefficients → parameter weighting → temporal derivative (incorporates signal dynamics, to minimise noise sensitivity).
Preemphasis. The transfer function of the glottis can be modelled as a two-pole low-pass filter, $G(z) = \dfrac{1}{(1 - \alpha z^{-1})^2}$ with $\alpha$ close to 1, and the radiation effect at the lips can be modelled as a single zero, $R(z) = 1 - \beta z^{-1}$ with $\beta$ close to 1.
The radiation zero approximately cancels one of the glottal poles. Hence, to obtain the transfer function of the vocal tract alone, the other pole must be cancelled by a preemphasis filter $P(z) = 1 - \tilde{a} z^{-1}$ with $\tilde{a}$ close to 1.
Preemphasis should be applied only to sonorant sounds. The process can be automated by choosing the coefficient adaptively as $\tilde{a} = \dfrac{R(1)}{R(0)}$, where $R(\cdot)$ is the autocorrelation function of the frame.
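A small sketch of adaptive first-order preemphasis (an illustration, not the slides' exact procedure): the coefficient is taken as R(1)/R(0) for the frame; deciding whether a frame is sonorant, and hence whether to preemphasise it at all, is left outside this sketch.

```python
import numpy as np

def preemphasize(frame, coeff=None):
    """Apply first-order preemphasis s'(n) = s(n) - a * s(n-1).

    If coeff is None, the coefficient is chosen adaptively as
    a = R(1) / R(0), the first normalised autocorrelation value."""
    if coeff is None:
        r0 = np.dot(frame, frame)
        coeff = np.dot(frame[:-1], frame[1:]) / r0 if r0 > 0 else 0.0
    out = frame.astype(float).copy()
    out[1:] -= coeff * frame[:-1]
    return out, coeff
```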
Frame blocking: frames of N samples, with a frame shift of M samples between consecutive frames.
Minimize signal discontinuities at the edges of the frames. A typical window is the Hamming window.
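A sketch of frame blocking and Hamming windowing under assumed example values (240 samples per frame, shift of 80 samples); these numbers are illustrative, not prescribed by the slides.

```python
import numpy as np

def frame_and_window(signal, frame_len=240, frame_shift=80):
    """Split the signal into overlapping frames and apply a Hamming window
    to minimise discontinuities at the frame edges."""
    window = np.hamming(frame_len)
    n_frames = max(0, 1 + (len(signal) - frame_len) // frame_shift)
    frames = [signal[i * frame_shift : i * frame_shift + frame_len] * window
              for i in range(n_frames)]
    return np.stack(frames) if frames else np.empty((0, frame_len))
```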
LPC Analysis: converts the autocorrelation coefficients into an LPC “parameter set”. Possible parameter sets: LPC coefficients, reflection (PARCOR) coefficients, log area ratio coefficients. The standard method to obtain the LPC parameter set is known as Durbin’s method.
Durbin’s method (Levinson-Durbin recursion):
$E^{(0)} = R(0)$
$k_i = \dfrac{R(i) - \sum_{j=1}^{i-1} \alpha_j^{(i-1)} R(i-j)}{E^{(i-1)}}, \quad 1 \le i \le M$
$\alpha_i^{(i)} = k_i$
$\alpha_j^{(i)} = \alpha_j^{(i-1)} - k_i\, \alpha_{i-j}^{(i-1)}, \quad 1 \le j \le i-1$
$E^{(i)} = (1 - k_i^2)\, E^{(i-1)}$
The LP coefficients are $a_j = \alpha_j^{(M)}$ and the $k_i$ are the reflection (PARCOR) coefficients.
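A direct transcription of the recursion above into Python (a sketch, not the slides' own code); it returns the LP coefficients, the reflection (PARCOR) coefficients and the final prediction error energy.

```python
import numpy as np

def durbin(R, M):
    """Levinson-Durbin recursion.

    R: autocorrelation values R(0)..R(M).
    Returns LP coefficients a_1..a_M, reflection coefficients k_1..k_M
    and the final prediction error energy E."""
    a = np.zeros(M + 1)
    k = np.zeros(M + 1)
    E = R[0]
    for i in range(1, M + 1):
        acc = R[i] - np.dot(a[1:i], R[i - 1:0:-1])   # R(i) - sum_j a_j R(i-j)
        k[i] = acc / E
        a_new = a.copy()
        a_new[i] = k[i]
        a_new[1:i] = a[1:i] - k[i] * a[i - 1:0:-1]
        a, E = a_new, (1.0 - k[i] ** 2) * E
    return a[1:], k[1:], E
```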
LPC (Typical values)
LPC Parameter Conversion: conversion to cepstral coefficients, a robust feature set for speech recognition. Algorithm (recursion from the LP coefficients):
$c_0 = \ln G^2$
$c_m = a_m + \sum_{k=1}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \quad 1 \le m \le M$
$c_m = \sum_{k=m-M}^{m-1} \frac{k}{m}\, c_k\, a_{m-k}, \quad m > M$
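A sketch of the LPC-to-cepstrum recursion shown above (illustrative code; the gain defaulting to 1.0 is an assumption).

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps, gain=1.0):
    """Convert LP coefficients a_1..a_M to cepstral coefficients c_0..c_Q
    using the recursion
        c_m = a_m + sum_{k=1}^{m-1} (k/m) c_k a_{m-k},      1 <= m <= M
        c_m =       sum_{k=m-M}^{m-1} (k/m) c_k a_{m-k},    m > M
    with c_0 = ln(gain^2)."""
    M = len(a)
    c = np.zeros(n_ceps + 1)
    c[0] = np.log(gain ** 2)
    for m in range(1, n_ceps + 1):
        acc = a[m - 1] if m <= M else 0.0
        for k in range(max(1, m - M), m):
            acc += (k / m) * c[k] * a[m - k - 1]
        c[m] = acc
    return c
```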
Parameter weighting: low-order cepstral coefficients are highly sensitive to noise, so the cepstral coefficients are weighted (liftered) before being used as features.
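One common weighting choice is the raised-sine lifter $w_m = 1 + (Q/2)\sin(\pi m / Q)$ described in the Rabiner-Juang text; the slides do not show which lifter was intended, so treat this sketch as an assumption.

```python
import numpy as np

def lifter(cepstra, Q=12):
    """Weight cepstral coefficients c_1..c_Q with a raised-sine lifter,
    de-emphasising the noise-sensitive low-order coefficients (and the
    small high-order ones)."""
    m = np.arange(1, len(cepstra) + 1)
    w = 1.0 + (Q / 2.0) * np.sin(np.pi * m / Q)
    return w * cepstra
```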
Temporal Cepstral Derivative: first- or second-order derivatives are enough to capture the signal dynamics. Given the cepstral trajectory $c_m(t)$, the derivative can be approximated by a least-squares fit over a window of neighbouring frames: $\Delta c_m(t) \approx \mu \sum_{k=-K}^{K} k\, c_m(t+k)$, where $\mu$ is a normalisation constant and $2K+1$ is the number of frames over which the fit is computed.
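A sketch of the delta-cepstrum computation above; the window half-width K = 2 and the edge-replication padding are illustrative choices.

```python
import numpy as np

def delta(cepstra, K=2):
    """First-order temporal derivative of a cepstral trajectory.

    cepstra: array of shape (T, Q) -- one row of Q coefficients per frame.
    Returns an array of the same shape with
        delta_c(t) = sum_{k=-K}^{K} k * c(t+k) / sum_{k=-K}^{K} k^2
    (frames outside the signal are replicated from the edges)."""
    T = len(cepstra)
    padded = np.concatenate([np.repeat(cepstra[:1], K, axis=0),
                             cepstra,
                             np.repeat(cepstra[-1:], K, axis=0)])
    denom = 2 * sum(k ** 2 for k in range(1, K + 1))
    out = np.zeros_like(cepstra, dtype=float)
    for t in range(T):
        out[t] = sum(k * padded[t + K + k] for k in range(-K, K + 1)) / denom
    return out
```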
Hamming windowed frame. Large prediction errors appear at the start of the frame, since the speech there is predicted from previous samples that were arbitrarily set to zero.
Unvoiced signals are not position sensitive; the error shows no special effect at the frame edges.
Observe the “whitening” phenomenon in the error spectrum.
Observe the periodicity of the error waveform; this behaviour is the basis for LP-based pitch estimators.
Observe that a sharp decrease in the prediction error is obtained for small values of M (M = 1...4), and that the unvoiced signal has a higher RMS error.
Observe the ability of the all-pole model to match the spectrum.
Linear Prediction in Speech Processing LPC for Vocal Tract Shape Estimation LPC for Pitch Detection LPC for Formant prediction
LPC for Vocal Tract Shape Estimation (block diagram): preemphasis (to remove the glottis and radiation effects) → windowing (to minimise signal discontinuity) → parameter calculation → vocal tract shape estimation.
Parameter Calculation: either Durbin’s method (as in speech recognition), in which case the autocorrelation analysis must be performed first, or a lattice filter.
Lattice Filter: the reflection coefficients are obtained directly from the signal, avoiding the autocorrelation analysis. Methods: Itakura-Saito (PARCOR), Burg, new forms. Advantage: easier to implement in hardware. Disadvantage: needs around 5 times more computation.
Itakura-Saito (PARCOR): at each lattice stage the reflection coefficient is computed as a normalised cross-correlation of the forward and backward prediction errors, accumulated over time (n). It can be shown that the PARCOR coefficients obtained with the Itakura-Saito method are exactly the same as the reflection coefficients obtained by the Levinson-Durbin algorithm. Example.
Burg: at each lattice stage the reflection coefficient is chosen to minimise the sum of the forward and backward prediction error energies. Example.
Example: comparison of the Itakura-Saito and Burg methods.
New Forms: Strobach, “New forms of Levinson and Schur algorithms”, IEEE Signal Processing Magazine, pp. 12-36, 1991.
Vocal Tract Shape Estimation. From the acoustic-tube relation between reflection coefficients and section areas, $k_m = \dfrac{A_{m+1} - A_m}{A_{m+1} + A_m}$, we obtain $A_m = A_{m+1}\,\dfrac{1 - k_m}{1 + k_m}$. Therefore, by setting the lips area to an arbitrary value we can obtain the vocal tract configuration relative to that initial condition. This technique has been successfully used to train deaf persons.
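A minimal sketch of the area computation (an illustration only): it assumes the sign convention $k_m = (A_{m+1} - A_m)/(A_{m+1} + A_m)$ used above; other texts use the opposite sign for $k_m$, in which case the ratio flips.

```python
import numpy as np

def vocal_tract_areas(k, lips_area=1.0):
    """Relative tube-section areas from reflection coefficients.

    Assumes k_m = (A_{m+1} - A_m) / (A_{m+1} + A_m), hence
    A_m = A_{m+1} * (1 - k_m) / (1 + k_m).  The lips-end area is set to an
    arbitrary value, so all areas are relative to that choice."""
    M = len(k)
    A = np.zeros(M + 1)
    A[M] = lips_area                      # section at the lips end
    for m in range(M, 0, -1):
        A[m - 1] = A[m] * (1.0 - k[m - 1]) / (1.0 + k[m - 1])
    return A                              # A[0] is toward the glottis end
```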
LPC for Pitch Detection (block diagram): speech sampled at 10 kHz → low-pass filter at 800 Hz → downsampler 5:1 → LPC analysis → inverse filtering with A(z) → autocorrelation → peak finding → voiced/unvoiced decision and pitch.
LPC for Formant Detection (block diagram): sampled speech → LPC analysis → LPC spectrum → emphasise peaks (second derivative) → peak finding → formants.
LPC Spectrum. LP assumes that the vocal tract can be modelled with an all-pole system, $H(z) = \dfrac{G}{1 - \sum_{k=1}^{M} a_k z^{-k}} = \dfrac{G}{A(z)}$. The spectrum can be obtained by evaluating this transfer function on the unit circle, $|H(e^{j\omega})| = \dfrac{G}{|A(e^{j\omega})|}$. In order to emphasise the formant peaks, we can evaluate the spectrum on a circle of radius slightly smaller than 1 (set $z = r e^{j\omega}$ with $r < 1$), which moves the evaluation contour closer to the poles.
The DTFT of the inverse-filter coefficients $\{1, -a_1, \dots, -a_M\}$ gives $A(e^{j\omega})$; the DFT samples it at a finite set of frequencies. In order to use an FFT algorithm and to increase the spectral resolution, the coefficient sequence is padded with zeros to the desired FFT length.
Calculate the spectral magnitude of the zero-padded coefficient sequence (DFT), then invert it and scale by the gain G. The resulting spectrum is called the LPC spectrum.
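A sketch of the computation just described (the FFT length of 512 is an arbitrary zero-padding choice for increased resolution):

```python
import numpy as np

def lpc_spectrum(a, gain=1.0, n_fft=512):
    """LPC spectrum: take the DFT of the zero-padded inverse-filter
    coefficients {1, -a_1, ..., -a_M}, then invert the magnitude and scale
    by the gain, giving |H(e^{jw})| = G / |A(e^{jw})|."""
    inv = np.zeros(n_fft)
    inv[0] = 1.0
    inv[1:len(a) + 1] = -np.asarray(a)
    return gain / np.abs(np.fft.rfft(inv))
```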
How Good is the LP Model? The physiological analysis of the vocal tract shows that the speech model is, in general, a pole-zero (rational) system. However, it can be shown (see the proof below) that the LP model is good for estimating the magnitude of a pole-zero system.
Proof. According to Lemma 1 and Lemma 2 below, the pole-zero system $H(z)$ can be written as $H(z) = H_{\min}(z)\, H_{ap}(z)$, the product of a minimum-phase component and an all-pass component. The LP estimates are calculated such that they correspond to the minimum-phase (all-pole) part of this model.
Since $|H_{ap}(e^{j\omega})| = 1$, we have $|H(e^{j\omega})| = |H_{\min}(e^{j\omega})|$; therefore, if the estimates are exact, we at least obtain a model with the correct magnitude.
Lemma 1 (System Decomposition): any causal rational system can be decomposed as (proof below) $H(z) = H_{\min}(z)\, H_{ap}(z)$, where $H_{\min}(z)$ is the minimum-phase component and $H_{ap}(z)$ is the all-pass component.
Proof (for a system with two poles and two zeros). Multiply and divide $H(z)$ by the conjugate-reciprocal factor of any zero lying outside the unit circle; rearranging, the factors whose poles and zeros all lie inside the unit circle form the minimum-phase component, while the remaining factor (the outside zero paired with its reflected pole) has unit magnitude on the unit circle and forms the all-pass component. End of proof.
Lemma 2: the minimum-phase component can be expressed as an all-pole system, $H_{\min}(z) = \dfrac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}$; in theory the order $p$ goes to infinity, in practice it is limited to a finite value.
Linear Prediction Based Processing: criticisms of the linear prediction model, Perceptual Linear Prediction (PLP), LP cepstra.
Criticisms of the Linear Prediction Model: the LP spectrum approximates the speech spectrum equally well at all frequencies of the analysis band. This property is inconsistent with human hearing, which has non-uniform frequency resolution.
Perceptual Linear Prediction (PLP) (block diagram): critical-band spectral analysis → equal-loudness pre-emphasis → intensity-loudness power law → IDFT → solution of the Yule-Walker equations.
Critical Band Analysis (block diagram): speech signal frame → windowing (20 ms Hamming window; 200 samples plus 56 zeros of padding for fs = 10 kHz) → DFT → short-term spectrum → critical-band spectral resolution.
Critical-Band Spectral Resolution: frequency warping (Hertz → Bark), followed by convolution with a filter bank that approximates the auditory masking curve, and downsampling.
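As one concrete example of the Hertz-to-Bark warping (the formula used in Hermansky's PLP paper; the slides do not specify which approximation was intended, so this is an assumption):

```python
import numpy as np

def hz_to_bark(f_hz):
    """Hertz -> Bark warping as in Hermansky's PLP analysis:
    Bark = 6 * asinh(f / 600)."""
    return 6.0 * np.arcsinh(np.asarray(f_hz, dtype=float) / 600.0)

print(hz_to_bark([100.0, 1000.0, 4000.0]))   # warped example frequencies
```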
Equal Loudness Pre-emphasis: approximates the non-equal sensitivity of human hearing at different frequencies.
Intensity-Loudness Power Law: approximates the non-linear relation between the intensity of a sound and its perceived loudness.
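A one-line sketch of the power law as used in PLP, where the critical-band energies are compressed with an exponent of about 0.33 (cube-root compression):

```python
import numpy as np

def intensity_to_loudness(energies):
    """Cube-root intensity-to-loudness compression (exponent ~0.33),
    approximating the non-linear intensity/loudness relation."""
    return np.power(np.asarray(energies, dtype=float), 0.33)
```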
Cepstral Analysis Introduction Homomorphic Processing Cepstral Spectrum Cepstrum Mel-Cepstrum Cepstrum in Speech Processing
Introduction: when speech is pre-emphasised, the excitation is not necessary for estimating the vocal tract function. Therefore, it is desirable to separate the excitation information from the vocal tract information.
If we think of the speech spectrum itself as a signal, we can observe that it is the product of a slowly varying component (the vocal tract envelope) and a rapidly varying component (the excitation). Therefore, we can try to exploit this structure. The formal technique that exploits it is called “homomorphic processing”.
Homomorphic Processing: a technique for filtering signals that are combined non-linearly. The non-linearly related signals are transformed into a domain where they combine linearly, filtered there with a linear filter F(z), and transformed back: H[ ] → F(z) → H⁻¹[ ].
For speech, a complex log transformation is applied to the spectrum so that the multiplicative combination becomes additive (linear): log[ ] → S+(z) → exp[ ].
Cepstral Spectrum. Definition: the log-magnitude spectrum $\log|S(\omega)|$, where $S(\omega)$ is the STFT of the signal.
Cepstrum. Definition: the inverse Fourier transform of the log-magnitude spectrum, $c(n) = \dfrac{1}{2\pi}\int_{-\pi}^{\pi} \log|S(\omega)|\, e^{j\omega n}\, d\omega$.
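A minimal sketch of computing the real cepstrum of a windowed frame (the FFT length and the small floor inside the log are illustrative choices):

```python
import numpy as np

def real_cepstrum(frame, n_fft=512):
    """Real cepstrum: inverse FFT of the log-magnitude spectrum.  Low
    quefrencies capture the vocal tract envelope; for voiced speech a peak
    appears at the quefrency of the pitch period."""
    spectrum = np.fft.rfft(frame, n_fft)
    log_mag = np.log(np.abs(spectrum) + 1e-12)   # floor avoids log(0)
    return np.fft.irfft(log_mag, n_fft)
```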
Cepstrum in Speech Processing: pitch estimation, formant estimation, combined pitch and formant estimation.
Pitch Estimation (block diagram): sampled speech → cepstrum → high-pass liftering → emphasise peaks (second derivative) → peak finding → pitch.
Formant Estimation (block diagram): sampled speech → cepstrum → low-pass liftering → emphasise peaks (second derivative) → peak finding → formants.
Pitch and Formant Estimation (block diagram): sampled speech → cepstrum, then two branches: high-pass liftering → emphasise peaks (second derivative) → peak finding → pitch; low-pass liftering → emphasise peaks (second derivative) → peak finding → formants.
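To make the pitch branch concrete, here is a crude cepstral pitch estimator along the lines of the diagram (a sketch under assumed search limits of 60-400 Hz; it omits the peak-emphasis step and any voiced/unvoiced decision):

```python
import numpy as np

def cepstral_pitch(frame, fs, fmin=60.0, fmax=400.0, n_fft=1024):
    """Estimate pitch by locating the largest cepstral peak in the quefrency
    range of plausible pitch periods (a crude form of high-pass liftering)."""
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)), n_fft)
    cep = np.fft.irfft(np.log(np.abs(spectrum) + 1e-12), n_fft)
    qmin, qmax = int(fs / fmax), int(fs / fmin)
    q = qmin + np.argmax(cep[qmin:qmax])
    return fs / q                                # pitch estimate in Hz
```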