Feature Extraction for speech applications Chapters 19-22.

The course so far Brief introduction to speech analysis and recognition for humans and machines Some basics on speech production, acoustics, pattern classification, speech units

Where to next Multi-week focus on audio signal processing Start off with the “front end” for ASR Goal: generate features good for classification Waveform is too variable Current front ends make some sense in terms of signal characteristics alone (production model) - recall the spectral envelope But analogy to perceptual system is there too A bit of this now (much more on ASR in April)

Biological analogy Essentially all ASR front ends start with a spectral analysis stage “Output” from the ear is frequency dependent Target-probe experiments going back to Fletcher (remember him?) suggest a “critical band” Other measurements also suggest similar mechanism (linear below 1kHz, log above)

Basic Idea (Fletcher) Look at response to pure tone in white noise Set tone level to just audible Reduce noise BW, initially same threshold For noise BW below critical value, audibility threshold goes down Presence or absence of tone based on SNR within the band

Feature Extraction for ASR Spectral (envelope) Analysis Auditory Model/ Normalizations

Deriving the envelope (or the excitation) [Diagram: excitation e(n) -> time-varying filter h_t(n) -> output y(n) = e(n) * h_t(n)] How can we get e(n) or h(n) from y(n)?

But first, why?
Excitation/pitch: for vocoding; for synthesis; for signal transformation; for prosody extraction (emotion, sentence end, ASR for tonal languages …); for voicing category in ASR
Filter (envelope): for vocoding; for synthesis; for phonetically relevant information for ASR
Frequency dependency appears to be a key aspect of a system that works - human audition

Spectral Envelope Estimation Filters Cepstral Deconvolution (Homomorphic filtering) LPC

Channel vocoder (analysis) [Diagram: input e(n)*h(n); analysis filters are broad w.r.t. harmonics]

Bandpass power estimation [Diagram: band-pass filter -> rectifier -> low-pass filter, with waveforms shown at points A, B, and C]

Deriving spectral envelope with a filter bank [Diagram: speech -> band-pass filters BP 1, BP 2, ..., BP N -> rectify -> low-pass filters LP 1, LP 2, ..., LP N -> decimate -> magnitude signals]

Filterbank properties Original Dudley Voder/Vocoder: 10 filters, 300 Hz bandwidth (based on # fingers!) A decade later, Vaderson used 30 filters, 100 Hz bandwidth (better) Using variable frequency resolution, can use 16 filters with the same quality

Mel filterbank Warping function B(f) = 1125 ln (1 + f/700) Based on listening experiments with pitch (mel is for “melody”)
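The warping function above can be tried out directly. The following is a minimal sketch, assuming an 8 kHz upper band edge and 10 filters purely for illustration (neither value comes from the slides); the function names are hypothetical.

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Warp frequency in Hz to mels: B(f) = 1125 ln(1 + f/700)."""
    return 1125.0 * math.log(1.0 + f_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse warping: f = 700 (exp(B/1125) - 1)."""
    return 700.0 * (math.exp(mel / 1125.0) - 1.0)

# Assumed illustration values: 10 filter centers between 0 and 8 kHz.
n_filters = 10
top_mel = hz_to_mel(8000.0)

# Equal steps on the mel axis map to unequal steps in Hz.
centers_hz = [mel_to_hz(top_mel * (i + 1) / (n_filters + 1)) for i in range(n_filters)]
spacings_hz = [b - a for a, b in zip(centers_hz, centers_hz[1:])]

print([round(c) for c in centers_hz])
print([round(s) for s in spacings_hz])
```

Because the centers are equally spaced in mels, their spacing in Hz grows with frequency, which is the variable frequency resolution referred to on the filterbank slide.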

Other warping functions Bark(f) = [26.8 /(1 + (1960/f))] - 0.53 (named after Barkhausen, proposed loudness scale) Based on critical band estimates from masking experiments ERB(f) = 21.4 log10(1+ 4.37f/1000) (Equivalent Rectangular Bandwidth) Similarly based on masking experiments, but with better estimates of auditory filter shape
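As a quick check on the two formulas above, here is a small sketch that evaluates them at a few frequencies. The function names and the probe frequencies are assumptions for illustration; the Bark formula as written requires f > 0.

```python
import math

def hz_to_bark(f_hz: float) -> float:
    """Bark(f) = 26.8 / (1 + 1960/f) - 0.53, valid for f > 0."""
    return 26.8 / (1.0 + 1960.0 / f_hz) - 0.53

def hz_to_erb_rate(f_hz: float) -> float:
    """ERB(f) = 21.4 log10(1 + 4.37 f / 1000)."""
    return 21.4 * math.log10(1.0 + 4.37 * f_hz / 1000.0)

# Assumed probe frequencies, chosen only to show the trend.
probe_hz = [100.0, 500.0, 1000.0, 2000.0, 4000.0, 8000.0]
bark_values = [hz_to_bark(f) for f in probe_hz]
erb_values = [hz_to_erb_rate(f) for f in probe_hz]

for f, b, e in zip(probe_hz, bark_values, erb_values):
    print(f"{f:7.0f} Hz  Bark {b:6.2f}  ERB-rate {e:6.2f}")
```

Both scales increase monotonically and compress the high frequencies, like the mel scale; they differ mainly in how quickly resolution falls off.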

All together now

Towards other deconvolution methods Filters seem biologically plausible Other operations could potentially separate excitation from filter Periodic source provides harmonics (close together in frequency) Filter provides broad influence (envelope) on harmonic series Can we use these facts to separate?

“Homomorphic” processing Linear processing is well-behaved Some simple nonlinearities also permit simple processing, interpretation Logarithm a good example; multiplicative effects become additive Sometimes in additive domain, parts more separable Famous example: “blind” deconvolution of Caruso recordings

IEEE Oral History Transcripts: Oppenheim on Stockham’s Deconvolution of Caruso Recordings (1)
Oppenheim: Then all speech compression systems and many speech recognition systems are oriented toward doing this deconvolution, then processing things separately, and then going on from there. A very different application of homomorphic deconvolution was something that Tom Stockham did. He started it at Lincoln and continued it at the University of Utah. It has become very famous, actually. It involves using homomorphic deconvolution to restore old Caruso recordings.
Goldstein: I have heard about that.
Oppenheim: Yes. So you know that's become one of the well-known applications of deconvolution for speech. …
Oppenheim: What happens in a recording like Caruso's is that he was singing into a horn to make the recording. The recording horn has an impulse response, and that distorts the effect of his voice, like my talking like this. [cupping his hands around his mouth]
Goldstein: Okay.

IEEE Oral History Transcripts (2)
Oppenheim: So there is a reverberant quality to it. Now what you want to do is deconvolve that out, because what you hear when I do this [cupping his hands around his mouth] is the convolution of what I'm saying and the impulse response of this horn. Now you could say, "Well why don't you go off and measure it. Just get one of those old horns, measure its impulse response, and then you can do the deconvolution." The problem is that the characteristics of those horns changed with temperature, and they changed with the way they were turned up each time. So you've got to estimate that from the music itself. That led to a whole notion which I believe Tom launched, which is the concept of blind deconvolution. In other words, being able to estimate from the signal that you've got the convolutional piece that you want to get rid of. Tom did that using some of the techniques of homomorphic filtering. Tom and a student of his at Utah named Neil Miller did some further work. After the deconvolution, what happens is you apply some high pass filtering to the recording. That's what it ends up doing. What that does is amplify some of the noise that's on the recording. Tom and Neil knew Caruso's singing. You can use the homomorphic vocoder that I developed to analyze the singing and then resynthesize it. When you resynthesize it you can do so without the noise. They did that, and of course what happens is not only do you get rid of the noise but you get rid of the orchestra. That's actually become a very fun demo which I still play in my class. This was done twenty years ago, but it's still pretty dramatic. You hear Caruso singing with the orchestra, then you can hear the enhanced version after the blind deconvolution, and then you can also hear the result after you get rid of the orchestra. Getting rid of the orchestra is something you can't do with linear filtering. It has to be a nonlinear technique.

Log processing Suppose y(n) = e(n)*h(n) Then Y(f) = E(f)H(f) And logY(f) = log E(f) + log H(f) In some cases, these pieces are separable by a linear filter If all you want is H, processing can smooth Y(f)

Source-filter separation by cepstral analysis [Diagram: windowed speech -> FFT -> log magnitude -> FFT -> time separation -> spectral function; excitation -> pitch detection]

Cepstral features Typically truncated (smooths the estimate; why?) Corresponds to spectral envelope estimation Features also are roughly orthogonal Common transformation for many spectral features Used almost universally for ASR (in some form) To reconstruct speech (without min phase assumption) need complex cepstrum
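The effect of truncation can be seen on a synthetic example. This is a rough sketch under stated assumptions, not the course's implementation: the "speech" is an impulse train (period 80 samples) driving a single two-pole resonator, the frame length, window, and `n_keep=30` are arbitrary illustration values, and all function names are hypothetical.

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-12)  # small constant avoids log(0)
    return np.fft.ifft(log_mag).real

def smoothed_log_spectrum(cepstrum, n_keep):
    """Zero all but the low-quefrency coefficients, then return to the log spectrum."""
    lifter = np.zeros(len(cepstrum))
    lifter[:n_keep] = 1.0
    if n_keep > 1:
        lifter[-(n_keep - 1):] = 1.0  # keep the mirrored half (real cepstrum is even)
    return np.fft.fft(cepstrum * lifter).real

# Assumed synthetic voiced sound: impulse train through one resonator.
period = 80
r, theta = 0.95, 2 * np.pi * 1000 / 8000
excitation = np.zeros(2048)
excitation[::period] = 1.0
signal = np.zeros_like(excitation)
for n in range(len(signal)):
    signal[n] = excitation[n]
    if n >= 1:
        signal[n] += 2 * r * np.cos(theta) * signal[n - 1]
    if n >= 2:
        signal[n] -= r * r * signal[n - 2]

frame = signal[1024:1536] * np.hamming(512)
c = real_cepstrum(frame)

raw_log_spectrum = np.fft.fft(c).real               # full cepstrum gives back the log magnitude
envelope = smoothed_log_spectrum(c, n_keep=30)      # low-time part: envelope estimate
pitch_lag = 40 + int(np.argmax(c[40:200]))          # high-time peak: excitation period estimate

print("estimated pitch period (samples):", pitch_lag)
```

The low-quefrency coefficients carry the slowly varying envelope, while the harmonic ripple from the periodic excitation shows up at high quefrency, so dropping the high-quefrency terms smooths the spectral estimate.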

An alternative: Incorporate Production Assume simple excitation/vocal tract model Assume cascaded resonators for vocal tract frequency response (envelope) Find resonator parameters for best spectral approximation

Resonator frequency response

Pole-only (complex) resonator, where r = pole magnitude, θ = pole angle
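A standard way to write a pole-only resonator with a complex-conjugate pole pair is given below. It is offered as an assumed form in terms of r and θ, not necessarily the slide's exact notation.

```latex
H(z) = \frac{1}{\left(1 - r e^{j\theta} z^{-1}\right)\left(1 - r e^{-j\theta} z^{-1}\right)}
     = \frac{1}{1 - 2 r \cos\theta \, z^{-1} + r^{2} z^{-2}}
```

In this form the resonance sits near angle θ on the unit circle, and it becomes sharper as r approaches 1.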

Error Signal

Some LPC Issues Error criterion Model order

Error Criterion

LPC Peak Modeling Total error constrained to be (at best) gain factor squared Error where model spectrum is larger contributes less Model spectrum tends to “hug” peaks

LPC spectra and error

More effects of LPC error criterion Globally tracks, but worse match in log spectrum for low values “Attempts” to model anti-aliasing filter, mic response Ill-conditioned for wide-ranging spectral values

Other LPC properties Behavior in noise Sharpness of peaks Speaker dependence

LPC Model Order Too few, can’t represent formants Too many, model detail, especially harmonics Too many, low error, ill-conditioned matrices

LPC Speech Spectra

LPC Prediction error

Optimal Model Order Akaike Information Criterion (AIC) Cross-validation (trial and error)

Coefficient Estimation Minimize squared error - set derivs to zero Compute in blocks or on-line For blocks, use autocorrelation or covariance methods (pertains to windowing, edge effects)

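Minimizing the error criterion If we take partial derivatives of the squared error with respect to each predictor coefficient and set them to zero, we get a set of linear equations, one for each coefficient, whose terms are correlation sums between versions of the signal delayed by i and j points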
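The minimization can be written out as follows. The symbols here (predictor order P, coefficients a_j, correlation sum φ(i,j)) follow a common textbook convention and are assumptions, since the slide's own notation is not preserved.

```latex
% Prediction error for an order-P linear predictor
e(n) = y(n) - \sum_{j=1}^{P} a_j \, y(n-j), \qquad E = \sum_{n} e^{2}(n)

% Set each partial derivative to zero
\frac{\partial E}{\partial a_i} = -2 \sum_{n} e(n)\, y(n-i) = 0, \qquad i = 1, \dots, P

% Resulting normal equations
\sum_{j=1}^{P} a_j \, \phi(i,j) = \phi(i,0), \qquad i = 1, \dots, P

% Correlation sum between versions of the signal delayed by i and j points
\phi(i,j) = \sum_{n} y(n-i)\, y(n-j)
```

When the sums run over a windowed frame, φ(i,j) depends only on |i - j|, which gives the Toeplitz structure used by the autocorrelation method.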

Solving the Equations Autocorrelation method: Levinson or Durbin recursions, O(P^2) ops; uses Toeplitz property (constant along left-right diagonals), guaranteed stable Covariance method: Cholesky decomposition, O(P^3) ops; just uses symmetry property, not guaranteed stable
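A minimal sketch of the autocorrelation method with the Levinson-Durbin recursion follows. It assumes the predictor convention ŷ(n) = Σ a_j y(n-j); the function names, the order of 10, and the random test frame are illustration choices, not taken from the slides.

```python
import numpy as np

def autocorrelation(frame, order):
    """R(k) for k = 0..order from a windowed frame."""
    n = len(frame)
    return np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])

def levinson_durbin(r, order):
    """Solve sum_j a_j R(|i-j|) = R(i), i = 1..order, in O(order^2) operations.

    Returns predictor coefficients a_1..a_P, reflection coefficients k_1..k_P,
    and the final prediction error energy.
    """
    a = np.zeros(order)
    k = np.zeros(order)
    err = r[0]
    for i in range(order):
        # Part of R(i+1) not yet explained by the current order-i predictor
        acc = r[i + 1] - np.dot(a[:i], r[i:0:-1])
        k[i] = acc / err
        a_prev = a[:i].copy()
        a[:i] = a_prev - k[i] * a_prev[::-1]   # update lower-order coefficients
        a[i] = k[i]                            # new highest-order coefficient
        err *= 1.0 - k[i] ** 2                 # error energy shrinks at each step
    return a, k, err

# Assumed test input: a Hamming-windowed frame of random noise.
rng = np.random.default_rng(0)
frame = rng.standard_normal(400) * np.hamming(400)
order = 10
r = autocorrelation(frame, order)
a, k, err = levinson_durbin(r, order)
print("reflection coefficients:", np.round(k, 3))
```

The reflection coefficients fall out of the recursion as a by-product, and with autocorrelations from a windowed frame their magnitudes stay below 1, which corresponds to the stability guarantee noted above.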

LPC-based representations Predictor polynomial - a_i, 1<=i<=p, direct computation Root pairs - roots of polynomial, complex pairs Reflection coefficients - recursion; interpolated values always stable (also called PARCOR coefficients k_i, 1<=i<=p) Log area ratios = ln((1-k)/(1+k)), low spectral sensitivity Line spectral frequencies - freq. pts around resonance; low spectral sensitivity, stable Cepstra - can be unstable, but useful for recognition

LPC analysis block diagram

Spectral Estimation [Table comparing Filter Banks, Cepstral Analysis, and LPC on these properties: Reduced Pitch Effects, Excitation Estimate, Direct Access to Spectra, Less Resolution at HF, Orthogonal Outputs, Peak-hugging Property, Reduced Computation; the positions of the X marks are not recoverable]

Feature Extraction for ASR Chapter 22

ASR Front End Coarse spectral representation (envelope) Coarsest for high frequencies Limitations for each basic type (filter bank, cepstrum, LPC)

Limitations for archetypes Filter banks -> correlated outputs, no focus on peaks Cepstral analysis -> uniform spectral resolution, no focus on peaks LPC -> uniform spectral resolution Solution: hybrid approaches

Two “Standards” Mel Cepstrum: Bridle (1974), Davis and Mermelstein (1980) Perceptual Linear Prediction (PLP): Hermansky, ~1985, 1990

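Mel Cepstral Analysis vs. PLP Analysis (processing steps, in order)
Preemphasis: single-zero FIR (Mel cepstrum); done in the critical-band step (PLP)
FFT, then | |^2
Critical bands: triangular (Mel cepstrum); trapezoidal (PLP)
Compression: log (Mel cepstrum); cube root (PLP)
IFFT
Smoothing: cepstral truncation (Mel cepstrum); LPC analysis (PLP)
Liftering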
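The mel cepstral chain can be sketched for a single frame as below. This is an illustrative sketch, not a reference implementation: the parameter values (pre-emphasis 0.97, 20 filters, 13 coefficients, 16 kHz, 400-sample frame) are assumptions, a DCT-II stands in for the IFFT step as is common for a real, symmetric log spectrum, liftering is omitted, and all function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 1125.0 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (np.exp(m / 1125.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    edges_hz = mel_to_hz(np.linspace(0.0, hz_to_mel(sample_rate / 2.0), n_filters + 2))
    bin_freqs = np.arange(n_fft // 2 + 1) * sample_rate / n_fft
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(n_filters):
        lo, mid, hi = edges_hz[m], edges_hz[m + 1], edges_hz[m + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        bank[m] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc(frame, sample_rate, n_filters=20, n_ceps=13, preemph=0.97):
    # 1. Pre-emphasis with a single-zero FIR filter
    emphasized = np.append(frame[0], frame[1:] - preemph * frame[:-1])
    # 2. Windowed FFT and power spectrum
    power = np.abs(np.fft.rfft(emphasized * np.hamming(len(emphasized)))) ** 2
    # 3. Critical-band-like integration with triangular mel filters
    energies = mel_filterbank(n_filters, len(frame), sample_rate) @ power
    # 4. Log compression
    log_energies = np.log(energies + 1e-12)
    # 5. DCT-II (in place of the IFFT), truncated to n_ceps coefficients
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), np.arange(n_filters) + 0.5) / n_filters)
    return basis @ log_energies

# Assumed test frame: noise plus a 440 Hz tone, 25 ms at 16 kHz.
rng = np.random.default_rng(1)
t = np.arange(400) / 16000.0
frame = rng.standard_normal(400) + np.sin(2 * np.pi * 440.0 * t)
ceps = mfcc(frame, 16000)
ceps_louder = mfcc(2.0 * frame, 16000)
print(np.round(ceps, 2))
```

Scaling the input by a constant only shifts the log energies by the same amount in every band, so it changes the zeroth coefficient and leaves the others essentially untouched.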

Perceptual Linear Prediction (PLP) [Hermansky 1990] Auditory-like modifications of short-term speech spectrum prior to its approximation by all-pole autoregressive model (or cepstral truncation in case of MFCC): critical-band spectral resolution; equal-loudness sensitivity; intensity-loudness nonlinearity. These 3 applied in virtually all state-of-the-art experimental ASR systems

Steps 2-4 of PLP

Dynamic Features Delta features - local slope in cepstrum Computed by filtering/linear regression Higher derivatives often used now Typically used in combination w/ “static” features
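One common way to compute the local slope is a least-squares regression over a few neighboring frames. The sketch below assumes a window of ±2 frames and edge padding, both illustration choices; the function name is hypothetical.

```python
import numpy as np

def delta(features, width=2):
    """Regression slope over +/- width frames; features has shape (n_frames, n_coeffs)."""
    n_frames = features.shape[0]
    # Repeat the first and last frames so every frame has a full window
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, width + 1))
    out = np.zeros(features.shape, dtype=float)
    for k in range(1, width + 1):
        ahead = padded[width + k : width + k + n_frames]
        behind = padded[width - k : width - k + n_frames]
        out += k * (ahead - behind)
    return out / denom

# Assumed test input: cepstra that rise by 0.5 per frame in every coefficient.
static = 0.5 * np.arange(20, dtype=float)[:, None] * np.ones((1, 3))
d = delta(static)
dd = delta(d)                                   # second derivative: apply the filter again
combined = np.hstack([static, d, dd])           # static + dynamic features together
print(combined.shape)
```

Applying the same filter to the deltas gives the higher derivatives, and the results are usually stacked with the static features frame by frame.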

Speaker robustness - VTLN Different vocal tract lengths -> different formant positions (e.g., male vs female) Expansion/compression can be estimated Typically use statistical modeling to optimize Can look at characteristics like pitch or 3rd formant

Acoustic (environment) robustness Convolutional error (e.g., microphone, channel spectrum) Additive noise (e.g., fans, auto engine) Limitations for typical solutions: time-invariant or slowly varying, linear, phone-independent

Key Processing Step for ASR: Cepstral Mean Subtraction Imagine a fixed filter h(n), so x(n)=s(n)*h(n) Same arguments as before, but - let s vary over time - let h be fixed over time Then average cepstra should represent the fixed component (including fixed part of s) (Think about it)

Convolutional Error
X(ω,t) = S(ω,t) H(ω,t)
|X(ω,t)|^2 = |S(ω,t)|^2 |H(ω,t)|^2
log |X(ω,t)|^2 = log |S(ω,t)|^2 + log |H(ω,t)|^2
C_X(n,t) = C_S(n,t) + C_H(n,t)

Convolutional error strategies Blind deconvolution/cepstral mean subtraction: Atal 1974 On-line method- RelAtive SpecTral Analysis (RASTA): Hermansky and Morgan, 1991

Some variants on CMS Subtract utterance mean from each cepstral coefficient Compute mean over a longer region (e.g., conversational side) Compute a running mean Use the mean from the last utterance Also divide by std deviation
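Several of these variants fit in a few lines. This is a sketch under assumptions: cepstra are stored as an array of shape (n_frames, n_coeffs), the forgetting factor 0.995 is an arbitrary illustration value, and the function names are hypothetical.

```python
import numpy as np

def cms(cepstra, normalize_variance=False):
    """Subtract the per-coefficient mean over the utterance; optionally divide by std."""
    out = cepstra - cepstra.mean(axis=0, keepdims=True)
    if normalize_variance:
        out = out / (cepstra.std(axis=0, keepdims=True) + 1e-12)
    return out

def running_cms(cepstra, alpha=0.995):
    """On-line variant: subtract an exponentially weighted running mean."""
    mean = cepstra[0].astype(float)
    out = np.empty(cepstra.shape, dtype=float)
    for t, frame in enumerate(cepstra):
        mean = alpha * mean + (1.0 - alpha) * frame
        out[t] = frame - mean
    return out

# Assumed test data: random "clean" cepstra plus a fixed channel offset.
rng = np.random.default_rng(2)
clean = rng.standard_normal((200, 13))
channel = rng.standard_normal(13)        # fixed filter: constant additive term in the cepstrum
observed = clean + channel

same_after_cms = np.allclose(cms(observed), cms(clean))
normalized = cms(observed, normalize_variance=True)
online = running_cms(observed)
print("channel removed by CMS:", same_after_cms)
```

Because a fixed channel adds the same vector to every frame's cepstrum, subtracting the mean removes it, along with the long-term average of the speech itself.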

“Standard” RASTA

Some of the proposed improvements to RASTA Run backwards and forwards in time (gets rid of phase in transfer fn) Train filter on data (discriminative RASTA) Use multiple filters Use in combination with Wiener filtering

Long-time convolution Reverberation has effects beyond the typical analysis frame Can do log spectral subtraction w/ long frames Alternatively, smear system training data to improve match to temporal smearing in test In practice, this is an unsolved problem (especially when noise is present, i.e., always)

Additive noise (stationary) Subtract off noise spectral estimate Need a noise estimate Use a second microphone if you have it

Wiener filter / spectral subtraction Assume that X = S + N (suppressing freq dep in notation) If uncorrelated, |X|^2 = |S|^2 + |N|^2 (PSDs) |S_est|^2 = |X|^2 - |N_est|^2, or |H|^2 = 1 - |N|^2 / |X|^2 If no channel effect, Wiener filter is H = |S|^2 / (|S|^2 + |N|^2) So Wiener filter is H = 1 - |N|^2 / |X|^2 Similar effect but for exponents In practice many variants - also smoothing to avoid “musical noise”

Just Suppose … What if, for some ω, |N_est|^2 > |X|^2? Then |S_est|^2 = |X|^2 - |N_est|^2 is negative But if it is a PSD … So, what should we do?
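One widely used answer, shown here as a hedged sketch rather than the course's prescribed solution, is to floor the estimate at a small fraction of the observed power so that it never goes negative. The floor value 0.01 and the function names are assumptions for illustration.

```python
import numpy as np

def spectral_subtract(x_power, noise_power, floor=0.01):
    """|S_est|^2 = |X|^2 - |N_est|^2, floored at a fraction of |X|^2."""
    return np.maximum(x_power - noise_power, floor * x_power)

def wiener_gain(x_power, noise_power, floor=0.01):
    """H = 1 - |N|^2 / |X|^2, floored so the gain stays positive."""
    return np.maximum(1.0 - noise_power / np.maximum(x_power, 1e-12), floor)

# Assumed toy spectra: three frequency bins with a flat noise estimate.
x_power = np.array([4.0, 1.0, 0.5])
noise_power = np.array([1.0, 1.0, 1.0])

s_est = spectral_subtract(x_power, noise_power)
gain = wiener_gain(x_power, noise_power)
print(s_est, gain)
```

In bins where the noise estimate exceeds the observed power, the floor takes over; a hard floor like this is one source of the isolated spectral peaks heard as "musical noise", which is why smoothing is often added.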

[Audio demos: piano with noise; piano with noise and Wiener filtering]

ETSI standard: AFE Aurora competition AFE = “Advanced Front End” Noise est., Wiener filtering, done twice Emphasis on high SNR parts of waveform Other methods did well later (e.g., Ellis – Tandem [MLP+HMM], 2 streams, PLP+MSG)

Modulation-filtered SpectroGram (MSG) Kingsbury, 1998 Berkeley PhD thesis

Noise and convolution Can use a different form of RASTA: “J-RASTA” Filters log-like function of spectrum: f(x) = log(1 + Jx), where J ∝ 1/(noise power) Many other methods (primarily statistical) None lower word error rates to clean levels

Noise and convolution - other compensation methods Given “stereo” data, find additive vector to best match the cepstra Get data from multiple testing environments/microphones, find best match Vector Taylor Series methods (approx effect on cepstra of noise, convolution) SPLICE (Stereo-based Piecewise LInear Compensation for Environments) methods Or else, adaptation of stat model

Noise and convolution - what would we really want? For online case, would like to be insensitive to noise and convolutional errors Would like to do this without needing known noise regions People can do this So - study auditory system?

“Auditory” properties in speech “front ends” Nonlinear spacing/bandwidth for filter bank Compression (log for MFCC, cube root for PLP) Preemphasis/equal loudness Smoothing for envelope estimate Insensitivity to constant spectral multiplier

Auditory Models Shifting definitions Typically means whatever we aren’t using yet Example: Ensemble Interval Histogram (EIH) -> looking for coherence across bands of histogram of threshold crossings

Seneff Auditory Model

Auditory Models (cont.) Representation of cochlear output - e.g., the cochleagram Representation of temporal information - the correlogram (particularly for pitch) - shows autocorrelation function for each spectral component; i.e., frequency vs lag

Correlogram example 1

Other perspectives Temporal information in individual bands (TRAPS/HATS) Spectro-Temporal Receptive Fields (models from ferret brain experiments) Multiple mappings for greater robustness, including more “sluggish” features

ASR Systems are half-deaf Phonetic classification is very poor (even in low-noise conditions) Success is due to constraints (domain, speaker, noise-canceling mic, etc.) These constraints can mask the underlying weakness of the technology

Pushing the envelope (aside) Problem: Spectral envelope is a fragile information carrier Solution: Probabilities from multiple time-frequency patches [Diagram along a time axis. OLD: 25 ms window (stepped by 10 ms) -> estimate of sound identity. NEW: i-th, k-th, and n-th estimates from patches of up to 1 s -> information fusion -> estimate of sound identity]

Multi-rate features [Diagram labels: Narrowband 500 ms (HATS), 13 overlapping spectral slices; Broadband 100 ms (Tandem), 9 frames, PLP cepstra; Broadband 25 ms, 1 frame, PLP cepstra; MLP; posteriors; combine; concatenate features]

Multiple microphones Array approaches for beamforming “Distant” second microphone for noise estimate -> use cross-correlation to derive transfer fn for noise to get from noise sensor to signal sensor

The Rest of the System Focus has been on features Feature choice affects statistics Noise/channel robustness strategies often focus on the statistical models For now, we will focus on a deterministic view - later, deterministic ASR (ch 24) First, pitch and general audio – chap. 16, 31, 35, 37, 39

End - feature extraction; on to DTW …
