Published by Darlene Shepherd. Modified over 8 years ago.
1
Applications of THE MODULATION SPECTRUM for Speech Engineering
Hynek Hermansky
IDIAP, Martigny, Switzerland
Swiss Federal Institute of Technology, Lausanne, Switzerland
Additional material: http://people.idiap.ch/hynek, www.silicon-speech.com/modspec
2
Motivation
Linear model of speech: a source signal is modulated by the changing transfer function of the vocal tract.
[Figure: block diagram with labels: source, filter (vocal tract shape), speech spectrum]
3
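Over 0 to 20 ms, the signal amplitude is predictable because of the resonance properties of the vocal tract.
Over 0 to 2000 ms, the temporal evolution of the spectral envelope is predictable because of the inertia of the vocal organs.
[Figure: signal amplitude vs. time (0 to 20 ms) and frequency vs. time (0 to 2000 ms)]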
4
Modulation Spectrum of Speech
The dominant frequency of change in speech is around 4 Hz.
Processing the relevant region of the modulation spectrum requires access to longer temporal segments of the signal.
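The following is a minimal sketch of how a modulation spectrum can be estimated, not the method used to produce the slide. It assumes a synthetic test signal (a noise carrier amplitude-modulated at 4 Hz), a 10 ms energy frame, and a 4 s analysis segment; the function name `modulation_spectrum` and all parameter values are illustrative. The 4 s segment gives a modulation-frequency resolution of 0.25 Hz, which illustrates why long segments are needed to resolve the region around 4 Hz.

```python
import numpy as np

def modulation_spectrum(signal, fs, frame_ms=10.0):
    """Return (modulation frequencies in Hz, magnitude spectrum of the energy envelope)."""
    frame = int(fs * frame_ms / 1000.0)
    n_frames = len(signal) // frame
    # Short-time energy envelope: one value per frame.
    env = (signal[: n_frames * frame].reshape(n_frames, frame) ** 2).mean(axis=1)
    env = env - env.mean()  # remove the DC term so it does not dominate
    spec = np.abs(np.fft.rfft(env * np.hanning(n_frames)))
    freqs = np.fft.rfftfreq(n_frames, d=frame_ms / 1000.0)
    return freqs, spec

# Synthetic stand-in for speech (assumption): noise modulated at 4 Hz.
fs = 8000
t = np.arange(4 * fs) / fs  # 4 s segment
rng = np.random.default_rng(0)
carrier = rng.standard_normal(t.size)
signal = (1.0 + 0.8 * np.sin(2 * np.pi * 4.0 * t)) * carrier

freqs, spec = modulation_spectrum(signal, fs)
peak_hz = freqs[np.argmax(spec)]
print("dominant modulation frequency [Hz]:", peak_hz)
```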
5
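RASTA Processing
The vocal tract cannot move too slowly or too fast.
The trajectory of a parameter, as estimated from the signal, is passed through a filter that suppresses changes slower or faster than the norm and yields a corrected trajectory.
[Figure: parameter trajectory over time, filter, corrected trajectory; regions marked "rate of change < norm" and "rate of change > norm"]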
6
RASTA filter
[Figure: frequency response (attenuation [dB] vs. modulation frequency [Hz]) and impulse response (time, 0 to 300 ms)]
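The slide shows only the response plots. As a hedged illustration, the sketch below uses the commonly cited RASTA transfer function, H(z) = 0.1 (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1), applied to a parameter trajectory sampled at 100 frames per second. The coefficients, the frame rate, and the function names are assumptions, not values taken from the slide.

```python
import numpy as np

# Assumed coefficients of the commonly cited RASTA filter (not from the slide).
NUM = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # FIR numerator
POLE = 0.98                                         # single real pole

def rasta_filter(traj):
    """Band-pass filter one parameter trajectory (one value per frame)."""
    fir = np.convolve(traj, NUM)[: len(traj)]
    out = np.empty(len(traj))
    state = 0.0
    for n, x in enumerate(fir):
        state = x + POLE * state  # y[n] = fir[n] + POLE * y[n-1]
        out[n] = state
    return out

def gain(mod_hz, frame_rate=100.0):
    """Magnitude response at a modulation frequency (frame_rate is an assumption)."""
    z_inv = np.exp(-1j * 2 * np.pi * mod_hz / frame_rate)
    num = np.sum(NUM * z_inv ** np.arange(len(NUM)))
    return abs(num / (1.0 - POLE * z_inv))

# A constant offset (infinitely slow change) decays to zero at the output.
offset_out = rasta_filter(np.full(1000, 5.0))
print("gain at 0 Hz:", gain(0.0), " gain at 4 Hz:", gain(4.0))
```

The numerator coefficients sum to zero, so a constant component of the trajectory is removed, while changes near 4 Hz pass with gain close to one.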
7
Cortical receptive field
The time-frequency distribution of the linear component of the most efficient stimulus that excites the given auditory neuron.
It has a relatively long temporal extent and a limited frequency range.
8
Examples of cortical receptive fields (courtesy of Shihab Shamma)
9
Temporal extent of cortical receptive fields
[Figure, time axis 0 to 300 ms: impulse response of the optimized RASTA filter; average of the first two principal components (83% of variance) along the temporal axis from about 180 cortical receptive fields (from D. Klein, unpublished)]
10
Deriving temporal basis (FIR filters)
(with Sarel van Vuuren (1998) and Fabio Valente (2005))
Linear discriminant analysis on phoneme-labeled data (hundreds of hours).
Discriminant matrix: a set of FIR RASTA filters.
[Figure: time-frequency representation of a phoneme-labeled utterance]
11
LDA in time
Impulse responses: about 250 ms long.
Frequency responses: bandpass, about 1 to 15 Hz.
[Figure: impulse responses (amplitude vs. time [ms]) and frequency responses (log magnitude spectrum [dB] vs. modulation frequency [Hz])]
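A minimal sketch of how discriminant vectors can be read as FIR filters is given below. It is not the procedure used on the slides: it assumes synthetic labeled windows of 25 frames (about 250 ms at an assumed 100 frames per second), three hypothetical classes, and a plain Fisher LDA solved with NumPy. Each row of the returned matrix is one impulse response applied along time.

```python
import numpy as np

def lda_fir_filters(windows, labels, n_filters=2):
    """windows: (n_examples, n_taps) temporal trajectories centered on labeled frames."""
    n_taps = windows.shape[1]
    mean_all = windows.mean(axis=0)
    s_w = np.zeros((n_taps, n_taps))  # within-class scatter
    s_b = np.zeros((n_taps, n_taps))  # between-class scatter
    for c in np.unique(labels):
        x = windows[labels == c]
        mu = x.mean(axis=0)
        s_w += (x - mu).T @ (x - mu)
        d = (mu - mean_all)[:, None]
        s_b += len(x) * (d @ d.T)
    s_w += 1e-6 * np.trace(s_w) / n_taps * np.eye(n_taps)  # light regularization
    evals, evecs = np.linalg.eig(np.linalg.solve(s_w, s_b))
    order = np.argsort(-evals.real)[:n_filters]
    return evecs.real[:, order].T  # each row is one FIR impulse response

# Synthetic stand-in for phoneme-labeled trajectories (assumption).
rng = np.random.default_rng(0)
n_taps, n_per_class = 25, 200
k = np.arange(n_taps)
bump = np.exp(-0.5 * ((k - 12) / 3.0) ** 2)
patterns = [bump, -bump, np.gradient(bump)]
windows = np.vstack([p + rng.standard_normal((n_per_class, n_taps)) for p in patterns])
labels = np.repeat(np.arange(3), n_per_class)

filters = lda_fir_filters(windows, labels, n_filters=2)
proj = windows @ filters[0]  # output of the first filter at the window center
class_means = np.array([proj[labels == c].mean() for c in range(3)])
within = np.mean([proj[labels == c].var() for c in range(3)])
fisher_ratio = class_means.var() / within
print("filter bank shape:", filters.shape, " Fisher ratio:", fisher_ratio)
```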
12
Production, perception, engineering
Speech evolved to be perceived: human perception as the optimal receiver of speech.
Good engineering may be consistent with biology.
[Figure labels: signal, effect, effect(signal)]
13
Coarticulation (inertia of organs of speech production) and human auditory perception
[Figure: phoneme-labeled utterance illustrating coarticulation]
14
Emulation of cortical processing (MRASTA)
Peripheral processing: critical-band spectral analysis.
At each time t0, 32 2-D projections with variable resolutions are computed for each band: 32 x 14 bands = 448 projections.
[Figure: time-frequency plane feeding projection sets data 1, data 2, ..., data N]
15
MRASTA emulation of cortical receptive fields (multi-RASTA)
448 outer products: 16 different temporal functions and 2 different frequency functions at 14 different frequencies.
All projections have zero mean, which gives robustness to linear distortions.
[Figure: one example of the 2-D projection (out of 448 possible); time axis -500 to 500 ms, frequency axis spanning bands f_c-1, f_c, f_c+1]
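The sketch below illustrates why zero-mean projections are insensitive to a fixed linear distortion, which appears as a constant offset in a log-spectral trajectory. It assumes the 16 temporal functions are first and second derivatives of Gaussians at 8 widths, sampled every 10 ms over -500 to 500 ms. The widths, the frequency kernel, and all names are illustrative assumptions, not values read from the slide.

```python
import numpy as np

def gaussian_derivative_bank(sigmas_ms, step_ms=10.0, half_span_ms=500.0):
    """Return zero-mean temporal filters: two per width (1st and 2nd Gaussian derivative)."""
    t = np.arange(-half_span_ms, half_span_ms + step_ms, step_ms)
    bank = []
    for s in sigmas_ms:
        g = np.exp(-0.5 * (t / s) ** 2)
        d1 = -t / s**2 * g                      # first derivative of the Gaussian
        d2 = (t**2 / s**4 - 1.0 / s**2) * g     # second derivative of the Gaussian
        for h in (d1, d2):
            h = h - h.mean()                    # enforce exactly zero mean after sampling
            bank.append(h / np.abs(h).sum())
    return np.array(bank)

sigmas_ms = np.geomspace(8.0, 130.0, 8)         # hypothetical widths
bank = gaussian_derivative_bank(sigmas_ms)      # 16 temporal functions, 101 taps each

# A constant offset in the trajectory does not change any filter output.
rng = np.random.default_rng(0)
traj = rng.standard_normal(300)                 # stand-in log-energy trajectory of one band
h = bank[5]
y_clean = np.convolve(traj, h, mode="valid")
y_offset = np.convolve(traj + 3.0, h, mode="valid")

# One 2-D projection as an outer product with a hypothetical 3-band frequency kernel.
freq_kernel = np.array([-1.0, 0.0, 1.0])
projection_2d = np.outer(freq_kernel, h)
print(bank.shape, projection_2d.shape, np.allclose(y_clean, y_offset))
```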
16
Alternative 2-D bases
Actual measured spectro-temporal cortical receptive fields from ferret (Nima Mesgarani).
Outer products of variable-length Gabor functions (Michael Kleinschmidt et al.).
[Figure labels: waveform, PLPs, STRF bank, time]
17
Peripheral-like processing, then higher-level (cortical-like) processing, then a classifier (trained neural net) that outputs phoneme posteriors.
Input: 448 time-frequency projections (MRASTA) spanning about 1 s.
Classifier: feed-forward neural net.
The most efficient (smallest) set of features is the set of class posteriors (e.g. Fukunaga).
[Figure: posteriogram (phoneme class vs. time) for the digit string "one nine three seven two"]
18
TANDEM Features for Conventional HMM System
Pipeline: measurements; pre-processing; neural network trained to estimate posteriors of speech classes; pre-softmax outputs; principal component projection; HMM.
[Figure: histogram of one feature and correlation matrix of features]
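The principal component projection step can be sketched as follows. This is an illustration under assumptions: the pre-softmax outputs are replaced by synthetic correlated data, and the function name `tandem_features` and the number of retained components are hypothetical. The point is that the projected features are decorrelated, which suits an HMM with diagonal-covariance Gaussians.

```python
import numpy as np

def tandem_features(pre_softmax, n_keep):
    """Project pre-softmax network outputs onto their leading principal components."""
    centered = pre_softmax - pre_softmax.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)          # symmetric matrix, real eigenvectors
    order = np.argsort(evals)[::-1][:n_keep]    # largest variance first
    return centered @ evecs[:, order]

# Synthetic stand-in for network outputs (assumption): 500 frames, 10 classes.
rng = np.random.default_rng(0)
mixing = rng.standard_normal((10, 10))
pre_softmax = rng.standard_normal((500, 10)) @ mixing

feats = tandem_features(pre_softmax, n_keep=6)
cov = np.cov(feats, rowvar=False)
max_off_diag = np.abs(cov - np.diag(np.diag(cov))).max()
print(feats.shape, "largest off-diagonal covariance:", max_off_diag)
```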
19
Speech intelligibility
The speech modulation spectrum peaks around 4 Hz.
Reverberation acts as a low-pass filter in the modulation domain.
Noise reduces the depth of the modulation.
Speech Transmission Index (STI) [Houtgast and Steeneken 1980]: test signal at seven different carrier frequencies with varying modulation frequencies.
[Figure: spectral envelope trajectory (vs. time [s]) and envelope spectrum (vs. modulation frequency [Hz], 1 to 16 Hz) for original, reverberant, and noisy speech; figure from H.M.N. Steeneken: The Measurement of Speech Intelligibility, 1998]
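The two effects named on this slide can be made concrete with the closed-form modulation transfer expression commonly used in room acoustics, m(F) = [1 + (2 pi F T / 13.8)^2]^(-1/2) * [1 + 10^(-SNR/10)]^(-1), where T is the reverberation time in seconds. The formula is not shown on the slide, so treat it and the parameter names below as assumptions made for illustration.

```python
import math

def modulation_transfer(mod_hz, rt60_s, snr_db=None):
    """Modulation reduction factor for exponential reverberation plus stationary noise."""
    # Reverberation term: low-pass in the modulation domain.
    reverb = 1.0 / math.sqrt(1.0 + (2.0 * math.pi * mod_hz * rt60_s / 13.8) ** 2)
    # Noise term: reduces modulation depth equally at all modulation frequencies.
    noise = 1.0 if snr_db is None else 1.0 / (1.0 + 10.0 ** (-snr_db / 10.0))
    return reverb * noise

mod_freqs = (1, 4, 8, 16)
reverb_only = [modulation_transfer(f, rt60_s=1.0) for f in mod_freqs]
noise_only = [modulation_transfer(f, rt60_s=0.0, snr_db=0.0) for f in mod_freqs]

for f, r, n in zip(mod_freqs, reverb_only, noise_only):
    print(f"{f:2d} Hz  reverberant: {r:.3f}  noisy: {n:.3f}")
```

With reverberation alone the factor falls as the modulation frequency rises; with noise alone it is the same at every modulation frequency.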
20
Spectro-Temporal Modulation Index (STMI), M. Elhilali 2005
2-D "ripple" signals instead of modulated sinusoids.
[Figure: ripple signal, frequency (1 to 16 kHz) vs. time (0 to 250 ms), modulation depth ∆A, w = 4 Hz]
21
Data for modulation spectrum processing
[Figure: time-frequency plane with marked spans: 10-20 ms and 200-1000 ms in time, 1-3 Bark in frequency]
22
PEMO Analysis (Göttingen, Oldenburg)
23
Frequency-domain linear prediction (with Marios Athineos and Dan Ellis, 2003)
All-pole model of part of the time-frequency plane (200-1000 ms in time, 1-3 Bark in frequency).
24
All-pole Model of Temporal Trajectory of Spectral Energy
Conventional LP: the signal; signal power spectrum; all-pole model of the power spectrum.
Spectral-domain LP: DCT of the signal; Hilbert envelope of the signal; all-pole model of the Hilbert envelope.
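The duality on this slide can be sketched in a few lines: linear prediction applied to the DCT of a signal yields an all-pole model whose "spectrum" runs along time and approximates the squared Hilbert envelope. The code below is an illustration under assumptions, not the authors' implementation. It uses an unnormalized DCT-II, the autocorrelation method with a direct Toeplitz solve, a model order of 20, and a synthetic signal whose energy burst is centered at sample 300.

```python
import numpy as np

def dct2(x):
    """Unnormalized DCT-II computed with an explicit cosine matrix."""
    n = len(x)
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    return np.cos(np.pi * (m + 0.5) * k / n) @ x

def fdlp_envelope(x, order=20):
    """All-pole estimate of the temporal (squared Hilbert) envelope of x."""
    n = len(x)
    c = dct2(x)
    # Autocorrelation of the DCT sequence, lags 0..order.
    r = np.array([c[: n - i] @ c[i:] for i in range(order + 1)])
    idx = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    a = np.linalg.solve(r[idx], r[1:])           # normal equations R a = r
    err = r[0] - a @ r[1:]                       # prediction error power
    # Evaluate the model at angles that correspond to sample times.
    theta = np.pi * (np.arange(n) + 0.5) / n
    lags = np.arange(1, order + 1)
    denom = 1.0 - np.exp(-1j * np.outer(theta, lags)) @ a
    return err / np.abs(denom) ** 2

# Synthetic test signal (assumption): tone with a Gaussian energy burst at sample 300.
n = 800
t = np.arange(n)
envelope_true = 0.05 + np.exp(-0.5 * ((t - 300) / 30.0) ** 2)
x = envelope_true * np.cos(0.2 * np.pi * t)

env_model = fdlp_envelope(x, order=20)
peak_index = int(np.argmax(env_model))
print("modeled envelope peaks at sample", peak_index)
```

A unit impulse at sample n0 has the DCT cos(pi (n0 + 0.5) k / N), a sinusoid in k, which is why a pole angle of the model maps directly to a position in time.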
25
All-pole Models Of Sub-band Energy Contours
The signal is transformed by the discrete cosine transform. Prediction over the low-frequency part of the DCT gives an all-pole model of the low-frequency Hilbert envelope; prediction over the high-frequency part gives an all-pole model of the high-frequency Hilbert envelope.
26
Critical-band Spectrum From FFT
Critical-band Spectrum From All-pole Models Of Hilbert Envelopes in Critical Bands
[Figure: two time-frequency plots with axes time and tonality]
27
Modulation spectrum processing in speech technology
Feature extraction in ASR
Historical: H. Hermansky, "Speech beyond 20 ms: Speech Processing in Temporal Domain," invited keynote lecture, in Proceedings of the International Workshop on Human Interface Technology, Aizu, Japan, 1994.
Recent: F. Valente and H. Hermansky, "Hierarchical Neural Networks Feature Extraction for LVCSR System," in Proceedings of the International Conference on Spoken Language Processing, Antwerp, Belgium, 2007.
Speech intelligibility
Historical: T. Houtgast and H.J.M. Steeneken, "The modulation transfer function in room acoustics as a predictor of speech intelligibility," Acustica 28: 66-73, 1973.
Recent: M. Elhilali, "Neural Basis and Computational Strategies for Auditory Processing," PhD thesis, University of Maryland, 2004.
Speech coding
Historical: M. Athineos, H. Hermansky, and D. P. W. Ellis, "LP-TRAP: Linear Predictive Temporal Patterns," Proc. ICSLP, pp. 1154-1157, Jeju, S. Korea, October 2004.
Recent: P. Motlicek, V. Ullal, and H. Hermansky, "Wide-Band Perceptual Audio Coding based on Frequency-Domain Linear Prediction," in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, HI, 2007.
28
Thank You
Conclusion of Part 3 (of 3)
THE MODULATION SPECTRUM And Its Application to Speech Science and Technology