Applications of THE MODULATION SPECTRUM for Speech Engineering
Hynek Hermansky
IDIAP, Martigny, Switzerland
Swiss Federal Institute of Technology, Lausanne, Switzerland


Applications of THE MODULATION SPECTRUM for Speech Engineering
Hynek Hermansky, IDIAP, Martigny, Switzerland; Swiss Federal Institute of Technology, Lausanne, Switzerland
Additional material:

Motivation: Linear Model of Speech
[Figure: source signal → filter (vocal tract shape) → speech spectrum]
The source signal is modulated by the changing transfer function of the vocal tract.

[Figure: 0-20 ms signal waveform; time-frequency plot of the evolving spectral envelope]
The signal is predictable because of the resonance properties of the vocal tract; the temporal evolution of the spectral envelope is predictable because of the inertia of the vocal organs.

Modulation Spectrum of Speech
The dominant frequency of change in speech is around 4 Hz. To process the relevant area of the modulation spectrum, access to longer temporal segments of the signal is necessary.
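As a minimal sketch (not the exact analysis behind the slide), the modulation spectrum of one band can be estimated by Fourier-transforming the log-energy trajectory of that band; the synthetic trajectory below, modulated at the syllabic rate of 4 Hz, shows its peak there:

```python
import numpy as np

def modulation_spectrum(envelope, frame_rate):
    """Magnitude spectrum of a band-energy trajectory.

    envelope   : energy trajectory of one frequency band, one value per frame
    frame_rate : frames per second (e.g. 100 for a 10 ms frame step)
    """
    env = np.log(envelope + 1e-12)
    env = env - env.mean()                  # remove DC before the FFT
    spec = np.abs(np.fft.rfft(env))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / frame_rate)
    return freqs, spec

# Synthetic band-energy trajectory: 4 Hz modulation at a 100 Hz frame rate.
# 3 s of frames: note that resolving a 4 Hz peak needs a long segment.
frame_rate = 100
t = np.arange(300) / frame_rate
envelope = 1.0 + 0.5 * np.sin(2 * np.pi * 4.0 * t)
freqs, spec = modulation_spectrum(envelope, frame_rate)
peak = freqs[np.argmax(spec)]
print(peak)   # 4.0 (the modulation rate)
```

The 3 s trajectory illustrates the slide's point: the frequency resolution of the modulation spectrum is the reciprocal of the segment length, so seeing the region around 4 Hz requires temporal context far beyond a single short-time frame.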

RASTA Processing
The vocal tract cannot move too slowly or too fast. Components of a parameter trajectory, as estimated from the signal, whose rate of change is below or above this norm are filtered out, yielding a corrected trajectory.

[Figure: RASTA filter frequency response, attenuation [dB] vs. modulation frequency [Hz], and impulse response over 0-300 ms]

Cortical Receptive Field
The time-frequency distribution of the linear component of the most efficient stimulus that excites a given auditory neuron. It has a relatively long temporal extent and a limited frequency range.

Examples of cortical receptive fields (courtesy of Shihab Shamma)

Temporal Extent of Cortical Receptive Fields
[Figure: impulse response of the optimized RASTA filter (0-300 ms) compared with the average of the first two principal components (83% of variance) along the temporal axis of about 180 cortical receptive fields (from D. Klein, unpublished)]

Deriving a Temporal Basis (FIR Filters) (with Sarel van Vuuren, 1998, and Fabio Valente, 2005)
Linear discriminant analysis is applied to phoneme-labeled time-frequency data (hundreds of hours). The resulting discriminant matrix is a set of FIR RASTA filters.

LDA in Time
[Figure: amplitude of the impulse responses (time [ms]) and log magnitude spectra of the frequency responses (modulation frequency [Hz])]
The impulse responses span about 250 ms; the frequency responses are band-pass, roughly 1-15 Hz in the modulation-frequency domain.
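A toy version of this procedure, with two synthetic classes of trajectory windows standing in for the phoneme-labeled corpus (the class patterns and sizes below are invented for illustration):

```python
import numpy as np

def lda_temporal_basis(X, labels, n_filters=3):
    """Derive FIR-like temporal filters by LDA on labeled trajectory windows.

    X      : (n_examples, n_taps) windows of one band's energy trajectory
    labels : class (e.g. phoneme) label per example
    Returns the leading LDA directions; each column is one FIR filter.
    """
    mean_all = X.mean(axis=0)
    n_taps = X.shape[1]
    Sw = np.zeros((n_taps, n_taps))   # within-class scatter
    Sb = np.zeros((n_taps, n_taps))   # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        d = (mc - mean_all)[:, None]
        Sb += len(Xc) * (d @ d.T)
    # Leading eigenvectors of Sw^{-1} Sb (small ridge for stability)
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(n_taps), Sb))
    order = np.argsort(-evals.real)
    return evecs[:, order[:n_filters]].real

rng = np.random.default_rng(0)
p0 = np.sin(np.linspace(0, np.pi, 9))   # class-0 temporal pattern (invented)
p1 = np.cos(np.linspace(0, np.pi, 9))   # class-1 temporal pattern (invented)
X = np.vstack([p0 + 0.1 * rng.standard_normal((50, 9)),
               p1 + 0.1 * rng.standard_normal((50, 9))])
labels = np.array([0] * 50 + [1] * 50)
W = lda_temporal_basis(X, labels, n_filters=1)
proj = X @ W[:, 0]                       # filtering = projecting each window
print(abs(proj[:50].mean() - proj[50:].mean()) > 3 * proj[:50].std())  # True
```

Each column of the discriminant matrix is applied to trajectory windows exactly like an FIR filter tap vector, which is why the slide can read the LDA result as a set of data-derived RASTA filters.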

Production, Perception, Engineering
Speech evolved to be perceived: human perception is the optimal receiver of speech, so good engineering may be consistent with biology.

Coarticulation
[Figure: spectrogram of "hello world" with phoneme labels]
Coarticulation (the inertia of the organs of speech production) is matched by human auditory perception.

Emulation of Cortical Processing (MRASTA)
Peripheral processing (critical-band spectral analysis) is followed, in each critical band, by 32 two-dimensional projections with variable resolutions: 32 projections × 14 bands = 448 projections.

MRASTA Emulation of Cortical Receptive Fields
[Figure: one example of the 2-D projections (out of 448 possible), centered on critical band f_c with neighbours f_c−1 and f_c+1]
448 outer products: 16 different temporal functions and 2 different frequency functions at 14 different frequencies. All projections have zero mean, giving robustness to linear distortions (multi-RASTA).
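MRASTA's temporal functions are multi-resolution derivatives of Gaussians; the sketch below builds a reduced bank (4 assumed widths instead of the full set, and a single three-band neighbourhood instead of 14 bands) just to illustrate the outer-product construction and the zero-mean property:

```python
import numpy as np

def gaussian_derivatives(n_taps, sigmas):
    """First and second derivatives of Gaussians at several temporal widths."""
    t = np.arange(n_taps) - n_taps // 2
    filters = []
    for s in sigmas:
        g = np.exp(-0.5 * (t / s) ** 2)
        g1 = -t / s**2 * g                       # 1st derivative of Gaussian
        g2 = (t**2 / s**4 - 1.0 / s**2) * g      # 2nd derivative of Gaussian
        for f in (g1, g2):
            f = f - f.mean()                     # zero mean -> robustness to
            filters.append(f)                    # linear (constant) distortions
    return np.array(filters)

# 2-D projections: outer products of temporal filters with simple
# frequency-domain vectors spanning a band and its neighbours.
temporal = gaussian_derivatives(n_taps=101, sigmas=[8, 16, 32, 64])
freq_fns = np.array([[0.0, 1.0, 0.0],      # the band itself
                     [-1.0, 2.0, -1.0]])   # difference across adjacent bands
projections = [np.outer(fq, ft) for ft in temporal for fq in freq_fns]
print(len(projections), projections[0].shape)   # 16 (3, 101)
```

Scaled up to the full bank of temporal functions across all 14 critical bands, this outer-product construction is what yields the 448 zero-mean time-frequency projections of the slide.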

Alternative 2-D Bases
Actual measured spectro-temporal receptive fields (STRFs) from ferret cortex (Nima Mesgarani), applied as a filter bank; or outer products of variable-length Gabor functions (Michael Kleinschmidt et al.).

From Time-Frequency Projections to Phoneme Posteriors
Peripheral-like processing → 448 time-frequency projections (MRASTA, ~1 s of context) → higher-level (cortical-like) processing → classifier (trained feed-forward neural net) → phoneme posteriors, displayed as a posteriogram. The most efficient (smallest) set of features are class posteriors (e.g. Fukunaga).
[Figure: posteriogram of the digit string "one nine three seven two" with its phoneme labels]

TANDEM Features for a Conventional HMM System
Measurements → pre-processing → neural network trained to estimate posteriors of speech classes. The pre-softmax outputs are decorrelated by principal component projection and passed to the HMM.
[Figure: histogram of one feature and correlation matrix of the features]
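A sketch of the decorrelation step, with random numbers standing in for real pre-softmax network outputs (the dimensions below are invented):

```python
import numpy as np

def tandem_features(pre_softmax, n_components):
    """Decorrelate pre-softmax network outputs by PCA (TANDEM-style features).

    pre_softmax : (n_frames, n_classes) linear outputs before the softmax
    Returns decorrelated features suited to a diagonal-covariance HMM.
    """
    X = pre_softmax - pre_softmax.mean(axis=0)
    cov = (X.T @ X) / (len(X) - 1)
    evals, evecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    basis = evecs[:, ::-1][:, :n_components]    # leading principal axes
    return X @ basis

rng = np.random.default_rng(0)
feats = tandem_features(rng.standard_normal((1000, 30)), n_components=20)
cov = np.cov(feats, rowvar=False)
off_diag = cov - np.diag(np.diag(cov))
print(np.abs(off_diag).max() < 1e-8)   # True: features are decorrelated
```

The point of the projection is visible in the slide's correlation matrix: after PCA the off-diagonal correlations vanish, which matches the diagonal-covariance Gaussians of a conventional HMM system.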

Speech Intelligibility
The speech modulation spectrum peaks around 4 Hz. Reverberation acts as a low-pass filter in the modulation domain; noise reduces the depth of the modulation.
Speech Transmission Index (STI) [Houtgast and Steeneken 1980]: a test signal at seven different carrier frequencies with varying modulation frequencies (figure from H.M.N. Steeneken, The Measurement of Speech Intelligibility, 1998).
[Figure: spectral envelope trajectories and envelope spectra for original, reverberant, and noisy speech, over time [s] and modulation frequency [Hz]]
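Both degradations have closed-form modulation-transfer expressions in the STI framework; the functions below implement the standard formulas for an ideal exponential reverberation tail and stationary additive noise:

```python
import numpy as np

def mtf_reverb(f_mod, t60):
    """Modulation transfer of an ideal exponential reverberation tail
    (Houtgast & Steeneken): reverberation low-pass filters the modulation
    at frequency f_mod [Hz] for reverberation time t60 [s]."""
    return 1.0 / np.sqrt(1.0 + (2.0 * np.pi * f_mod * t60 / 13.8) ** 2)

def mtf_noise(snr_db):
    """Reduction of modulation depth by additive stationary noise."""
    return 1.0 / (1.0 + 10.0 ** (-snr_db / 10.0))

# Combined modulation reduction for a 1 s reverberation time at 10 dB SNR:
f = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 16.0])
m = mtf_reverb(f, t60=1.0) * mtf_noise(snr_db=10.0)
print(np.all(np.diff(m) < 0))   # True: faster modulations suffer more
```

The monotonic fall-off with modulation frequency is exactly the low-pass behaviour the slide describes, while the noise term scales all modulation depths down uniformly.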

Spectro-Temporal Modulation Index (STMI), M. Elhilali
2-D "ripple" signals are used instead of modulated sinusoids.
[Figure: ripple stimulus in time (0-250 ms) and frequency (kHz), modulation depth ΔA, ripple frequency w = 4 Hz]

Data for Modulation Spectrum Processing
[Figure: a segment of the temporal trajectory in the 1-3 Bark band]

PEMO Analysis (Göttingen, Oldenburg)

Frequency-Domain Linear Prediction (with Marios Athineos and Dan Ellis, 2003)
An all-pole model of part of the time-frequency plane (e.g. a segment of the 1-3 Bark band).

All-pole Model of the Temporal Trajectory of Spectral Energy
Conventional LP: the signal → its power spectrum → an all-pole model of the power spectrum.
Spectral-domain LP: the DCT of the signal → the Hilbert envelope of the signal → an all-pole model of the Hilbert envelope.
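A compact sketch of this duality, assuming a naive matrix DCT and the autocorrelation method of linear prediction (the published FDLP implementation differs in detail, e.g. in windowing and normalization):

```python
import numpy as np

def lp_allpole(x, order):
    """Autocorrelation-method linear prediction (Yule-Walker equations)."""
    n = len(x)
    r = np.array([np.dot(x[:n - k], x[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * r[0] * np.eye(order)            # tiny ridge for stability
    a = np.linalg.solve(R, -r[1:order + 1])
    gain2 = r[0] + np.dot(a, r[1:order + 1])    # prediction error power
    return np.concatenate(([1.0], a)), gain2

def fdlp_envelope(signal, order=20, n_points=512):
    """LP applied to the DCT of a signal: an all-pole model of its squared
    Hilbert envelope, the frequency-domain dual of conventional LP."""
    n = len(signal)
    k = np.arange(n)
    # DCT-II as an explicit matrix product (fine for short segments)
    C = np.cos(np.pi * np.outer(k, 2 * k + 1) / (2.0 * n))
    a, gain2 = lp_allpole(C @ signal, order)
    omega = np.linspace(0, np.pi, n_points)     # maps to time 0 .. n
    A = np.exp(-1j * np.outer(omega, np.arange(order + 1))) @ a
    return gain2 / np.abs(A) ** 2

# AM tone whose amplitude peaks mid-segment: the all-pole envelope should too
n = 400
x = np.hanning(n) * np.cos(2 * np.pi * 0.2 * np.arange(n))
e = fdlp_envelope(x)
print(0.3 < np.argmax(e) / len(e) < 0.7)   # True: peak near the middle
```

Conventional LP fits poles to spectral peaks; applying the identical machinery to the DCT sequence fits poles to peaks of the temporal envelope instead, which is the content of the slide's two parallel chains.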

All-pole Models of Sub-band Energy Contours
Signal → discrete cosine transform → low-frequency and high-frequency ranges of the DCT coefficients → prediction → all-pole models of the low-frequency and high-frequency Hilbert envelopes.

[Figure: tonality over time of the critical-band spectrum from the FFT vs. the critical-band spectrum from all-pole models of Hilbert envelopes in critical bands]

Modulation Spectrum Processing in Speech Technology
Feature extraction in ASR
- Historical: H. Hermansky, "Speech beyond 20 ms: Speech Processing in Temporal Domain," invited keynote lecture, Proceedings of the International Workshop on Human Interface Technology, Aizu, Japan, 1994.
- Recent: F. Valente and H. Hermansky, "Hierarchical Neural Networks Feature Extraction for LVCSR System," Proceedings of the International Conference on Spoken Language Processing, Antwerp, Belgium.
Speech intelligibility
- Historical: T. Houtgast and H. J. M. Steeneken, "The modulation transfer function in room acoustics as a predictor of speech intelligibility," Acustica 28: 66-73, 1973.
- Recent: M. Elhilali, Neural Basis and Computational Strategies for Auditory Processing, Ph.D. thesis, University of Maryland, 2004.
Speech coding
- Historical: M. Athineos, H. Hermansky, and D. P. W. Ellis, "LP-TRAP: Linear Predictive Temporal Patterns," Proc. ICSLP, Jeju, S. Korea, October 2004.
- Recent: P. Motlicek, V. Ullal, and H. Hermansky, "Wide-Band Perceptual Audio Coding based on Frequency-Domain Linear Prediction," Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Honolulu, HI, 2007.

Thank You
Conclusion of Part 3 (of 3): THE MODULATION SPECTRUM and Its Application to Speech Science and Technology