
Applying Models of Auditory Processing to Automatic Speech Recognition: Promise and Progress
Richard Stern (with Chanwoo Kim and Yu-Hsiang Chiu)
Department of Electrical and Computer Engineering and Language Technologies Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania
Universidad de Zaragoza, May 28, 2014

Slide 2: Introduction – auditory processing and automatic speech recognition
I was originally trained in auditory perception, and my original work was in binaural hearing
Over the past years, I have been spending the bulk of my time trying to improve the accuracy of automatic speech recognition systems in difficult acoustical environments
In this talk I would like to discuss some of the ways in which my group (and many others) have been attempting to apply knowledge of auditory perception to improve ASR accuracy
–Comment: approaches can be more or less faithful to physiology and psychophysics

Slide 3: The big questions ….
How can knowledge of auditory physiology and perception improve speech recognition accuracy?
Can speech recognition results tell us anything we don’t already know about auditory processing?
What aspects of the processing are most valuable for robust feature extraction?

Slide 4: So what I will do is ….
Briefly review some of the major physiological and psychophysical results that motivate the models
Briefly review and discuss the major “classical” auditory models of the 1980s
–Seneff, Lyon, and Ghitza
Review some of the major new trends in today’s models
Talk about some representative issues that have driven recent work at CMU and what we have learned from them

Slide 5: Basic auditory anatomy
Structures involved in auditory processing:

Slide 6: Central auditory pathways
There is a lot going on! For the most part, we only consider the response of the auditory nerve
–It is in series with everything else

Slide 7: Excitation along the basilar membrane

Slide 8: Transient response of auditory-nerve fibers
Histograms of response to tone bursts (Kiang et al., 1965):
Comment: Onsets and offsets produce overshoot

Slide 9: Frequency response of auditory-nerve fibers: tuning curves
Threshold level for auditory-nerve response to tones:
Note dependence of bandwidth on center frequency and asymmetry of response

Slide 10: Typical response of auditory-nerve fibers as a function of stimulus level
Typical response of auditory-nerve fibers to tones as a function of intensity:
Comment:
–Saturation and limited dynamic range

Slide 11: Synchronized auditory-nerve response to low-frequency tones
Comment: response remains synchronized over a wide range of intensities

Slide 12: Comments on synchronized auditory response
Nerve fibers synchronize to fine structure at “low” frequencies, signal envelopes at high frequencies
Synchrony clearly important for auditory localization
Synchrony could be important for monaural processing of complex signals as well

Slide 13: Lateral suppression in auditory processing
Auditory-nerve response to pairs of tones:
Comment: Lateral suppression enhances local contrast in frequency

Slide 14: Auditory frequency selectivity: critical bands
Measurements of psychophysical filter bandwidth by various methods:
Comments:
–Bandwidth increases with center frequency
–Solid curve is “Equivalent Rectangular Bandwidth” (ERB)

Slide 15: Three perceptual auditory frequency scales
Bark scale (DE)
Mel scale (USA)
ERB scale (UK)

Slide 16: Comparison of normalized perceptual frequency scales
Bark scale (in blue), Mel scale (in red), and ERB scale (in green):
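The three warping functions compared on these two slides have standard closed-form approximations. Here is a minimal sketch in Python using the O’Shaughnessy mel formula, the Glasberg–Moore ERB-rate formula, and Traunmüller’s Bark approximation; whether the slide’s plots used exactly these constants is an assumption.

```python
import numpy as np

def hz_to_mel(f):
    """O'Shaughnessy mel formula (the one used in MFCC front ends)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def hz_to_erb_rate(f):
    """Glasberg & Moore (1990) ERB-rate scale."""
    return 21.4 * np.log10(1.0 + 0.00437 * f)

def hz_to_bark(f):
    """Traunmueller (1990) approximation to the Bark scale."""
    return 26.81 * f / (1960.0 + f) - 0.53

# Normalized comparison over the speech band, as in the slide's plot
f = np.linspace(100.0, 8000.0, 5)
for scale in (hz_to_bark, hz_to_mel, hz_to_erb_rate):
    v = scale(f)
    print(scale.__name__, np.round(v / v.max(), 3))
```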

Slide 17: Forward and backward masking
Masking can have an effect even if target and masker are not simultaneously presented
Forward masking – the masker precedes the target
Backward masking – the target precedes the masker
Examples:

Slide 18: The loudness of sounds
Equal loudness contours (Fletcher-Munson curves):

Slide 19: Summary of basic auditory physiology and perception
Major monaural physiological attributes:
–Frequency analysis in parallel channels
–Preservation of temporal fine structure
–Limited dynamic range in individual channels
–Enhancement of temporal contrast (at onsets and offsets)
–Enhancement of spectral contrast (at adjacent frequencies)
Most major physiological attributes have psychophysical correlates
Most physiological and psychophysical effects are not preserved in conventional representations for speech recognition

Slide 20: Default signal processing: Mel frequency cepstral coefficients (MFCCs)
Comment: 20-ms time slices are modeled by smoothed spectra, with attention paid to auditory frequency selectivity
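For reference in the comparisons that follow, here is a minimal MFCC sketch (framing, Hamming window, power spectrum, triangular mel filterbank, log compression, DCT). The frame length, filter count, and coefficient count are typical textbook values, not necessarily those of the recognizer on the slide.

```python
import numpy as np
from scipy.fftpack import dct

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced uniformly on the mel scale."""
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = imel(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc(x, sr=16000, frame=0.025, hop=0.010, n_filters=40, n_ceps=13):
    n, step, n_fft = int(sr * frame), int(sr * hop), 512
    fb = mel_filterbank(n_filters, n_fft, sr)
    win = np.hamming(n)
    frames = [x[i:i + n] * win for i in range(0, len(x) - n, step)]
    pspec = np.abs(np.fft.rfft(frames, n_fft)) ** 2   # short-time power spectrum
    logmel = np.log(fb @ pspec.T + 1e-10)             # compress dynamic range
    return dct(logmel.T, type=2, norm='ortho')[:, :n_ceps]
```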

Slide 21: What the speech recognizer sees ….
An original spectrogram:
Spectrum “recovered” from MFCC:

Slide 22: Comments on the MFCC representation
It’s very “blurry” compared to a wideband spectrogram!
Aspects of auditory processing represented:
–Frequency selectivity and spectral bandwidth (but using a constant analysis window duration!)
»Wavelet schemes exploit time-frequency resolution better
–Nonlinear amplitude response
Aspects of auditory processing NOT represented:
–Detailed timing structure
–Lateral suppression
–Enhancement of temporal contrast
–Other auditory nonlinearities

Slide 23: Auditory models in the 1980s: the Seneff model
Overall model and detail of Stage II:
–An early well-known auditory model
–In addition to mean rate, used a “Generalized Synchrony Detector” to extract synchrony

Slide 24: Auditory models in the 1980s: Ghitza’s EIH model
–Estimated timing information from ensembles of zero crossings with different thresholds

Slide 25: Auditory models in the 1980s: Lyon’s auditory model
Single stage of the Lyon auditory model:
–Lyon model included nonlinear compression, lateral suppression, temporal effects
–Also added correlograms (autocorrelation and crosscorrelation of model outputs)

Slide 26: Perceptual Linear Prediction (Hermansky): pragmatic application of auditory principles
The PLP algorithm is a pragmatic approach to modeling the auditory periphery
Includes:
–Resampling for frequency warping
–Limited frequency resolution
–Pre-emphasis according to threshold of hearing
–Amplitude compression
–Smoothing using linear prediction
Computational complexity similar to LPCC, MFCC
Sometimes provides better recognition accuracy
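A minimal sketch of the perceptual stages listed above, applied to a single power-spectrum frame. The Bark warping, equal-loudness weighting, and cube-root compression follow Hermansky (1990); the critical-band shape is deliberately simplified, and the final LP smoothing is only indicated in a comment.

```python
import numpy as np

def hz_to_bark_plp(f):
    """Bark warping used in PLP (Hermansky, 1990)."""
    return 6.0 * np.log(f / 600.0 + np.sqrt((f / 600.0) ** 2 + 1.0))

def equal_loudness(f):
    """Hermansky's approximation to the equal-loudness curve."""
    w2 = (2.0 * np.pi * f) ** 2
    return ((w2 + 56.8e6) * w2 ** 2) / ((w2 + 6.3e6) ** 2 * (w2 + 0.38e9))

def plp_spectrum(power_spectrum, freqs, n_bands=21):
    """Bark integration, loudness pre-emphasis, cube-root compression."""
    z = hz_to_bark_plp(freqs)
    centers = np.linspace(z.min(), z.max(), n_bands)
    bands = np.empty(n_bands)
    for i, zc in enumerate(centers):
        w = np.clip(1.0 - np.abs(z - zc), 0.0, 1.0)   # simplified critical band
        bands[i] = np.sum(w * power_spectrum)
    bands *= equal_loudness(600.0 * np.sinh(centers / 6.0))  # invert Bark warp
    # Full PLP would next smooth these values with an all-pole (LP) model
    # (inverse DFT + Levinson-Durbin), yielding cepstral coefficients.
    return bands ** 0.33   # intensity-to-loudness compression
```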

Slide 27: PLP versus MFCC processing
Comments:
–Extra blocks in PLP motivated by additional physiological phenomena
–Implementation differences, even in “common blocks”

Slide 28: Auditory modeling was expensive: computational complexity of the Seneff model
Number of multiplications per ms of speech (from Ohshima and Stern, 1994):

Slide 29: Summary: early auditory models
The models developed in the 1980s included:
–“Realistic” auditory filtering
–“Realistic” auditory nonlinearity
–Synchrony extraction
–Lateral suppression
–Higher order processing through auto-correlation and cross-correlation
Every system developer had his or her own idea of what was important

Slide 30: Evaluation of early auditory models (Ohshima and Stern, 1994)
Not much quantitative evaluation actually performed
General trends of results:
–Physiological processing did not help much (if at all) for clean speech
–More substantial improvements observed for degraded input
–Benefits generally do not exceed what could be achieved with more prosaic approaches (e.g. CDCN/VTS in our case).

Slide 31: Other reasons why work on auditory models subsided in the late 1980s …
Failure to obtain a good statistical match between characteristics of features and speech recognition system
–Ameliorated by subsequent development of continuous HMMs
More pressing need to solve other basic speech recognition problems

Slide 32: Renaissance in the 1990s!
By the late 1990s, physiologically-motivated and perceptually-motivated approaches to signal processing began to flourish
Some major new trends ….
–Computation no longer such a limiting factor
–Serious attention to temporal evolution
–Attention to reverberation
–Binaural processing
–More effective and mature approaches to information fusion

Slide 33: Peripheral auditory modeling at CMU, 2004–now
Foci of activities:
–Representing synchrony
–The shape of the rate-intensity function
–Revisiting analysis duration
–Revisiting frequency resolution
–Onset enhancement
–Modulation filtering
–Binaural and “polyaural” techniques
–Auditory scene analysis: common frequency modulation

Slide 34: Speech representation using mean rate
Representation of vowels by Young and Sachs using mean rate:
Mean rate representation does not preserve spectral information

Slide 35: Speech representation using average localized synchrony rate
Representation of vowels by Young and Sachs using ALSR:

Slide 36: The importance of timing information
Re-analysis of Young-Sachs data by Secker-Walker and Searle:
Temporal processing captures dominant formants in a spectral region

Slide 37: Physiologically-motivated signal processing: the Zhang-Carney model of the periphery
We used the “synapse output” as the basis for further processing

Slide 38: Physiologically-motivated signal processing: synchrony and mean-rate detection (Kim/Chiu ‘06)
Synchrony response is smeared across frequency to remove pitch effects
Higher frequencies represented by mean rate of firing
Synchrony and mean rate combined additively
Much more processing than MFCCs

Slide 39: Comparing auditory processing with cepstral analysis: clean speech
Panels: original spectrogram; MFCC reconstruction; auditory analysis

Slide 40: Comparing auditory processing with cepstral analysis: 20-dB SNR

Slide 41: Comparing auditory processing with cepstral analysis: 10-dB SNR

Slide 42: Comparing auditory processing with cepstral analysis: 0-dB SNR

Slide 43: Auditory processing is more effective than MFCCs at low SNRs, especially in white noise
Curves are shifted by several dB (a greater improvement than obtained with VTS or CDCN) [Results from Kim et al., Interspeech 2006]
Accuracy in background noise: Accuracy in background music:

Slide 44: Do auditory models really need to be so complex?
Model of Zhang et al. (2001) versus a much simpler model

Slide 45: Comparing simple and complex auditory models
Comparing MFCC processing, a trivial (filter–rectify–compress) auditory model, and the full Carney-Zhang model (Chiu 2006):

Slide 46: The nonlinearity seems to be the most important attribute of the Seneff model (Chiu ‘08)

Slide 47: Why the nonlinearity seems to help …

Slide 48: Impact of auditory nonlinearity (Chiu)
Panels: MFCC; learned nonlinearity; baseline nonlinearity

Slide 49: The nonlinearity in PNCC processing (Kim)
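The slide’s figure is not reproduced here, but the design choice it illustrates is the replacement of MFCC’s logarithmic compression with a power-law nonlinearity; Kim and Stern’s published PNCC work uses an exponent of 1/15, which is assumed in this small comparison.

```python
import numpy as np

power = np.logspace(-4, 4, 9)        # channel power spanning 80 dB of level

log_compressed = np.log(power)       # MFCC-style: unbounded as power -> 0
power_law = power ** (1.0 / 15.0)    # PNCC-style: flattens at low levels,
                                     # approximating the auditory
                                     # rate-intensity function

for p, l, q in zip(power, log_compressed, power_law):
    print(f"{p:10.4f}  log: {l:7.2f}  power-law: {q:6.3f}")
```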

Slide 50: Frequency resolution
Examined several types of frequency resolution:
–MFCC triangular filters
–Gammatone filter shapes
–Truncated Gammatone filter shapes
Most results do not depend greatly on filter shape
Some sort of frequency integration is helpful when frequency-based selection algorithms are used
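For the gammatone alternatives mentioned above, here is a minimal filterbank sketch using the standard fourth-order gammatone impulse response with Glasberg–Moore ERB bandwidths; the impulse-response duration and normalization are illustrative assumptions, not the slide’s settings.

```python
import numpy as np

def gammatone_ir(fc, sr, n=4, duration=0.025):
    """Impulse response of an nth-order gammatone filter centered at fc Hz."""
    t = np.arange(int(sr * duration)) / sr
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)   # Glasberg & Moore ERB in Hz
    b = 1.019 * erb                            # conventional bandwidth scaling
    g = t ** (n - 1) * np.exp(-2.0 * np.pi * b * t) * np.cos(2.0 * np.pi * fc * t)
    return g / np.max(np.abs(g))

def gammatone_filterbank(x, sr, fcs):
    """Pass signal x through one gammatone channel per center frequency."""
    return np.stack([np.convolve(x, gammatone_ir(fc, sr), mode='same')
                     for fc in fcs])

# e.g., channels spaced between ~100 Hz and ~6 kHz on the ERB-rate scale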

Slide 51: Analysis window duration (Kim)
Typical analysis window duration for speech recognition is ~25-35 ms
Optimal analysis window duration for estimation of environmental parameters is considerably longer
Best systems measure environmental parameters (including voice activity detection) over a longer time interval but apply the results to a short-duration analysis frame
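One way to realize this split, consistent with the “medium-time” analysis of the later PNCC slides, is to smooth per-channel short-time power across neighboring frames when estimating environmental parameters while leaving the feature frames themselves untouched; a sketch, with the half-width M as an assumed parameter.

```python
import numpy as np

def medium_time_power(short_time_power, M=2):
    """Average short-time channel power over +/- M neighboring frames.

    short_time_power: array of shape (n_frames, n_channels) from a
    conventional ~25-ms analysis. The smoothed values are used only for
    environmental parameter estimation (noise level, VAD, etc.); the
    recognition features are still computed frame by frame.
    """
    n = short_time_power.shape[0]
    out = np.empty_like(short_time_power)
    for m in range(n):
        lo, hi = max(0, m - M), min(n, m + M + 1)
        out[m] = short_time_power[lo:hi].mean(axis=0)
    return out
```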

Slide 52: Temporal speech properties: modulation filtering
Output of speech and noise segments from the 14th Mel filter (1050 Hz)
Speech segment exhibits greater fluctuations

Slide 53: Nonlinear noise processing
Use nonlinear cepstral highpass filtering to pass speech but not noise
Why nonlinear?
–Need to keep results positive because we are dealing with manipulations of signal power

Slide 54: Asymmetric lowpass filtering (Kim, 2010)
Overview of processing:
–Assume that noise components vary slowly compared to speech components
–Obtain a running estimate of noise level in each channel using nonlinear processing
–Subtract estimated noise level from speech
An example:
Note: Asymmetric highpass filtering is obtained by subtracting the lowpass filter output from the input

Slide 55: Implementing asymmetric lowpass filtering
Basic equation:
Dependence on parameter values:
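The slide’s equation and parameter plots are not reproduced here. Based on Kim and Stern’s published description of PNCC, the filter is a first-order recursion whose forgetting factor depends on whether the input lies above or below the current output, so that a slow “up” factor and a fast “down” factor make the output track the lower envelope — a running noise-floor estimate. The constants and initialization below are assumptions taken from that description.

```python
import numpy as np

def asymmetric_lowpass(q_in, lam_a=0.999, lam_b=0.5):
    """Asymmetric first-order lowpass filter over frames, per channel.

    When the input rises above the current output, the slow factor lam_a
    applies and the estimate creeps upward; when it falls below, the fast
    factor lam_b applies and the estimate drops quickly. With lam_a > lam_b
    the output tracks the lower envelope of the channel power.
    """
    q_out = np.empty_like(q_in)
    q_out[0] = 0.9 * q_in[0]                  # initialization (assumed)
    for m in range(1, len(q_in)):
        lam = np.where(q_in[m] >= q_out[m - 1], lam_a, lam_b)
        q_out[m] = lam * q_out[m - 1] + (1.0 - lam) * q_in[m]
    return q_out

# Asymmetric *highpass* filtering (slide 54) is then
# q_in - asymmetric_lowpass(q_in).
```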

Slide 56: Block diagram of noise processing using asymmetric filters
Speech-nonspeech partitions are obtained based on a simple threshold test
Asymmetric highpass filtering is always applied; asymmetric lowpass filtering is applied only to non-speech segments

Slide 57: PNCC processing (Kim and Stern, 2010, 2014)
A pragmatic implementation of a number of the principles described:
–Gammatone filterbanks
–Nonlinearity shaped to follow auditory processing
–“Medium-time” environmental compensation using nonlinear cepstral highpass filtering in each channel
–Enhancement of envelope onsets
–Computationally efficient implementation
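Tying the preceding sketches together, here is a schematic of a PNCC-style front end. It reuses the gammatone, medium-time, and asymmetric-filter sketches above, omits PNCC’s temporal masking and cross-channel weight smoothing, and should be read as an illustration of the stage ordering rather than the reference implementation; the floor constant and block sizes are assumptions.

```python
import numpy as np
from scipy.fftpack import dct
# reuses gammatone_filterbank, medium_time_power, and asymmetric_lowpass
# from the earlier sketches

def pncc_like(x, sr, fcs, n_ceps=13):
    """Schematic PNCC-style pipeline (assumes len(fcs) >= n_ceps)."""
    channels = gammatone_filterbank(x, sr, fcs)        # auditory filterbank
    frame, hop = int(0.0256 * sr), int(0.010 * sr)
    n_frames = (channels.shape[1] - frame) // hop
    p = np.stack([np.sum(channels[:, m*hop:m*hop+frame] ** 2, axis=1)
                  for m in range(n_frames)])           # short-time power
    q = medium_time_power(p)                           # medium-time smoothing
    noise_floor = asymmetric_lowpass(q)                # running noise estimate
    clean = np.maximum(q - noise_floor, 0.01 * q)      # floored subtraction
    gain = clean / (q + 1e-20)
    p_norm = p * gain                                  # back to short-time power
    return dct(p_norm ** (1.0 / 15.0), type=2, norm='ortho')[:, :n_ceps]
```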

Slide 58: An integrated front end: power-normalized cepstral coefficients (PNCC, Kim ‘10)

Slide 59: An integrated front end: power-normalized cepstral coefficients (PNCC, Kim ‘10)

Slide 60: An integrated front end: power-normalized cepstral coefficients (PNCC, Kim ‘10)

Slide 61: Computational complexity of front ends

Slide 62: Performance of PNCC in white noise (RM)

Slide 63: Performance of PNCC in white noise (WSJ)

Slide 64: Performance of PNCC in background music

Slide 65: Performance of PNCC in reverberation

Slide 66: Contributions of PNCC components: white noise (WSJ)
Legend (cumulative): baseline MFCC with CMN; + medium-duration processing; + noise suppression; + temporal masking

Slide 67: Contributions of PNCC components: background music (WSJ)
Legend (cumulative): baseline MFCC with CMN; + medium-duration processing; + noise suppression; + temporal masking

Slide 68: Contributions of PNCC components: reverberation (WSJ)
Legend (cumulative): baseline MFCC with CMN; + medium-duration processing; + noise suppression; + temporal masking

Slide 69: Effects of onset enhancement/temporal masking (SSF processing, Kim ‘10)

Slide 70: PNCC and

Slide 71: Summary … so what matters?
Knowledge of the auditory system can certainly improve ASR accuracy:
–Use of synchrony
–Consideration of rate-intensity function
–Onset enhancement
–Nonlinear modulation filtering
–Selective reconstruction
–Consideration of processes mediating auditory scene analysis

Slide 72: Summary
PNCC processing includes:
–More effective nonlinearity
–Parameter estimation for noise compensation and analysis based on longer analysis time and frequency spread
–Efficient noise compensation based on modulation filtering
–Onset enhancement
–Computationally-efficient implementation
Not considered yet …
–Synchrony representation
–Lateral suppression