Speech Perception in Noise and Ideal Time-Frequency Masking
DeLiang Wang
Oticon A/S, Denmark (on leave from Ohio State University, USA)



Outline of presentation
- Background
- Ideal binary time-frequency mask
- Speech masking in perception
- Three experiments on ideal binary masking with normal-hearing listeners
  - Two on multitalker mixtures
  - One on speech-noise mixtures

Auditory scene analysis (Bregman'90)
- Listeners are able to parse the complex mixture of sounds arriving at the ears in order to retrieve a mental representation of each sound source
  - Ball-room problem, Helmholtz, 1863 ("complicated beyond conception")
  - Cocktail-party problem (Cherry'53): the challenge of constructing a machine with cocktail-party processing capability
- Two conceptual processes of auditory scene analysis (ASA):
  - Segmentation: decompose the acoustic mixture into sensory elements (segments)
  - Grouping: combine segments into groups (streams), so that segments in the same group likely originate from the same environmental source

Computational auditory scene analysis
- Computational ASA (CASA) systems approach sound separation based on ASA principles
- Different from traditional sound separation approaches, such as speech enhancement, beamforming with a sensor array, and independent component analysis

Ideal binary mask as the putative goal of CASA
- Key idea is to retain parts of a target sound that are stronger than the acoustic background, or to mask interference by the target
  - What a target is depends on intention, attention, etc.
- Within a local time-frequency (T-F) unit, the ideal binary mask is 1 if target energy is stronger than interference energy, and 0 otherwise (Hu & Wang'01; Roman et al.'03)
  - It does not actually separate the mixture!
  - Local 0-dB SNR criterion for mask generation
- Earlier studies use binary masks as an output representation (Brown & Cooke'94; Wang & Brown'99; Roweis'00), but do not suggest the explicit notion of the ideal binary mask
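The definition on this slide is simple enough to state directly in code. A minimal NumPy sketch (function and variable names are mine, not from the talk):

```python
import numpy as np

def ideal_binary_mask(target_energy, interference_energy, lc_db=0.0):
    """Ideal binary mask over a time-frequency grid: a unit is 1 when the
    local target-to-interference ratio (dB) exceeds the local criterion LC
    (0 dB in the original definition), and 0 otherwise."""
    eps = np.finfo(float).eps  # guard against log(0) in silent units
    local_snr_db = 10.0 * np.log10((target_energy + eps) /
                                   (interference_energy + eps))
    return (local_snr_db > lc_db).astype(np.uint8)

# Toy 2x3 grid of per-unit energies (arbitrary units).
target = np.array([[4.0, 1.0, 0.5],
                   [9.0, 0.1, 2.0]])
interference = np.array([[1.0, 1.0, 2.0],
                         [3.0, 1.0, 2.0]])
mask = ideal_binary_mask(target, interference)
# Only units where the target strictly dominates survive.
```

Note that the mask is computed from the premixed signals, which is exactly what makes it "ideal": it assumes oracle access to target and interference separately.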

Ideal binary mask illustration

Masking not as discontinuous as it appears

Resemblance to visual occlusion

Properties of ideal binary masks
- Consistent with the auditory masking phenomenon
  - Drullman (1995) finds no intelligibility difference whether noise is removed or kept in target-stronger T-F regions
- Optimality: the ideal binary mask is the optimal binary mask from the perspective of SNR gain
- Flexibility: with the same mixture, the definition leads to different masks depending on what the target is
- Well-definedness: an ideal mask is well defined no matter how many intrusions are in the scene or how many targets need to be segregated
- Ideal binary masks provide a highly effective front end for automatic speech recognition (Cooke et al.'01; Roman et al.'03)
  - ASR performance degrades gradually with deviations from the ideal mask (Roman et al.'03)
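The optimality bullet can be checked by brute force on a toy grid. Assuming per-unit energies add, the error of a masked mixture relative to the clean target is the target energy discarded (mask = 0) plus the interference energy retained (mask = 1); maximizing SNR gain means minimizing this error, and exhaustive search recovers the 0 dB ideal mask. A sketch with made-up energies:

```python
import itertools
import numpy as np

# Per-unit target and interference energies for a tiny 4-unit scene.
T = np.array([4.0, 1.0, 0.5, 9.0])
I = np.array([1.0, 2.0, 2.0, 3.0])

def error_energy(mask):
    """Residual error: target energy lost where mask = 0 plus
    interference energy retained where mask = 1."""
    m = np.asarray(mask)
    return np.sum(T * (1 - m)) + np.sum(I * m)

# Exhaustive search over all 2^4 binary masks.
best = min(itertools.product([0, 1], repeat=len(T)), key=error_energy)
ibm = tuple(int(t > i) for t, i in zip(T, I))  # 0 dB ideal binary mask
assert best == ibm  # the exhaustive optimum is the ideal mask
```

Each unit contributes independently to the error, so the per-unit decision "keep iff target exceeds interference" is globally optimal; the exhaustive search just makes that visible.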

Speech-on-speech masking
- Speech masking: a target speech signal is overwhelmed by a competing speech signal, degrading the intelligibility of the target speech for a listener
- Energetic masking
  - Spectral overlap of target and interfering speech, making the target inaudible
  - Competition at the periphery of the auditory system
- Informational masking
  - Target and interference are both audible, but the listener is unable to hear the target
  - Closely related to ASA: voice characteristics, spatial cues, etc.

Isolating informational masking
- Energetic and informational masking coexist in speech perception, making it difficult to study either form of masking in isolation
- Brungart and Simpson (2002) isolate informational masking using an across-ear effect
- Arbogast et al. (2002) divide the speech signal into envelope-modulated sine waves in separate frequency bands

Isolating energetic masking
- The ideal binary mask provides a potential methodology to remove informational masking, hence isolating energetic masking
  - Eliminate portions of the target dominated by interfering speech, hence accounting for the loss of target information due to energetic masking
  - Retain only acoustically detectable portions of target speech
- Perform "ideal" time-frequency segregation, hence eliminating informational masking

Ideal mask methodology
- Process the original target speech and masker signals through a bank of fourth-order gammatone filters (Patterson et al.'88), resulting in the cochleagram representation
- Generate the ideal mask by comparing target and masker energy at each T-F unit of the filter output before mixing
  - Criteria other than 0 dB LC are possible
- Synthesize the new speech stimulus from the resulting mask, a matrix of binary weights, and the gammatone output of the speech mixture
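The steps above can be sketched end to end with a textbook fourth-order gammatone impulse response (ERB bandwidths per Glasberg and Moore, 1990) and frame-level energy comparison. This is a rough, uncalibrated sketch, not the authors' implementation; the filter count, frame size, and stand-in signals are all illustrative:

```python
import numpy as np

def gammatone_ir(cf, fs, n=4, duration=0.025):
    """Impulse response of an n-th order gammatone filter centered at
    cf Hz, with ERB-scaled bandwidth (Glasberg & Moore, 1990)."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * cf / 1000.0 + 1.0)
    b = 1.019 * erb
    return t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * cf * t)

def tf_energies(x, cfs, fs, frame=320):
    """Filter x through the gammatone bank and sum squared output over
    consecutive frames, giving a (channels x frames) energy grid."""
    rows = []
    for cf in cfs:
        y = np.convolve(x, gammatone_ir(cf, fs), mode="same")
        n_frames = len(y) // frame
        rows.append((y[: n_frames * frame] ** 2).reshape(n_frames, frame).sum(axis=1))
    return np.array(rows)

fs = 16000
t = np.arange(fs // 2) / fs                  # 0.5 s of signal
target = np.sin(2 * np.pi * 500 * t)         # stand-in "target"
masker = 0.5 * np.sin(2 * np.pi * 2000 * t)  # stand-in "interference"
cfs = [500, 1000, 2000]                      # a real bank uses many more channels

# Compare energies channel by channel, frame by frame, before mixing.
mask = (tf_energies(target, cfs, fs) > tf_energies(masker, cfs, fs)).astype(int)
# Resynthesis would then weight the gammatone output of the *mixture*
# by this binary matrix and sum across channels.
```

With these signals, the 500-Hz channel of the mask comes out all ones (target dominates) and the 2000-Hz channel all zeros (masker dominates), matching the intuition behind the method.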

Cochleagram: auditory peripheral model
- Spectrogram
  - Plot of log energy across time and frequency (linear frequency scale)
- Cochleagram
  - Cochlear filtering by the gammatone filterbank (or other models of cochlear filtering), followed by a stage of nonlinear rectification; the latter corresponds to hair cell transduction, modeled by either a hair cell model or simple compressive operations (log or cube root)
  - Quasi-logarithmic frequency scale, with frequency-dependent filter bandwidths
  - Widely used in CASA

Effects of local SNR criteria
- Positive LC (local SNR criterion) values
  - Only retain T-F units where the target is strong relative to the interference
  - Further remove target information, beyond the loss already caused by energetic masking from the interference
  - As a result, the target signal becomes less audible
    - Performance degradation due to energetic masking by the interfering signal, as T-F units with not-so-strong target energy are removed
  - Performance would show "true" energetic effects without confounding with informational masking

Effects of local SNR criteria
- Negative LC values
  - Retain more T-F units in a mixture, even those units where the target is very weak compared to the masker
  - Build up the effects of informational masking by the interference, because the processing retains units where the interference is audible and stronger than the target
  - Performance would degrade, and it would be interesting to see at what point performance becomes equal to that of the original mixture
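One consequence worth making explicit: masks at different LC values are nested. Raising LC can only delete T-F units, lowering it can only add them, and at a sufficiently negative LC every unit is retained, i.e., the "processed" stimulus is just the original mixture. A quick numerical check with random local SNRs (values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# Local target-to-masker ratios (dB) for 50 hypothetical T-F units;
# energies drawn in [0.1, 10), so ratios fall within +/-20 dB.
local_snr_db = 10 * np.log10(rng.uniform(0.1, 10, 50) /
                             rng.uniform(0.1, 10, 50))

prev = np.zeros(50, dtype=bool)
for lc in [12, 6, 0, -6, -12, -24, -48]:
    m = local_snr_db > lc
    assert np.all(m >= prev)  # nested: lowering LC never removes a unit
    prev = m

all_retained = bool(prev.all())  # at -48 dB LC the mask is all ones
```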

"Ready Baron go to blue 1 now" (target) / "Ready Ringo go to white 4 now" (masker)
Original mixture vs. ideal mask at 0 dB LC

Varying LC values
- A positive 12-dB LC corresponds to assigning each T-F unit "1" if the target energy in that unit is 12 dB greater than the interference energy, and "0" otherwise

Experimental setup
- Two, three, or four simultaneous talkers, one of which is the target utterance
- All talkers are normalized to be equally loud, i.e., a 0 dB target-to-masker ratio (TMR = 0 dB)
- Nine listeners with normal hearing
- Stimuli: CRM (coordinate response measure) corpus
  - Form: "Ready (call sign) go to (color) (number) now"
  - Call signs: "arrow", "baron", "charlie", "eagle", "hopper", "laker", "ringo", "tiger"
  - Colors: "blue", "green", "red", "white"
  - Numbers: 1 through 8
- The target phrase contains the call sign "Baron"; each masking phrase contains a randomly selected call sign other than "Baron"

Experiment 1
- Experiment 1 uses same-talker utterances
- Typical stimulus: 2 talkers (2 utterances)

Experiment 1 results (2-, 3-, and 4-talker conditions)

Three distinct regions of performance
- Region I: positive LC. Masking by removing target energy: energetic masking
  - Each Δ dB increase above 0 dB in LC eliminates the same T-F units as fixing LC at 0 dB while reducing the overall SNR by Δ dB
  - Hence performance in Region I indicates the effect of energetic masking on multitalker speech perception at the corresponding reduced overall SNR
- Region II: near-perfect performance for LC from -12 dB to 0 dB, centered at -6 dB
  - Not centered at 0 dB, the optimal LC from the SNR-gain standpoint
- Region III: below -12 dB LC. Masking by adding back interference: informational masking
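The Region I equivalence is an algebraic identity: 10 log10(T/I) > Δ if and only if 10 log10((T · 10^(−Δ/10)) / I) > 0, so raising LC by Δ dB at fixed mixture SNR selects the same T-F units as attenuating the target by Δ dB with LC held at 0. A numerical spot check (energies are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.uniform(0.1, 10.0, size=100)  # per-unit target energies
I = rng.uniform(0.1, 10.0, size=100)  # per-unit interference energies
delta_db = 6.0

# Mask 1: raise the local criterion to +6 dB at the original mixture SNR.
m1 = 10 * np.log10(T / I) > delta_db
# Mask 2: keep LC at 0 dB but attenuate the target by 6 dB overall.
m2 = 10 * np.log10(T * 10 ** (-delta_db / 10) / I) > 0.0
same_units = np.array_equal(m1, m2)  # the same T-F units survive
```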

Error analysis for the two-talker case
- Supports the hypothesis that Region I errors are due to energetic masking and Region III errors are due to informational masking

Experiment 2
- The interfering speech was from the same talker, same-sex talker(s), or different-sex talker(s) relative to the target signal
- What portion of the release from masking is attributable to energetic versus informational masking when target and masker have different voice characteristics?

Experiment 2 results

Experiment 3: speech perception in noise
- What effect does the ideal binary mask have on the intelligibility of speech in continuous noise?
  - Masking by continuous noise is considered primarily energetic
- Two types of noise were employed: speech-shaped noise and speech-modulated noise (the latter additionally matching the envelope of a nontarget phrase)
- Two methods of ideal mask generation, to test the equivalence between varying the overall SNR and varying the corresponding LC value
  - Method 1: fix the overall SNR at 0 dB while varying LC in the positive range
  - Method 2: fix LC at 0 dB while varying the overall SNR in the negative range

Experiment 3 results
- Methods 1 and 2 produce very similar results, supporting the equivalence of varying the overall SNR and varying LC
- The benefit from ideal binary masking (2-5 dB) is much smaller than with speech maskers
  - Consistent with the hypothesis that ideal masking mainly removes informational masking

Conclusions from experiments
- Applying the ideal binary mask (ideal T-F segregation) leads to a dramatic increase in speech intelligibility in multitalker conditions
  - Informational masking effects dominate performance in the CRM task
- Similarities between the voice characteristics of the target and interfering talkers have a minor effect on energetic masking
- A continuous noise masker results in much greater energetic masking
  - In this case, the ideal binary mask yields a smaller performance gain than in multitalker situations

Limitations and related work
- The small lexicon of the CRM corpus: tests with a larger-vocabulary corpus are needed for firmer conclusions
- Non-simultaneous masking is not considered
- Performance with hearing-impaired listeners?

What about hearing-impaired listeners?
- Anzalone et al. (2006) recently tested a different version of the ideal binary mask with both normal-hearing and hearing-impaired listeners
- Their tests use HINT sentences mixed with speech-shaped noise
- Ideal masking reduces the SRT (speech reception threshold) by 9 dB for hearing-impaired listeners and by more than 7 dB for normal-hearing listeners
- Hearing-impaired listeners are less sensitive to binary processing artifacts than normal-hearing listeners

Acknowledgment
- Joint work with Douglas Brungart, Peter Chang, and Brian Simpson
- Subject of a 2006 JASA paper