Cocktail Party Problem as Binary Classification

Cocktail Party Problem as Binary Classification
DeLiang Wang
Perception & Neurodynamics Lab, Ohio State University

Outline of presentation
- Cocktail party problem
- Computational theory analysis
- Ideal binary mask
- Speech intelligibility tests
- Unvoiced speech segregation as binary classification

Real-world audition
- What? Speech (message; speaker: age, gender, linguistic origin, mood, …), music, a car passing by
- Where? Left, right, up, down; how close?
- Channel and environment characteristics: room reverberation, ambient noise

Sources of intrusion and distortion
- Additive noise from other sound sources
- Channel distortion
- Reverberation from surface reflections

Cocktail party problem
- Term coined by Cherry: "One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it 'the cocktail party problem'…" (Cherry'57)
- "For 'cocktail party'-like situations… when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers" (Bronkhorst & Plomp'92)
- Ball-room problem by Helmholtz: "complicated beyond conception" (Helmholtz, 1863)
- Also known as the speech segregation problem

Approaches to the speech segregation problem
- Speech enhancement: enhance the signal-to-noise ratio (SNR) or speech quality by attenuating interference. Applicable to monaural recordings. Limitation: stationarity and estimation of interference
- Spatial filtering (beamforming): extract the target sound from a specific spatial direction with a sensor array. Limitation: configuration stationarity; what if the target switches or changes location?
- Independent component analysis (ICA): find a demixing matrix from mixtures of sound sources. Limitation: strong assumptions, chief among them stationarity of the mixing matrix
- "No machine has yet been constructed to do just that [solving the cocktail party problem]." (Cherry'57)
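To make the speech-enhancement route concrete, here is a minimal spectral-subtraction sketch. The STFT parameters, the assumption that the first few frames contain noise only, and the spectral floor are all illustrative choices, and the stationarity limitation noted above applies to it directly.

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtraction(mixture, fs, noise_frames=10, nperseg=512, floor=0.01):
    """Minimal spectral subtraction: estimate the noise magnitude spectrum from
    the first few frames (assumed noise-only) and subtract it from every frame."""
    _, _, X = stft(mixture, fs=fs, nperseg=nperseg)
    mag, phase = np.abs(X), np.angle(X)
    noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)  # stationary-noise estimate
    clean_mag = np.maximum(mag - noise_mag, floor * mag)           # spectral floor avoids negative magnitudes
    _, enhanced = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=nperseg)
    return enhanced
```

If the interference is nonstationary (e.g. a competing talker), the leading-frame noise estimate quickly becomes stale, which is exactly the limitation listed above.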

Auditory scene analysis
- Listeners parse the complex mixture of sounds arriving at the ears in order to form a mental representation of each sound source
- This perceptual process is called auditory scene analysis (Bregman'90)
- Two conceptual processes of auditory scene analysis (ASA):
  - Segmentation: decompose the acoustic mixture into sensory elements (segments)
  - Grouping: combine segments into groups, so that segments in the same group likely originate from the same environmental source

Computational auditory scene analysis
- Computational auditory scene analysis (CASA) approaches sound separation based on ASA principles
- Feature-based approaches
- Model-based approaches

Outline of presentation
- Cocktail party problem
- Computational theory analysis
- Ideal binary mask
- Speech intelligibility tests
- Unvoiced speech segregation as binary classification

What is the goal of CASA?
- What is the goal of perception?
  - The perceptual systems are ways of seeking and extracting information about the environment from sensory input (Gibson'66)
  - The purpose of vision is to produce a visual description of the environment for the viewer (Marr'82)
  - By analogy, the purpose of audition is to produce an auditory description of the environment for the listener
- What is the computational goal of ASA?
  - The goal of ASA is to segregate sound mixtures into separate perceptual representations (or auditory streams), each of which corresponds to an acoustic event (Bregman'90)
  - By extrapolation, the goal of CASA is to develop computational systems that extract individual streams from sound mixtures

Marrian three-level analysis
- According to Marr (1982), a complex information processing system must be understood at three levels:
  - Computational theory: the goal, its appropriateness, and the basic processing strategy
  - Representation and algorithm: representations of input and output and the transformation algorithms
  - Implementation: physical realization
- All levels of explanation are required for an eventual understanding of perceptual information processing
- Computational-theory analysis, understanding the character of the problem, is critically important

Computational-theory analysis of ASA
- To form a stream, a sound must be audible on its own
- The number of streams that can be computed at a time is limited
  - The magical number 4 for simple sounds such as tones and vowels (Cowan'01)?
  - 1+1, or figure-ground segregation, in a noisy environment such as a cocktail party?
- Auditory masking further constrains the ASA output
  - Within a critical band, a stronger signal masks a weaker one

Computational-theory analysis of ASA (cont.)
The ASA outcome depends on sound types (overall SNR is 0 dB). Audio demonstrations (omitted here) paired the following sounds:
- Noise-Noise: pink, white, pink+white
- Tone-Tone: tone1, tone2, tone1+tone2
- Speech-Speech, Noise-Tone, Noise-Speech, Tone-Speech

Some alternative CASA goals
- Extract all underlying sound sources or the target sound source (the gold standard)
  - Implicit in speech enhancement, spatial filtering, and ICA
  - Segregating all sources is implausible, and probably unrealistic with one or two microphones
- Enhance automatic speech recognition (ASR)
  - Close coupling with a primary motivation of speech segregation
  - However, perceiving is more than recognizing (Treisman'99)
- Enhance human listening
  - Advantage: close coupling with auditory perception
  - However, there are applications that involve no human listening

Ideal binary mask as CASA goal
- Motivated by the above analysis, we have suggested the ideal binary mask as a main goal of CASA (Hu & Wang'01, '04)
- Key idea: retain the parts of a target sound that are stronger than the acoustic background and discard the rest
  - What the target is depends on intention, attention, etc.
- Definition of the ideal binary mask (IBM):
  IBM(t, f) = 1 if 10 log10[s(t, f) / n(t, f)] > θ, and 0 otherwise, where
  - s(t, f): target energy in T-F unit (t, f)
  - n(t, f): noise energy
  - θ: a local SNR criterion (LC) in dB, typically chosen to be 0 dB
- Note that the IBM does not actually separate the mixture!
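As a concrete rendering of this definition, the sketch below computes an IBM from premixed target and noise signals and applies it to the mixture. The STFT-based T-F decomposition is an illustrative stand-in for the gammatone-filterbank cochleagram used in the actual work, and the function and parameter names are ours.

```python
import numpy as np
from scipy.signal import stft, istft

def ideal_binary_mask(target, noise, fs, lc_db=0.0, nperseg=512):
    """Label a T-F unit 1 when the local target-to-noise ratio exceeds the
    local criterion (LC, in dB), and 0 otherwise."""
    _, _, S = stft(target, fs=fs, nperseg=nperseg)   # premixed target T-F representation
    _, _, N = stft(noise, fs=fs, nperseg=nperseg)    # premixed noise T-F representation
    eps = 1e-12
    local_snr_db = 10.0 * np.log10((np.abs(S) ** 2 + eps) / (np.abs(N) ** 2 + eps))
    return (local_snr_db > lc_db).astype(float)

def apply_mask(mixture, mask, fs, nperseg=512):
    """Resynthesize the mixture after weighting its STFT by the binary mask."""
    _, _, M = stft(mixture, fs=fs, nperseg=nperseg)
    _, masked = istft(M * mask, fs=fs, nperseg=nperseg)
    return masked
```

The mask is "ideal" only in the sense that it requires the premixed target and noise; it is a computational goal to be estimated, not a separation algorithm.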

IBM illustration

Properties of IBM
- Flexibility: with the same mixture, the definition leads to different IBMs depending on what the target is
- Well-definedness: the IBM is well defined no matter how many intrusions are in the scene or how many targets need to be segregated
- Consistent with the computational-theory analysis of ASA
  - Audibility and capacity
  - Auditory masking
  - Effects of target and noise types
- Optimality: under certain conditions the ideal binary mask with θ = 0 dB is the optimal binary mask from the perspective of SNR gain
- The ideal binary mask provides an excellent front end for robust ASR (Cooke et al.'01; Roman et al.'03)

Subject tests of ideal binary masking
- Recent studies found large speech intelligibility improvements from applying ideal binary masking for normal-hearing (Brungart et al.'06; Li & Loizou'08) and hearing-impaired (Anzalone et al.'06; Wang et al.'09) listeners
  - The improvement for stationary noise is above 7 dB for normal-hearing (NH) listeners and above 9 dB for hearing-impaired (HI) listeners
  - The improvement for modulated noise is significantly larger than for stationary noise
- The experiments of Brungart and Simpson (2001) show the coexistence of informational and energetic masking. They later isolated informational masking using an across-ear effect: listeners attending to two talkers in one ear experience informational masking from a third signal in the opposite ear (Brungart and Simpson, 2002)
- Arbogast et al. (2002) divided the speech signal into 15 log-spaced, envelope-modulated sine waves and assigned some to the target and some to the interference, producing intelligible speech with no spectral overlap

Test conditions of Wang et al.'09
- SSN: unprocessed monaural mixtures of speech-shaped noise (SSN) and Dantale II sentences (audio demos at 0 and -10 dB)
- CAFÉ: unprocessed monaural mixtures of cafeteria noise (CAFÉ) and Dantale II sentences (audio demos at 0 and -10 dB)
- SSN-IBM: IBM applied to SSN (audio demos at 0, -10, and -20 dB)
- CAFÉ-IBM: IBM applied to CAFÉ (audio demos at 0, -10, and -20 dB)
- Intelligibility is measured in terms of the speech reception threshold (SRT), the SNR level required for a 50% intelligibility score

Wang et al.'s results
- 12 NH subjects (10 male, 2 female) and 12 HI subjects (9 male, 3 female)
- SRT means (dB) for the 4 conditions for NH listeners: -8.2, -10.3, -15.6, -20.7
- SRT means (dB) for the 4 conditions for HI listeners: -5.6, -3.8, -14.8, -19.4

Speech perception of noise with binary gains
- Wang et al. (2008) found that, when the LC is chosen to be the same as the input SNR, nearly perfect intelligibility is obtained even when the input SNR is -∞ dB (i.e. the mixture contains noise only, with no target speech)
- One approach is to create a continuous noise signal that matches the long-term average spectrum of speech, then divide this speech-spectrum-shaped noise into a number of bands and modulate them with natural speech envelopes
  - However, such a noise signal is too "speech-like" and reintroduces informational masking (Brungart et al. 2004)
- Alternatively, use the ideal binary mask to remove informational masking
  - Eliminate the portions of the stimulus dominated by the interfering speech; retain only the portions of target speech that are acoustically detectable in the presence of the interfering speech
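A rough sketch, in the spirit of the stimuli described above, of gating a noise signal with a binary pattern derived from the target speech: the pattern marks where the target would be detectable relative to the noise, and only those noise regions are retained. The STFT front end, the criterion parameter, and the direct gating of the noise alone are simplifying assumptions, not the exact stimulus construction of Wang et al. (2008).

```python
import numpy as np
from scipy.signal import stft, istft

def binary_gated_noise(speech, noise, fs, lc_db=0.0, nperseg=512):
    """Gate a noise signal with the binary pattern of speech-vs-noise dominance."""
    _, _, S = stft(speech, fs=fs, nperseg=nperseg)
    _, _, N = stft(noise, fs=fs, nperseg=nperseg)
    eps = 1e-12
    local_snr_db = 10.0 * np.log10((np.abs(S) ** 2 + eps) / (np.abs(N) ** 2 + eps))
    mask = (local_snr_db > lc_db).astype(float)           # where the target would be detectable
    _, stimulus = istft(N * mask, fs=fs, nperseg=nperseg)  # binary gains applied to noise alone
    return stimulus
```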

Wang et al.'08 results
- Mean intelligibility scores for the 4 conditions: 97.1%, 92.9%, 54.3%, 7.6%
- Despite a great reduction of spectrotemporal information, a pattern of binary gains is apparently sufficient for human speech recognition

Interim summary
- The ideal binary mask is an appropriate computational goal of auditory scene analysis in general, and of speech segregation in particular
- Hence solving the cocktail party problem would amount to binary classification
- This formulation opens the problem to a variety of pattern classification methods

Outline of presentation
- Cocktail party problem
- Computational theory analysis
- Ideal binary mask
- Speech intelligibility tests
- Unvoiced speech segregation as binary classification

Unvoiced speech
- Speech sounds consist of vowels and consonants; consonants are further divided into voiced and unvoiced consonants
- For English, unvoiced speech sounds come from the following consonant categories:
  - Stops (plosives)
    - Unvoiced: /p/ (pool), /t/ (tool), /k/ (cake)
    - Voiced: /b/ (book), /d/ (day), /g/ (gate)
  - Fricatives
    - Unvoiced: /s/ (six), /sh/ (sheep), /f/ (fix), /th/ (think)
    - Voiced: /z/ (zoo), /zh/ (pleasure), /v/ (vine), /dh/ (that)
    - Mixed: /h/ (high)
  - Affricates (a stop followed by a fricative)
    - Unvoiced: /ch/ (chicken)
    - Voiced: /jh/ (orange)
- We refer to the above consonants as expanded obstruents

Unvoiced speech segregation
- Unvoiced speech constitutes 20-25% of all speech sounds, and it carries crucial information for speech intelligibility
- Unvoiced speech is more difficult to segregate than voiced speech
  - Voiced speech is highly structured, whereas unvoiced speech lacks harmonicity and is often noise-like
  - Unvoiced speech is usually much weaker than voiced speech and therefore more susceptible to interference

Processing stages of the Hu-Wang'08 model
- Peripheral processing results in a two-dimensional cochleagram
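A minimal sketch of such a peripheral stage, assuming a fourth-order gammatone filterbank with ERB-spaced center frequencies followed by framewise energy computation. The channel count, frame parameters, and filter details are illustrative choices rather than the model's exact front end.

```python
import numpy as np

def erb_space(low_hz, high_hz, n_channels):
    """Center frequencies equally spaced on the ERB-rate scale (Glasberg & Moore)."""
    ear_q, min_bw = 9.26449, 24.7
    low = np.log(low_hz + ear_q * min_bw)
    high = np.log(high_hz + ear_q * min_bw)
    return np.exp(np.linspace(low, high, n_channels)) - ear_q * min_bw

def gammatone_cochleagram(x, fs, n_channels=64, frame_len=0.020, frame_shift=0.010):
    """Filter x with 4th-order gammatone filters and return framewise channel energies."""
    t = np.arange(int(0.05 * fs)) / fs                    # 50-ms impulse responses
    cfs = erb_space(50.0, 0.9 * fs / 2, n_channels)
    flen, fshift = int(frame_len * fs), int(frame_shift * fs)
    n_frames = max(1, (len(x) - flen) // fshift + 1)
    cg = np.zeros((n_channels, n_frames))
    for c, cf in enumerate(cfs):
        b = 1.019 * 24.7 * (4.37 * cf / 1000.0 + 1.0)     # ERB bandwidth at this center frequency
        g = t ** 3 * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * cf * t)
        y = np.convolve(x, g / np.sqrt(np.sum(g ** 2)), mode='same')
        for m in range(n_frames):
            seg = y[m * fshift : m * fshift + flen]
            cg[c, m] = np.sum(seg ** 2)                   # energy of one T-F unit
    return cg, cfs
```

Each element cg[c, m] is the energy of one time-frequency unit, the representation on which masks and segments are defined in the remainder of the talk.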

Auditory segmentation
- Auditory segmentation decomposes an auditory scene into contiguous time-frequency (T-F) regions (segments), each of which should contain signal mostly from the same sound source
  - This definition of segmentation applies to both voiced and unvoiced speech
  - It is equivalent to identifying the onsets and offsets of individual T-F segments, which correspond to sudden changes of acoustic energy
- Our segmentation is based on a multiscale onset/offset analysis (Hu & Wang'07):
  - Smoothing along the time and frequency dimensions
  - Onset/offset detection and onset/offset front matching
  - Multiscale integration
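The following is not the multiscale algorithm itself, but a single-channel, single-scale sketch of the onset/offset idea: smooth the log intensity, take its time derivative, and pair threshold crossings into (onset, offset) segments. The smoothing scale and thresholds are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def onset_offset_segments(channel_energy, sigma=2.0, threshold=0.05):
    """Return (onset_frame, offset_frame) pairs for one cochleagram channel.

    Onsets/offsets are frames where the derivative of the smoothed log
    intensity crosses +/- threshold, i.e. sudden rises/drops of energy."""
    log_e = np.log(np.asarray(channel_energy, dtype=float) + 1e-12)
    smoothed = gaussian_filter1d(log_e, sigma=sigma)   # smoothing along time
    d = np.gradient(smoothed)
    segments, onset = [], None
    for m in range(len(d)):
        if onset is None and d[m] > threshold:          # onset: sharp energy increase
            onset = m
        elif onset is not None and d[m] < -threshold:   # offset: sharp energy decrease
            segments.append((onset, m))
            onset = None
    if onset is not None:                               # close a segment left open at the end
        segments.append((onset, len(d) - 1))
    return segments
```

In the full model this analysis is done at several smoothing scales in both time and frequency, onset and offset fronts are matched across channels, and the scales are integrated, which is what produces the contiguous T-F segments shown next.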

Smoothed intensity (figure)
- Utterance: "That noise problem grows more annoying each day"
- Interference: crowd noise in a playground, mixed at 0 dB SNR
- Scales in frequency and time: (a) (0, 0), initial intensity; (b) (2, 1/14); (c) (6, 1/14); (d) (6, 1/4)

Segmentation result (figure)
- Bounding contours of estimated segments from multiscale analysis; the background is shown in blue
- Panels: one-scale, two-scale, three-scale, and four-scale analysis; the ideal binary mask; the mixture

Grouping
- Apply auditory segmentation to generate all segments for the entire mixture
- Segregate voiced speech using an existing algorithm
- Identify segments dominated by the voiced target using the segregated voiced speech
- Identify segments dominated by unvoiced speech based on speech/nonspeech classification
  - Nonspeech interference is assumed, due to the lack of sequential organization
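A sketch of the voiced-dominance step under simple assumptions: a segment is labeled as voiced target if more than half of its energy lies in T-F units already assigned to the segregated voiced speech. The 0.5 overlap threshold and the mask-based representations are illustrative, not the model's exact criterion.

```python
import numpy as np

def label_voiced_segments(segments, voiced_mask, cochleagram, overlap=0.5):
    """Split segments into those dominated by the segregated voiced target
    and those left for speech/nonspeech classification.

    segments: list of boolean T-F masks, one per segment
    voiced_mask: boolean T-F mask of the segregated voiced speech
    cochleagram: T-F energies of the mixture
    """
    voiced, remaining = [], []
    for seg in segments:
        seg_energy = np.sum(cochleagram[seg])
        shared_energy = np.sum(cochleagram[seg & voiced_mask])
        if seg_energy > 0 and shared_energy / seg_energy > overlap:
            voiced.append(seg)       # dominated by the voiced target
        else:
            remaining.append(seg)    # candidate for speech/nonspeech classification
    return voiced, remaining
```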

Speech/nonspeech classification
A T-F segment s is classified as speech if P(H0 | Xs) > P(H1 | Xs), where
- Xs: the energy of all the T-F units within segment s
- H0: the hypothesis that s is dominated by expanded obstruents
- H1: the hypothesis that s is interference dominant

Speech/nonspeech classification (cont.)
- By Bayes' rule, we have
  P(H0 | Xs) / P(H1 | Xs) = [p(Xs | H0) P(H0)] / [p(Xs | H1) P(H1)]
- Since segments have varied durations, directly evaluating the above likelihoods is computationally infeasible
- Instead, we assume that each time frame within a segment is statistically independent given a hypothesis, so that p(Xs | Hi) becomes a product of per-frame likelihoods
- A multilayer perceptron is trained to distinguish expanded obstruents from nonspeech interference
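Under the frame-independence assumption, the segment-level decision reduces to summing per-frame log likelihood ratios, which can be recovered from the frame-level posteriors of a trained classifier (an MLP in this model) by dividing out the training priors. The sketch below shows that bookkeeping; the function names and the balanced training-prior default are assumptions for illustration.

```python
import numpy as np

def segment_log_likelihood_ratio(frame_posteriors, train_prior_h0=0.5):
    """Sum of per-frame log likelihood ratios log[p(x_m|H0)/p(x_m|H1)], obtained
    from classifier posteriors P(H0|x_m) by dividing out the training priors
    (frames are assumed conditionally independent given the hypothesis)."""
    p = np.clip(np.asarray(frame_posteriors, dtype=float), 1e-6, 1 - 1e-6)
    frame_llr = np.log(p / (1 - p)) - np.log(train_prior_h0 / (1 - train_prior_h0))
    return np.sum(frame_llr)

def classify_segment(frame_posteriors, log_prior_ratio, train_prior_h0=0.5):
    """Label a segment as expanded obstruent (H0) when P(H0|Xs) > P(H1|Xs),
    i.e. when the total log likelihood ratio plus the log prior ratio is positive."""
    return segment_log_likelihood_ratio(frame_posteriors, train_prior_h0) + log_prior_ratio > 0
```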

Speech/nonspeech classification (cont.)
- The prior probability ratio P(H0) / P(H1) is found to be approximately linear with respect to the input SNR
- Assuming that the interference energy does not vary greatly over the duration of an utterance, the earlier segregation of voiced speech enables us to estimate the input SNR

Speech/nonspeech classification (cont.)
- With the estimated input SNR, each segment is then classified as either expanded obstruents or interference
- Segments classified as expanded obstruents join the segregated voiced speech to produce the final output
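A sketch of how these pieces might fit together: estimate the input SNR from the segregated voiced speech, map it to a prior ratio with an assumed linear fit (the coefficients below are placeholders, not the trained values), and pass the result to the segment classifier sketched earlier. Segments judged to be expanded obstruents would then be merged with the voiced-speech mask to form the final output.

```python
import numpy as np

def estimate_input_snr(mixture_cochleagram, voiced_mask):
    """Rough input-SNR estimate: treat energy inside the segregated voiced mask
    as target and the rest as interference (a simplifying assumption)."""
    target_e = np.sum(mixture_cochleagram[voiced_mask])
    noise_e = np.sum(mixture_cochleagram[~voiced_mask]) + 1e-12
    return 10.0 * np.log10(target_e / noise_e + 1e-12)

def log_prior_ratio_from_snr(snr_db, a=0.05, b=1.0):
    """The prior ratio P(H0)/P(H1) is modeled as approximately linear in the
    input SNR; a and b are placeholders for coefficients fit during training."""
    return np.log(max(1e-6, a * snr_db + b))
```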

Example of segregation
- Utterance: "That noise problem grows more annoying each day"
- Interference: crowd noise in a playground
- (IBM: ideal binary mask)

SNR of segregated target (figure)
- Compared to spectral subtraction assuming perfect speech pause detection
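For reference, a minimal way to compute the SNR of a segregated target against the premixed (reference) target signal; treating the residual difference as noise is the usual convention, though the exact evaluation details in the original work may differ.

```python
import numpy as np

def output_snr_db(reference_target, segregated_target):
    """SNR of the segregated output: reference energy over the energy of the
    difference between the segregated output and the reference."""
    n = min(len(reference_target), len(segregated_target))
    ref = np.asarray(reference_target[:n], dtype=float)
    est = np.asarray(segregated_target[:n], dtype=float)
    return 10.0 * np.log10(np.sum(ref ** 2) / (np.sum((est - ref) ** 2) + 1e-12))
```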

Conclusion
- Analysis of the ideal binary mask as the CASA goal
- Formulation of the cocktail party problem as binary classification
- Segregation of unvoiced speech based on segment classification
- The proposed model represents the first systematic study of unvoiced speech segregation

Credits
- Speech intelligibility tests of the IBM: joint with Ulrik Kjems, Michael S. Pedersen, Jesper Boldt, and Thomas Lunner, at Oticon
- Unvoiced speech segregation: joint with Guoning Hu