HCSNet December 2005
Auditory Scene Analysis and Automatic Speech Recognition in Adverse Conditions
Phil Green
Speech and Hearing Research Group, Department of Computer Science, University of Sheffield
With thanks to Martin Cooke, Guy Brown, Jon Barker

Overview
- Visual and Auditory Scene Analysis
- ‘Glimpsing’ in Speech Perception
- Missing Data ASR
- Finding the glimpses
- Current Sheffield Work
  - Dealing with Reverberation
  - Identifying Musical Instruments
  - Multisource Decoding
  - Speech Separation Challenge

Visual Scenes and Auditory Scenes
- Objects are opaque
  - Each spatial pixel images a single object
  - Object recognition has to cope with occlusion
- Sound is additive
  - Each time/frequency pixel receives contributions from many sound sources
  - Sound source recognition apparently requires reconstruction

‘Glimpsing’ in auditory scenes: the dominance effect (Cooke)
- Although audio signals add ‘additively’, the occlusion metaphor is a good approximation, due to log-like compression in the auditory system
- Consequently, most regions in a mixture are dominated by one or other source, leaving very few ambiguous regions, even for a pair of speech signals mixed at 0 dB
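The approximation behind the dominance effect can be written down directly. As a sketch of the standard "log-max" statement (notation mine, not from the slides): for source spectra S and N in a time-frequency cell,

```latex
\log\!\left(|S(t,f)|^2 + |N(t,f)|^2\right)
\;\approx\;
\max\!\left(\log |S(t,f)|^2,\; \log |N(t,f)|^2\right)
```

The error is largest when the two sources carry nearly equal energy in a cell, which is precisely the rare ambiguous-region case the slide refers to.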

Can listeners handle glimpses?

The robustness problem in Automatic Speech Recognition
- Current ASR devices cannot tolerate additive noise, particularly if it’s unpredictable
- Listeners’ noise tolerance is 1 or 2 orders of magnitude better in equivalent conditions (Lippmann 97)
- Can glimpsing be used as the basis for robust ASR? Requirements:
  - Adapt statistical ASR to the incomplete data case
  - Identify the glimpses
[Figure: clean speech + noise, and the corresponding missing data mask (oracle)]

Classification with Missing Data
- A common problem: visual occlusion, sensor failure, transmission losses..
- Need to evaluate the likelihood that observation vector x was generated by class C, f(x|C)
- Assume x has been partitioned into reliable and unreliable parts, (x_r, x_u)
- Two approaches:
  - Imputation: estimate x_u, then proceed as normal
  - Marginalisation: integrate over the possible range of x_u
- Marginalisation is preferable if there is no need to reconstruct x
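A minimal sketch of the marginalisation approach for the diagonal-covariance Gaussian mixtures used later in the talk; function and argument names are illustrative, not from the original system:

```python
# Minimal sketch (assumed names/shapes): likelihood of a partially observed
# feature vector under a diagonal-covariance Gaussian mixture, with the
# unreliable dimensions marginalised out entirely.
import numpy as np
from scipy.stats import norm

def marginal_likelihood(x, reliable, weights, means, variances):
    """f(x_r | C): integrate the unreliable dimensions out.

    x         : (D,) observation vector
    reliable  : (D,) boolean mask, True where x is dominated by the target
    weights   : (K,) mixture weights
    means     : (K, D) component means
    variances : (K, D) diagonal variances
    """
    r = reliable
    like = 0.0
    for w, mu, var in zip(weights, means, variances):
        # For a diagonal Gaussian, the marginal over the reliable dimensions
        # is just the reduced-dimensionality Gaussian.
        like += w * np.prod(norm.pdf(x[r], mu[r], np.sqrt(var[r])))
    return like
```

Imputation would instead estimate x_u (e.g. from the conditional mean of each component) and then evaluate the full-dimensional likelihood as normal.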

The Missing Data Likelihood Computation
- In ASR by continuous density HMMs, state distributions are Gaussian mixtures with diagonal covariance
- The marginal is just the reduced-dimensionality distribution
- The integral can be approximated by erfs
- This is computed independently for each mixture in the state distribution
Cooke et al 2001
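The slide's formulas did not survive transcription; the following is a reconstruction of the bounded-marginalisation likelihood along the lines of Cooke et al. 2001 (notation assumed). With reliable dimensions r, unreliable dimensions u, and the observed energy acting as an upper bound on the speech energy:

```latex
f(\mathbf{x} \mid C) \;\approx\; \sum_{k} P(k \mid C)
  \prod_{i \in r} \mathcal{N}\!\left(x_i;\, \mu_{ki},\, \sigma_{ki}^2\right)
  \prod_{j \in u} \int_{0}^{x_j} \mathcal{N}\!\left(\tilde{x};\, \mu_{kj},\, \sigma_{kj}^2\right) d\tilde{x}
```

Each one-dimensional bounded integral is where the erfs enter:

```latex
\int_{0}^{x_j} \mathcal{N}\!\left(\tilde{x};\, \mu,\, \sigma^2\right) d\tilde{x}
\;=\; \frac{1}{2}\left[\operatorname{erf}\!\left(\frac{x_j - \mu}{\sigma\sqrt{2}}\right)
      - \operatorname{erf}\!\left(\frac{-\mu}{\sigma\sqrt{2}}\right)\right]
```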

Counter-evidence from bounds
[Figure: observed spectrum x and mean spectrum for class C (energy against frequency), with reliable and unreliable regions marked]
Class C matches the reliable evidence well, but there is insufficient energy in the unreliable components

Finding the glimpses
- Auditory scene analysis identifies spectral regions dominated by a single source (Cooke 91)
  - Harmonicity
  - Common amplitude modulation
  - Sound source location
- Local SNR estimates can be used to compensate for predictable noise sources, as in the sketch below
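A minimal sketch of a local-SNR mask, assuming a stationary noise estimate and energy-domain rate-map features (names and the 0 dB threshold are illustrative):

```python
# Minimal sketch (assumed feature names and threshold): a local-SNR missing
# data mask from a known or estimated noise spectrum. Cells whose estimated
# local SNR exceeds the threshold are treated as reliable "glimpses".
import numpy as np

def snr_mask(mixture_ratemap, noise_estimate, threshold_db=0.0):
    """mixture_ratemap, noise_estimate: (channels, frames) energy features."""
    eps = 1e-12
    speech_estimate = np.maximum(mixture_ratemap - noise_estimate, eps)
    local_snr_db = 10.0 * np.log10(speech_estimate / (noise_estimate + eps))
    return local_snr_db > threshold_db   # True = reliable cell
```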

Harmonicity Masks
- Only meaningful in voiced segments
- Can be combined with SNR masks

Aurora Results (Sept 2001)
Average gain over clean baseline under all conditions: 65%
Barker et al 2001

Missing data masks from spatial location
- Cues for spatial location are used to separate a target source from masking sources
  - Interaural Time Difference (ITD), from cross-correlation between left and right binaural signals (see the sketch below)
  - Interaural Level Difference (ILD), from the ratio of energy in the left and right ears
  - Soft masks
- Task:
  - Target source: male speaker straight ahead
  - One or two masking sources (also male speakers) at other positions
  - Added reverberation
Sue Harding, Guy Brown
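A minimal sketch of the ITD cue, assuming per-band left/right signals sampled at fs Hz (all names illustrative; the actual system uses a full binaural auditory model):

```python
# Minimal sketch (assumptions: one gammatone-filtered band per call, fs in Hz):
# estimate the interaural time difference from the peak of the interaural
# cross-correlation, as the slide describes.
import numpy as np

def estimate_itd(left, right, fs, max_itd=1e-3):
    """Return the ITD (seconds) maximising the cross-correlation of one band."""
    max_lag = int(round(max_itd * fs))           # ~1 ms covers human head sizes
    lags = np.arange(-max_lag, max_lag + 1)
    xcorr = [np.sum(left[max(0, -lag):len(left) - max(0, lag)]
                    * right[max(0, lag):len(right) - max(0, -lag)])
             for lag in lags]
    return lags[int(np.argmax(xcorr))] / fs
```

The ILD for the same band would be the corresponding energy ratio, e.g. 10 * log10(sum(left**2) / sum(right**2)) dB.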

Missing data masks from spatial location (2)
[Figure: localisation masks (frequency channel against time in frames) for ITD only, ILD only, combined ILD/ITD, and the oracle mask; % accuracy plotted against azimuth of the masker in degrees]
Best performance is with combined ITD and ILD

MD for reverberant conditions (1)
- Palomäki, Brown and Barker have applied MD to the problem of room reverberation:
  - Use spectral normalization to deal with distortion caused by early reflections
  - Treat late reverberation as additive noise, and apply standard MD techniques
  - Select features which are uncontaminated by reverberation and contain strong speech energy
- Approach based on modulation filtering (sketched below):
  - Each rate map channel is passed through a modulation filter
  - Identify periods with enough energy in the filtered output
  - Use these to define a mask on the original rate map
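A minimal sketch of the modulation-filtering idea, with an illustrative Butterworth band-pass standing in for the paper's actual modulation filter (band edges and threshold are assumptions, not taken from the paper):

```python
# Minimal sketch (assumed filter design and threshold): derive a missing data
# mask by modulation filtering each rate-map channel, in the spirit of
# Palomaki, Brown & Barker (2004). Speech energy is modulated at syllabic
# rates (a few Hz); late reverberation is not.
import numpy as np
from scipy.signal import butter, filtfilt

def modulation_mask(ratemap, frame_rate=100.0, band=(1.0, 8.0), threshold=0.1):
    """ratemap: (channels, frames) auditory spectrogram sampled at frame_rate Hz."""
    b, a = butter(2, [band[0] / (frame_rate / 2), band[1] / (frame_rate / 2)],
                  btype='bandpass')
    filtered = filtfilt(b, a, ratemap, axis=1)    # keep speech-rate modulations
    strength = np.maximum(filtered, 0.0)          # half-wave rectify
    # Keep time-frequency cells with enough energy in the filtered output
    return strength > threshold * strength.max()  # True = reliable cell
```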

MD for reverberant conditions (2)
- Recognition of connected digits (Aurora 2)
- Reverberated using recorded room impulse responses
- Performance comparable with Brian Kingsbury’s hybrid HMM-MLP recognizer
K. J. Palomäki, G. J. Brown and J. Barker (2004) Speech Communication 43 (1-2)

MD for music analysis (1)
- Eggink and Brown have used MD techniques to identify concurrent musical instrument sounds
- Part of a system for transcribing chamber music
- Identify the F0 of the target note, and keep only its harmonics in the MD mask
- Uses a GMM classifier for each instrument, trained on isolated tones and short phrases
- Tested on tones, phrases and commercial CDs

MD for music analysis (2)
Example: duet for flute and clarinet
[Figure: fundamental frequency (Hz) against time (frames), with flute and clarinet note tracks marked]
All instrument tones correctly identified in this example
J. Eggink and G. J. Brown (2003) Proc. ICASSP, Hong Kong, IV
J. Eggink and G. J. Brown (2004) Proc. ICASSP, Montreal, V

Multisource Decoding
- Use primitive ASA and local SNR to identify time-frequency regions (fragments) dominated by a single source, i.e. possible segregations S…
- … but NOT to decide what the best segregation is
- Instead, jointly optimise over the word sequence W and the segregation S
- Based on missing data techniques – regions hypothesised as non-speech are missing
- The decoding algorithm finds the best subset of fragments to match the speech source
Barker, Cooke & Ellis 2003
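In symbols, one way to state the joint optimisation (notation assumed, following the spirit of Barker, Cooke & Ellis): given the observed mixture Y, search for

```latex
(\hat{W}, \hat{S}) \;=\; \operatorname*{arg\,max}_{W,\, S}\; P(W, S \mid Y)
```

so that the segregation is a by-product of recognition rather than a separate front-end decision.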

Multisource decoding algorithm
- Work forward in time, maintaining a set of alternative decodings – Viterbi searches based on a choice of speech fragments
- When a new fragment arrives, split decodings: speech or non-speech?
- When a fragment ends, merge decoders which differ only in its interpretation
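A heavily simplified sketch of that split/merge bookkeeping (all names hypothetical; a real decoder interleaves this with token-passing Viterbi, for which `advance` is a stand-in):

```python
# Minimal sketch (assumed interfaces): maintain alternative decodings, fork
# them when a fragment begins, and merge decodings that differ only in a
# finished fragment's interpretation, keeping the better-scoring one.
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    labels: dict = field(default_factory=dict)  # fragment id -> 'speech'/'noise'
    score: float = 0.0

def decode(frames, fragments, advance):
    """fragments: list of (start_frame, end_frame, fragment_id) triples."""
    hyps = [Hypothesis()]
    for t, frame in enumerate(frames):
        for (start, end, fid) in fragments:
            if start == t:   # fragment begins: fork every hypothesis
                hyps = [Hypothesis({**h.labels, fid: lab}, h.score)
                        for h in hyps for lab in ('speech', 'noise')]
            if end == t:     # fragment ends: merge decodings that differ
                best = {}    # only in this fragment's interpretation
                for h in hyps:
                    key = frozenset((f, l) for f, l in h.labels.items() if f != fid)
                    if key not in best or h.score > best[key].score:
                        best[key] = h
                hyps = list(best.values())
        for h in hyps:       # one frame of Viterbi scoring per segregation
            h.score += advance(frame, h.labels)
    return max(hyps, key=lambda h: h.score)
```

Merging keeps the search tractable: without it, the number of decodings would double with every fragment.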

Multisource Decoding on Aurora 2

Multisource decoding with a competing speaker
Andre Coy and Jon Barker
- Utterances of male and female speakers mixed at 0 dB
- Voiced regions: soft harmonicity masks from autocorrelation peaks
- Voiceless regions: fragments from ‘image processing’
- Gender-dependent HMMs; separate decoding for male and female
- 73.7% accuracy on a connected digit task

Informing Multisource Decoding – Work in progress
Ning Ma, Andre Coy, Phil Green
- HMM duration constraints
- Links between fragments – pitch continuity
- ‘Speechiness’

Speech separation challenge
- Organisers: Martin Cooke (University of Sheffield, UK), Te-Won Lee (UCSD, USA)
- Global comparison of techniques for separating and recognising speech
- Special session of Interspeech 2006 in Pittsburgh (USA), September 2006
- Task: recognise speech from a target talker in the presence of either stationary noise or other speech
- Training and test data supplied. One signal per mixture (i.e. the task is "single microphone")
- Speech material: simple sentences from the ‘Grid Task’, e.g. “place white at L 3 now”