
Effects of Lombard Reflex on Deep-Learning-Based Audio-Visual Speech Enhancement Systems

Daniel Michelsanti (1), Zheng-Hua Tan (1), Sigurdur Sigurdsson (2) and Jesper Jensen (1,2)
(1) Centre for Acoustic Signal Processing Research (CASPR), Aalborg University
(2) Oticon A/S
{danmi,zt,jje}@es.aau.dk, {ssig,jesj}@oticon.com

About Us

Daniel Michelsanti is a PhD Fellow at the Centre for Acoustic Signal Processing Research (CASPR), Aalborg University, Denmark, under the supervision of Zheng-Hua Tan, Sigurdur Sigurdsson and Jesper Jensen. His research interests include speech processing, computer vision and deep learning.

Zheng-Hua Tan is a Professor in the Department of Electronic Systems at Aalborg University, Denmark, and a co-founder of CASPR. His research interests include machine learning, deep learning, speech and speaker recognition, noise-robust speech processing, multimodal signal processing, and social robotics.

Sigurdur Sigurdsson is a Senior Specialist with Oticon A/S, Copenhagen, Denmark. His research interests include speech enhancement in noisy environments, machine learning and signal processing for hearing aid applications.

Jesper Jensen is a Senior Principal Scientist with Oticon A/S, Copenhagen, Denmark, and a Professor in the Department of Electronic Systems at Aalborg University. He is also a co-founder of CASPR. His main interests include signal retrieval from noisy observations, coding, intelligibility enhancement and signal processing for hearing aid applications.

Instructions

This is a demonstration of the impact of the Lombard effect on speech enhancement. To navigate the demo you can:
- Click on the blue bar on the right to go to the next page.
- Click on the blue bar on the left to go to the previous page.
- Click on the media to play the content. The media in this demonstration are playable if they have a red square in the bottom left corner.

Speech Enhancement

Speech enhancement is the task of estimating the clean speech of a target speaker immersed in an acoustically noisy environment, where different sources of disturbance are present, e.g. competing speakers, background music, and reflections from the walls. Usually, this estimation is performed by manipulating a time-frequency representation of the signal.

[Figure: a signal in the time domain and in the time-frequency domain. Icons designed by Freepik.]
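To make the time-frequency manipulation concrete, here is a minimal Python sketch of mask-based enhancement. The `mask_fn` gain function stands in for whatever estimator (e.g. a neural network) produces the mask; all names and the toy mask are illustrative assumptions, not code from the systems in this demo.

```python
# Minimal sketch of time-frequency-domain enhancement, assuming a
# gain mask is produced by some estimator (e.g. a neural network).
import numpy as np
from scipy.signal import stft, istft

def enhance(noisy, fs, mask_fn, n_fft=512):
    """Apply a time-frequency gain mask to a noisy waveform."""
    _, _, spec = stft(noisy, fs=fs, nperseg=n_fft)            # to T-F domain
    mask = mask_fn(np.abs(spec))                              # gains in [0, 1]
    _, enhanced = istft(mask * spec, fs=fs, nperseg=n_fft)    # back to time domain
    return enhanced

# Toy mask for illustration only: crude gating at the median magnitude.
toy_mask = lambda mag: (mag > np.median(mag)).astype(float)
```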

Lombard Effect

In the presence of background noise, speakers instinctively change their speaking style to keep their speech intelligible. This reflex is known as the Lombard effect [1], and it is characterized by:
- an increase in speech sound level [2];
- a longer word duration [3];
- modifications of the speech spectrum [2];
- hyper-articulation of speech [4].

It has been shown that the mismatch between the neutral and the Lombard speaking styles can lead to suboptimal performance of speaker recognition [5] and speech recognition [2] systems.

Architecture

We use a neural network architecture inspired by [6] and identical to the one in [7]. For the single-modality systems, one of the two encoders is discarded.
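The exact architecture is described in [6] and [7]; purely as an illustration of the two-encoder fusion idea, here is a rough PyTorch sketch in which audio and video embeddings are concatenated before decoding. Every layer type, size and name below is a placeholder assumption, not the actual model.

```python
# Rough illustration of a two-encoder audio-visual enhancement network.
# NOT the architecture of [6]/[7]: layer shapes are placeholder choices
# made only to show the fusion principle.
import torch
import torch.nn as nn

class AVEnhancer(nn.Module):
    def __init__(self, audio_dim=257, video_dim=512, hidden=256):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, hidden), nn.ReLU())
        self.video_enc = nn.Sequential(nn.Linear(video_dim, hidden), nn.ReLU())
        # Decoder maps the fused embedding back to a spectrogram frame.
        self.decoder = nn.Linear(2 * hidden, audio_dim)

    def forward(self, audio_feat, video_feat=None):
        a = self.audio_enc(audio_feat)
        # Single-modality variant: one encoder is discarded, as in the
        # AO/VO systems; here its slot is simply zero-filled.
        v = torch.zeros_like(a) if video_feat is None else self.video_enc(video_feat)
        return self.decoder(torch.cat([a, v], dim=-1))
```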

Goal

The purpose of this demo is two-fold:
- showing the benefit of using visual information about speakers to enhance their speech;
- comparing systems trained on non-Lombard (NL) speech with systems trained on Lombard (L) speech.

We trained six deep-learning-based systems:
- AO-L – audio-only, trained on Lombard speech;
- AO-NL – audio-only, trained on non-Lombard speech;
- VO-L – video-only, trained on Lombard speech;
- VO-NL – video-only, trained on non-Lombard speech;
- AV-L – audio-visual, trained on Lombard speech;
- AV-NL – audio-visual, trained on non-Lombard speech.

The systems were trained on utterances from the Lombard GRID corpus [8], to which speech-shaped noise is added at several signal-to-noise ratios (SNRs); the mixing step is sketched below. The following videos are from speakers observed during training (seen speakers). For more details, refer to [9].
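Mixing noise into clean speech at a prescribed SNR amounts to a simple power-based scaling. The sketch below shows only that step, assuming the speech and speech-shaped noise signals are already loaded as NumPy arrays; it is a generic illustration, not the data pipeline of [9].

```python
# Minimal sketch of mixing noise into clean speech at a target SNR.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`."""
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Solve 10*log10(p_speech / (scale**2 * p_noise)) = snr_db for scale.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# e.g. the SNRs used in the demo pages: -20, -10 and 0 dB.
```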

Speech Enhancement (-20 dB SNR)

Comparison between audio-only (AO), video-only (VO) and audio-visual (AV) systems.

[Playable media: UNPROCESSED, AO-L, VO-L and AV-L versions of the utterances "Lay blue by G zero soon", "Bin green by Q zero again" and "Bin blue in Z seven please".]

Speech Enhancement (-10 dB SNR)

Comparison between audio-only (AO), video-only (VO) and audio-visual (AV) systems.

[Playable media: UNPROCESSED, AO-L, VO-L and AV-L versions of the utterances "Lay blue by G zero soon", "Bin green by Q zero again" and "Bin blue in Z seven please".]

Speech Enhancement (0 dB SNR)

Comparison between audio-only (AO), video-only (VO) and audio-visual (AV) systems.

[Playable media: UNPROCESSED, AO-L, VO-L and AV-L versions of the utterances "Lay blue by G zero soon", "Bin green by Q zero again" and "Bin blue in Z seven please".]

Estimated Speech Quality and Intelligibility

The performance of the models is evaluated in terms of PESQ and ESTOI, which are good estimators of speech quality and speech intelligibility, respectively. PESQ ranges from -0.5 to 4.5, where higher values correspond to higher speech quality. ESTOI ranges practically between 0 and 1, where higher scores correspond to higher speech intelligibility.
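For reference, both metrics can be computed with the open-source `pesq` and `pystoi` Python packages (pip install pesq pystoi). The snippet below is a sketch using those packages; the evaluation code actually used for this work may differ.

```python
# Hedged sketch of computing PESQ and ESTOI with open-source packages.
from pesq import pesq      # ITU-T P.862; scores roughly in [-0.5, 4.5]
from pystoi import stoi    # extended=True gives ESTOI, roughly in [0, 1]

def evaluate(clean, enhanced, fs=16000):
    """Return (PESQ, ESTOI) for an enhanced signal against its clean reference."""
    quality = pesq(fs, clean, enhanced, 'wb')                    # wideband PESQ
    intelligibility = stoi(clean, enhanced, fs, extended=True)   # ESTOI
    return quality, intelligibility
```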

Speech Enhancement (-20 dB SNR)

Comparison between non-Lombard (NL) and Lombard (L) systems.

[Playable media: AO-L, AO-NL, VO-L, VO-NL, AV-L and AV-NL versions of the utterances "Lay blue by G zero soon", "Bin green by Q zero again" and "Bin blue in Z seven please".]

Speech Enhancement (-10 dB SNR)

Comparison between non-Lombard (NL) and Lombard (L) systems.

[Playable media: AO-L, AO-NL, VO-L, VO-NL, AV-L and AV-NL versions of the utterances "Lay blue by G zero soon", "Bin green by Q zero again" and "Bin blue in Z seven please".]

Speech Enhancement (0 dB SNR)

Comparison between non-Lombard (NL) and Lombard (L) systems.

[Playable media: AO-L, AO-NL, VO-L, VO-NL, AV-L and AV-NL versions of the utterances "Lay blue by G zero soon", "Bin green by Q zero again" and "Bin blue in Z seven please".]

Estimated Speech Quality and Intelligibility

As above, performance is evaluated in terms of PESQ and ESTOI, here comparing the systems trained on non-Lombard speech with those trained on Lombard speech.

References

[1] H. Brumm and S. A. Zollinger, "The evolution of the Lombard effect: 100 years of psychoacoustic research," Behaviour, vol. 148, no. 11-13, pp. 1173–1198, 2011.
[2] J.-C. Junqua, "The Lombard reflex and its role on human listeners and automatic speech recognizers," The Journal of the Acoustical Society of America, vol. 93, no. 1, pp. 510–524, 1993.
[3] A. L. Pittman and T. L. Wiley, "Recognition of speech produced in noise," Journal of Speech, Language, and Hearing Research, vol. 44, no. 3, pp. 487–496, 2001.
[4] M. Garnier, L. Ménard, and B. Alexandre, "Hyper-articulation in Lombard speech: An active communicative strategy to enhance visible speech cues?," The Journal of the Acoustical Society of America, vol. 144, no. 2, pp. 1059–1074, 2018.
[5] J. H. L. Hansen and V. Varadarajan, "Analysis and compensation of Lombard speech across noise type and levels with application to in-set/out-of-set speaker recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 2, pp. 366–378, 2009.
[6] A. Gabbay, A. Shamir, and S. Peleg, "Visual speech enhancement," in Proc. of Interspeech, 2018.
[7] D. Michelsanti, Z.-H. Tan, S. Sigurdsson, and J. Jensen, "On training targets and objective functions for deep-learning-based audio-visual speech enhancement," arXiv preprint: https://arxiv.org/abs/1811.06234.
[8] N. Alghamdi, S. Maddock, R. Marxer, J. Barker, and G. J. Brown, "A corpus of audio-visual Lombard speech with frontal and profile views," The Journal of the Acoustical Society of America, vol. 143, no. 6, pp. EL523–EL529, 2018.
[9] D. Michelsanti, Z.-H. Tan, S. Sigurdsson, and J. Jensen, "Effects of Lombard Reflex on the Performance of Deep-Learning-Based Audio-Visual Speech Enhancement Systems," arXiv preprint: https://arxiv.org/abs/1811.06250.
