Jon Barker, Ricard Marxer (University of Sheffield); Emmanuel Vincent (Inria); Shinji Watanabe (MERL). ASRU 2015, Scottsdale. The 3rd CHiME Speech Separation and Recognition Challenge

The 3rd CHiME Speech Separation and Recognition Challenge
Overview:
– Background: the 1st and 2nd CHiME challenges
– The CHiME-3 scenario and task design
– The baseline enhancement and ASR systems
– The challenge results
– Some findings and points for discussion
15th Dec 2015

CHiME-3 Background
The 1st CHiME challenge (2011) was supported by the EU PASCAL network: Computational Hearing in Multisource Environments.
– Speech from the Grid corpus, i.e. simple command sentences.
– Noise from binaural recordings of domestic environments.
– Speech and noise mixtures simulated using impulse responses recorded 2 m from the microphones.
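The simulated-mixture idea above can be sketched as follows. This is an illustrative Python/numpy sketch, not the challenge's released MATLAB code; the function name and the fixed-SNR mixing are assumptions for illustration. Clean speech is convolved with a recorded impulse response, then background noise is added at a target SNR.

```python
import numpy as np

def simulate_mixture(speech, noise, rir, snr_db):
    """Illustrative sketch: convolve clean speech with a room impulse
    response, then add background noise scaled to a target SNR (dB)."""
    # Reverberant speech: clean speech filtered by the impulse response
    reverberant = np.convolve(speech, rir)[:len(speech)]
    noise = noise[:len(reverberant)]
    speech_power = np.mean(reverberant ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale so that 10*log10(speech_power / scaled_noise_power) == snr_db
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return reverberant + scale * noise
```

The actual challenge simulation is more elaborate (time-varying filters, per-microphone responses), but the SNR-controlled additive mixing step is the common core.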

CHiME-3 Background
Top systems came very close to human performance, but the ASR task was too narrow and too artificial.

CHiME-3 Background
The 2nd CHiME challenge, held after ICASSP 2013, tried to address the biggest limitations of the 1st challenge:
– Artificial mixing, but with time-varying impulse responses to simulate small talker movements.
– Progressed from the Grid corpus to the WSJ 5k task.
A step in the right direction, but doubts remained over the validity of using artificially mixed test data.

CHiME-3 Objectives
Feedback from the 1st and 2nd CHiME challenges led to the following objectives for CHiME-3:
– A commercially relevant scenario, e.g. a move from binaural recording to a conventional mic array.
– A larger variety of noise environments.
– Increased realism of data, i.e. from artificial mixing to speech spoken and recorded live in noise.
We also wanted to explicitly examine the role of simulated data (i.e. artificially mixed speech + noise):
– Is simulated data useful for augmenting training data?
– Can we trust evaluations that use simulated test data?
– How can noisy mic array speech data best be simulated?

The CHiME-3 Scenario
“ASR running on a mobile tablet device being used in noisy everyday settings, e.g. cafés, on the street, etc.”

The CHiME-3 Hardware
An Android tablet with a custom-built surround holding 6 microphones: 5 facing forward and one facing backward.

The CHiME-3 Recording Set-up
A portable, battery-powered recording set-up that records the 6 tablet mics and a close-talking headset mic onto a pair of external digital recorders.

CHiME-3 Speech Data
– CHiME-3 is based on the WSJ 5k task.
– 12 native US speakers (6 male, 6 female), divided into 4 training, 4 dev and 4 test speakers.
– 4 recording environments (café, street, pedestrian area, bus).
– Dev and test sets are the same as WSJ0 (i.e. 410 and 330 utterances), recorded in each of the 4 environments.
– Training data: a 1600-utterance subset of the WSJ training data recorded in real environments, plus 7138 simulated mixtures (WSJ + CHiME background noise).

The CHiME-3 Noise Environments
(Photos: sitting in a café; standing at a street junction; travelling on a bus; in a pedestrian area.)

CHiME-3 Baseline System
Three components:
– Baseline simulation (signals + MATLAB code)
– Baseline enhancement (signals + MATLAB code)
– Baseline ASR (Kaldi recipe)

CHiME-3: Baseline Systems
Baseline simulation: can simulated data be used to augment the limited amount of real training data?
Technique:
1. Estimate the SNR at each tablet mic using the close-talking mic.
2. Track the speaker location using SRP-PHAT and calculate the time-varying delays to each tablet microphone.
3. Convolve WSJ utterances with the time-varying filters.
4. Apply a filter to match the microphone frequency response.
5. Mix with background noise collected in the CHiME-3 environments.
Issues: no Lombard-like effects or reverberation; speaker tracking can be poor; the true SNR is hard to estimate.
(Audio examples: original clean vs. simulated noisy.)
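The speaker-tracking step rests on steered-response power with a phase transform (SRP-PHAT), whose core ingredient is a GCC-PHAT delay estimate between a microphone pair. A minimal Python/numpy sketch of that ingredient (not the baseline's MATLAB implementation; the function name and interface are assumptions):

```python
import numpy as np

def gcc_phat(sig, ref, fs=16000):
    """Estimate the delay (in seconds) of `sig` relative to `ref` using
    GCC-PHAT: cross-spectrum whitened to keep only phase information."""
    n = len(sig) + len(ref)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    cross = SIG * np.conj(REF)
    cross /= np.abs(cross) + 1e-12  # PHAT weighting: discard magnitude
    cc = np.fft.irfft(cross, n=n)
    max_shift = n // 2
    # Re-centre so that lags run from -max_shift to +max_shift - 1
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift]))
    shift = np.argmax(cc) - max_shift
    return shift / fs
```

SRP-PHAT sums such whitened cross-correlations over all microphone pairs and picks the candidate location maximising the summed response; this sketch shows only the pairwise delay estimate.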

CHiME-3: Baseline Systems
Baseline enhancement: MVDR beamforming.
– The multichannel noise covariance matrix is estimated using up to 800 ms of context prior to the utterance.
Baseline ASR: two baseline Kaldi ASR systems.
– GMM system: triphone models, LDA, MLLT, fMLLR, SAT.
– DNN system: pre-training using RBMs, cross-entropy training, sequence-discriminative training.
(Audio examples: simulated mixture and enhanced; real mixture and enhanced.)
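Per frequency bin, the MVDR beamformer uses weights w = R⁻¹d / (dᴴR⁻¹d), where R is the noise covariance (here estimated from the pre-utterance context, as in the baseline) and d the steering vector towards the speaker. A minimal Python/numpy sketch under those assumptions (illustrative, not the baseline MATLAB code):

```python
import numpy as np

def estimate_noise_cov(context_stft):
    """Spatial noise covariance for one frequency bin.
    context_stft: (channels, frames) complex STFT of pre-utterance noise."""
    frames = context_stft.shape[1]
    return context_stft @ context_stft.conj().T / frames

def mvdr_weights(noise_cov, steering):
    """MVDR weights: minimise output noise power subject to a
    distortionless response in the steering direction (w^H d = 1)."""
    rinv_d = np.linalg.solve(noise_cov, steering)  # R^{-1} d
    return rinv_d / (steering.conj() @ rinv_d)     # normalise by d^H R^{-1} d
```

Applying `w.conj() @ y` to each multichannel STFT bin `y` yields the enhanced single-channel spectrum; the distortionless constraint guarantees the target direction passes with unit gain.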

CHiME-3: Baseline Word Error Rates
Development data (WER %, Simulated / Real):
– GMM, clean training data, noisy test data
– GMM, noisy training data, noisy test data: 18.7 (Real)
– GMM, noisy training data, enhanced test data
– DNN, noisy training data, noisy test data
– DNN, noisy training data, enhanced test data
Final test data (WER %, Simulated / Real):
– DNN, noisy training data, noisy test data
– DNN, noisy training data, enhanced test data
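The WER figures above are computed in the standard way: the Levenshtein (edit) distance between reference and hypothesis word sequences, divided by the reference length. A minimal self-contained sketch:

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + insertions + deletions) / reference
    length, via dynamic-programming edit distance over word tokens."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

For example, `wer("a b c d", "a x c")` is 0.5: one substitution plus one deletion over a four-word reference.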

CHiME-3: Submissions
– The full LDC-licensed dataset was distributed to 65 sites.
– Received 26 official submissions.
– Large teams: an average of 5 authors per paper.
– Involvement from 36 institutions; an even split between US, Europe and Asia.
– A mix of academic and industrial participants, e.g. Hitachi, NTT, MERL.
– Bias towards signal processing researchers; failed to attract participation from many big speech groups.


CHiME-3: General Conclusions
– The best WER of 5.8% is very close to noise-free speech performance. A solved problem?
– Performance on simulated data is often a poor predictor of performance on real data, which highlights the need for caution when considering challenges that use artificial mixing.
– Simulated training data is a valuable tool when real data is in short supply, but care is needed to avoid mismatch. In particular, the baseline simulated data responded differently to mic array processing, leading to mismatched enhanced signals.
– The biggest gains w.r.t. the baseline came from improved multichannel signal processing, feature normalization and language modelling.
– It is important to have a strong and accessible baseline, but difficult to prepare one when data challenge timescales are short. Kaldi was invaluable.
– We've released a new Kaldi baseline with BeamformIt array processing, fMLLR DNN features and 5-gram RNN LM rescoring; it scores 12.8% (cf. 33.4% for the initially distributed baseline).
– What next?


CHiME-3: Future Directions
– Fewer microphones?
– Mismatched noise conditions.
– A more challenging, bigger-scale task:
 – Larger talker-microphone distances.
 – More complex speech.
 – A greater number of noise backgrounds and speakers.

Thank you for listening