ISL Meeting Recognition Hagen Soltau, Hua Yu, Florian Metze, Christian Fügen, Yue Pan, Sze-Chen Jou Interactive Systems Laboratories.


RT-02 Workshop

Outline
– Experiments on SWB
– Adaptation to meeting data
– Multi-domain training for meeting recognition
– Model-based acoustic mapping

SWB: from 1997 to Now
– '97 eval system [Finke97]:
  – Frontend: CMS, no CVN, LDA
  – AM: 25k distributions defined over 10k codebooks
  – LM: 3-gram SWB + class SWB + 4-gram BN
  – Multiple search passes to estimate VTLN, MLLR
  – 45.1% on eval97 (the best result was 45.0%)
– Tested the '97 SWB system on eval2001: 36.5% error rate
  – More than 11% worse than the 2001 top system

Training Data
– Used ISIP transcripts
  – Re-checked segments by flexible transcription alignment
  – Skipped all turns containing only noises or single words
  – No gain compared to the '97 transcripts!
– Development data:
  – 1h subset of eval2001

Traditional MFCC Front-end

FFT → Mel-scale filterbank → log → DCT → CMN → Δ, ΔΔ → LDA → MLLT

Observations:
– Many linear transform stages
– Many dimensionality-reduction stages
– Many different criteria!
Can we streamline this process using data-driven optimization?

Optimizing the MFCC Front-end
– Δ, ΔΔ can be generalized by concatenating N adjacent frames, then using LDA to choose the final projection
– The DCT can be removed without affecting performance
– The Mel-scale filterbank can be removed, but at a large increase in computation
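A minimal sketch of this data-driven front-end step, assuming NumPy: adjacent frames are stacked in place of Δ/ΔΔ features, and an LDA projection is estimated from frame-level class labels (here random stand-ins; in a real system they would come from a forced alignment). The dimensions and regularization constant are illustrative only.

```python
import numpy as np

def stack_frames(feats, context=7):
    """Concatenate each frame with its +/-context neighbours
    (15 frames total for context=7), replacing delta/delta-delta features."""
    T = feats.shape[0]
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

def lda_projection(X, labels, out_dim=42):
    """Estimate an LDA projection from labelled frames:
    solve the generalized eigenproblem Sb v = lambda Sw v."""
    dim = X.shape[1]
    mean = X.mean(axis=0)
    Sb = np.zeros((dim, dim))   # between-class scatter
    Sw = np.zeros((dim, dim))   # within-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        Sb += len(Xc) * np.outer(mc - mean, mc - mean)
        Sw += (Xc - mc).T @ (Xc - mc)
    # small ridge term keeps Sw invertible on little data
    M = np.linalg.solve(Sw + 1e-6 * np.eye(dim), Sb)
    evals, evecs = np.linalg.eig(M)
    order = np.argsort(-evals.real)
    return evecs[:, order[:out_dim]].real

np.random.seed(0)
feats = np.random.randn(100, 13)            # hypothetical 13-dim cepstra
stacked = stack_frames(feats)               # 100 x 195
labels = np.random.randint(0, 5, size=100)  # stand-in state labels
W = lda_projection(stacked, labels, out_dim=42)
projected = stacked @ W                     # 100 x 42 final features
```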

Front-end Experiments on SWB

System                            WER (%)
Baseline                          39.8*
+ data-driven Δ, ΔΔ, plain CVN    39.7
+ SCMN                            37.8
+ MLLT                            35.6

Test set: hub5e_01 subset.
* The baseline is trained on 180 hrs; the other systems use a 66-hr subset for training.

Frontend and Semi-tied Covariances
– Front-end:
  – Speaker-based CVN
  – 15 adjacent frames instead of deltas and delta-deltas: 34.2% → 33.1%
– Semi-tied covariances:

              No MLLR   MLLR
No STC        36.7%     34.1%
Global        33.7%     32.2%
Per phone     33.4%     33.1%
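Speaker-based CVN as used here can be sketched in a few lines, assuming NumPy: each cepstral dimension is normalised to zero mean and unit variance over all frames of one speaker. The feature dimensions below are illustrative.

```python
import numpy as np

def cvn(feats, eps=1e-8):
    """Speaker-based cepstral mean and variance normalisation:
    zero mean, unit variance per dimension over one speaker's frames."""
    mu = feats.mean(axis=0)
    sigma = feats.std(axis=0)
    return (feats - mu) / (sigma + eps)

np.random.seed(0)
speaker_feats = np.random.randn(200, 13) * 3.0 + 5.0  # hypothetical cepstra
normalised = cvn(speaker_feats)
```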

AM Training
– Decision tree:
  – 10000 context-dependent states
  – Increased context from ±2 to ±3: 34.7% → 34.2%
  – Modalities
– Growing of Gaussians:

              K-means   Incr. growing
10000 × 24    33.8%     33.1%
10000 × 32    33.7%     32.4%

Speaker Adaptive Training
– Feature-space adaptation (FSA):
  – Training: a single FSA matrix for each conversation side
  – Decoding: estimate the FSA matrix first, then compute MLLR matrices on the adapted feature space
  – System A: 34.8% → 33.7%
  – System B: 30.9% → 30.8% (better frontend, STC, ...)
– SAT (model space):
  – Dynamic number of transforms (~10) for each training speaker
  – Full and diagonal transforms tried
  – No gains, even on the non-STC system!
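Applying such a per-side feature-space transform is just an affine map of every frame, sketched below with NumPy. The transform parameters A and b are hypothetical stand-ins; in the system above they would be estimated by maximum likelihood on the adaptation data (the estimation itself is not shown).

```python
import numpy as np

def apply_fsa(feats, A, b):
    """Apply one speaker-level feature-space adaptation transform
    (fMLLR-style affine map x' = A x + b) to a matrix of frames."""
    return feats @ A.T + b

np.random.seed(0)
D = 13
feats = np.random.randn(50, D)
A = np.eye(D) + 0.01 * np.random.randn(D, D)  # hypothetical estimated matrix
b = 0.1 * np.random.randn(D)                  # hypothetical estimated bias
adapted = apply_fsa(feats, A, b)
```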

SWB Summary
– LM: interpolation with a 5-gram class SWB LM: 33.5% → 32.9%
– Result on eval2001: 29.5%
  – But this is a single system only
  – No ROVER, no consensus decoding
– Next steps:
  – Fix the SAT problems
  – Improve the LM (BN corpus, distance n-grams)
  – Modality-dependent training on SWB
    – Integrate gender, speaking rate, etc. into the decision tree
    – Earlier work covered dialects and hyperarticulated speech [Fuegen2000, Soltau2000]
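The LM gain above comes from linear interpolation of two models. A minimal sketch, with a hypothetical weight lam (in practice tuned on held-out data, e.g. by EM on perplexity) and toy next-word distributions standing in for the word and class 5-grams:

```python
def interpolate_lm(word_lm, class_lm, lam=0.7):
    """Linearly interpolate two next-word probability distributions:
    p(w) = lam * p_word(w) + (1 - lam) * p_class(w)."""
    vocab = set(word_lm) | set(class_lm)
    return {w: lam * word_lm.get(w, 0.0) + (1.0 - lam) * class_lm.get(w, 0.0)
            for w in vocab}

# toy distributions for one history; real models cover the full vocabulary
p_word = {"the": 0.5, "a": 0.3, "an": 0.2}
p_class = {"the": 0.4, "a": 0.4, "an": 0.2}
p_mix = interpolate_lm(p_word, p_class, lam=0.7)
```

Because both inputs are proper distributions, the interpolated model sums to one as well, so no renormalisation is needed.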

Experiments on Meetings
– LIMSI-style automatic partitioning scheme for the table mic, manual segmentation for the personal mics
– AM: SWB 244k Gaussians, BN 104k Gaussians
– BN LM, 40k vocabulary
– First-pass decoding, no adaptation

[Tables: WER on the b008 and b009 devtest sets, personal mic vs. table mic, for SWB models vs. BN models; the numeric values did not survive transcription]

The Cross-talk Challenge
– Cross-talk causes many problems, especially in the table-mic case
– The current scoring tool does not handle overlap correctly!
– Scoring on the non-crosstalk region (using word-level alignment)
– But cross-talk is much more than just a scoring issue!

[Table: b008 table-mic WER for SWB and BN models, original scoring on the entire devtest vs. scoring on the non-crosstalk region; the numeric values did not survive transcription]
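Restricting scoring to the non-crosstalk region can be sketched as a filter over time-aligned reference words: any word whose span overlaps another speaker's speech is excluded. The function and data layout below are illustrative, not the actual scoring tool.

```python
def non_crosstalk_words(ref_words, other_spans):
    """Keep only words whose time span does not overlap any other speaker's
    speech. ref_words: (word, start, end) tuples; other_spans: (start, end)
    intervals of all other speakers' activity, in seconds."""
    def overlaps(start, end):
        return any(start < o_end and o_start < end
                   for o_start, o_end in other_spans)
    return [w for w in ref_words if not overlaps(w[1], w[2])]

words = [("so", 0.0, 0.4), ("what", 0.4, 0.7), ("next", 0.7, 1.1)]
others = [(0.3, 0.6)]  # another speaker talks from 0.3s to 0.6s
clean = non_crosstalk_words(words, others)  # keeps only ("next", 0.7, 1.1)
```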

Multi-domain Training
– Combined BN with ESST
  – ESST = 30h of conversational speech, but over clean channels
  – Tested on an in-house meeting dev set (4 meetings, 1h)

Training data   WER
ESST            54.1%
BN              44.2%
ESST + BN       42.2%
SWB             42.0%

Model-Combination-Based Acoustic Mapping
– MAM [Westphal2001] tries to find a non-linear mapping of feature vectors using a pair of corresponding clean and noisy GMMs (originally used for car data and distant talking)
– The SNR differs for each speaker, so we segmented the speakers by reference and tested each one individually
– Finally, we concatenate the hypotheses of all speakers to score the overall error rate
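The core idea of a GMM-pair feature mapping can be sketched as follows, assuming NumPy: compute component posteriors of the noisy vector under the noisy GMM and shift it by the posterior-weighted clean-minus-noisy mean offsets. The actual MAM mapping [Westphal2001] differs in detail, and all model parameters here are hypothetical.

```python
import numpy as np

def gmm_pair_map(x, weights, noisy_means, noisy_vars, clean_means):
    """Map a noisy feature vector towards the clean space using a pair of
    GMMs with corresponding components (diagonal covariances)."""
    diff = x - noisy_means                                     # (K, D)
    log_lik = -0.5 * np.sum(diff ** 2 / noisy_vars
                            + np.log(2.0 * np.pi * noisy_vars), axis=1)
    log_post = np.log(weights) + log_lik
    post = np.exp(log_post - log_post.max())                   # stable softmax
    post /= post.sum()
    # posterior-weighted sum of per-component clean-minus-noisy offsets
    return x + post @ (clean_means - noisy_means)

# single-component case: the mapping reduces to an exact mean shift
x = np.array([1.0, 2.0])
mapped = gmm_pair_map(x, np.array([1.0]),
                      np.array([[0.0, 0.0]]), np.array([[1.0, 1.0]]),
                      np.array([[0.5, 0.5]]))   # -> [1.5, 2.5]
```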

Acoustic Mapping Result
– Tested on the b008 devtest (6 speakers, 600 sec.)
– Overall result:
  – BN baseline: 68.6%
  – BN baseline + AM: 66.5%

[Table: per-speaker error rates for the BN baseline and BN baseline + AM; speakers a (203 s), n (177 s), j (126 s), l (77 s), c (45 s), jc (22 s); the numeric values did not survive transcription]