Patrol Team Language Identification System for DARPA RATS P1 Evaluation, Pavel Matejka


1 Patrol Team Language Identification System for DARPA RATS P1 Evaluation
Pavel Matejka 1, Oldrich Plchot 1, Mehdi Soufifar 1, Ondrej Glembek 1, Luis Fernando D’Haro 1, Karel Vesely 1, Frantisek Grezl 1, Jeff Ma 2, Spyros Matsoukas 2, and Najim Dehak 3
1 Brno University of Technology and IT4I Center of Excellence, Czech Republic
2 Raytheon BBN Technologies, Cambridge, MA, USA
3 MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, MA, USA

2 Outline
 About the DARPA RATS program
 Datasets and task description
 Subsystems with analysis
 Fusion and results
 Conclusion

3 DARPA RATS Program
 RATS = Robust Automatic Transcription of Speech
 Goal: create algorithms and software for performing the following tasks on speech-containing signals received over communication channels that are extremely noisy and/or highly distorted
 Tasks:
–Speech Activity Detection
–Keyword Spotting
–Language Identification
–Speaker Identification
 Data collector: LDC
 Evaluation by SAIC
 Performer: PATROL team, led by BBN

4 Data Specification
 Languages:
–Dari, Levantine Arabic, Urdu, Pashtu, Farsi
–>10 out-of-set languages
 Durations: 120 s, 30 s, 10 s, 3 s
 Telephone conversations retransmitted over 8 noisy radio communication channels (marked A-H)
 Available: collections of 2-min audio samples
–LDC2011E95 – split into train and dev by SAIC
–LDC2011E111 – split into train and dev by the Patrol team
–LDC2012E03 – supplemental training data for non-target languages
 The amount of audio data is heavily unbalanced across languages
 Added shorter-duration samples
–Derived from the 2-min samples, based on our SAD output

5 Datasets
 Train
–Main: only files where SAD detects >60 s of speech; unbalanced – 668 files for Dari, for Levantine Arabic
–Balanced: balanced over files for each language and channel; 7150 files for each duration; 673 files for Dari, otherwise ~1300 per language
–Extended: Main + all 30-sec cuts from the Main set + the entire LDC2012E03 set (only non-target languages); ~170k segments
 Development set
–Corpus size was driven by Dari – only 679 source files; other languages limited to 1000 files; 2432 files for non-target languages
–~7120 files for each duration
 Evaluation data
–2527 files for each duration

6 LID Patrol System Architecture
[Diagram] Audio passes through two speech activity detectors (BBN SAD and BUT SAD) into four subsystems (a phonotactic iVector LID fed by a Czech phoneme recognizer, two acoustic iVector LID systems, and a JFA LID system), whose scores are combined by calibration & fusion into the combined score.

7 Speech Activity Detection
 One of the most important blocks, since the data are really difficult
–See the separate paper about SAD development on Wednesday, 16:00, in Pavilon West
 Used both GMM-based (BBN) and MLP-based (BUT) detectors
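As a toy illustration of the model-based detectors mentioned above (reduced here to a single diagonal Gaussian per class rather than a full GMM or MLP, with hypothetical means and variances), a frame can be labeled as speech when its speech-vs-nonspeech log-likelihood ratio exceeds a threshold:

```python
import numpy as np

def loglike_diag(x, mean, var):
    """Frame-wise log-likelihood under a diagonal Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var).sum(axis=1)

def llr_vad(feats, sp_mean, sp_var, ns_mean, ns_var, thr=0.0):
    """Toy sketch of model-based SAD: a frame is speech if its
    speech-vs-nonspeech log-likelihood ratio exceeds the threshold."""
    llr = loglike_diag(feats, sp_mean, sp_var) - loglike_diag(feats, ns_mean, ns_var)
    return llr > thr
```

A real detector would use full mixture models per class and smooth the frame decisions over time.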

8 Speech Activity Detection
 Comparison of different SAD systems
–Robust SAD tuned for noisy telephone speech
–Robust SAD tuned for RATS
 Results (Cavg) are on the DEV set (but scored with the SRC channel)
 iVector system (600-dim) used for this experiment
 25% relative gain

SAD type | Cavg [%] (120 s)
Telephone | 2.2
RATS | 1.6
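The Cavg numbers above are average detection costs. A minimal sketch of how such a metric might be computed (a simplification of the official RATS scoring, with a hypothetical hard decision threshold at 0):

```python
import numpy as np

def cavg(scores, labels, languages, threshold=0.0):
    """Simplified average detection cost: for each target language,
    average the miss rate on target trials and the false-alarm rate
    on non-target trials, then average over languages.
    `scores` is (n_trials, n_languages); `labels` holds the true
    language index of each trial."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    costs = []
    for li, lang in enumerate(languages):
        tgt = labels == li
        non = ~tgt
        p_miss = np.mean(scores[tgt, li] < threshold) if tgt.any() else 0.0
        p_fa = np.mean(scores[non, li] >= threshold) if non.any() else 0.0
        costs.append(0.5 * (p_miss + p_fa))
    return float(np.mean(costs))
```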

9 iVector LID System (BUT)
 Acoustic features
–Dithering; bandwidth Hz for 25 Mel filters; 6 MFCCs + C0
–CMN/CVN (based on SAD), RASTA normalization
–Shifted Delta Cepstra (SDC)
 UBM
–Language-independent, diagonal-covariance, 2048 Gaussians
–Trained on the balanced train set
 iVector
–600 dimensions
–Trained on the main set
 Neural network classifier
–iVector input, 6 outputs (1 non-target + 5 target languages)
–Hidden layer with 200 nodes
–Stochastic gradient descent training with L2 regularization
–Trained on the extended set (all data + all 30-sec splits)
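The SDC stacking in the feature list can be sketched as follows, assuming the common N-d-P-k parameterization (the exact values used by the system are not stated on this slide; 7-1-3-7 is a frequent choice in LID):

```python
import numpy as np

def shifted_delta_cepstra(cep, d=1, p=3, k=7):
    """Sketch of Shifted Delta Cepstra stacking: for each frame t,
    stack k delta vectors computed at offsets t, t+p, ..., t+(k-1)*p,
    where each delta spans +/- d frames. `cep` is (n_frames, n_ceps);
    edges are handled by repeating the boundary frames."""
    n, dim = cep.shape
    pad = np.pad(cep, ((d, d + (k - 1) * p), (0, 0)), mode='edge')
    blocks = []
    for i in range(k):
        off = d + i * p  # padded index of frame t + i*p for t = 0
        blocks.append(pad[off + d: off + d + n] - pad[off - d: off - d + n])
    return np.hstack(blocks)  # (n_frames, k * n_ceps)
```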

10 Comparison of Logistic Regression and a Neural Network as the Final Classifier
 BUT iVector system (600-dim); results on the development set
 Logistic regression
–Trained by trust-region conjugate gradient descent
 Neural network
–One hidden layer with 200 nodes
–Trained by stochastic gradient descent with L2 regularization
–Also experimented with conjugate gradient descent, but no improvement
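A minimal sketch (not the authors' code) of such a final classifier: one hidden layer over the i-vector input, softmax over the language classes, trained by stochastic gradient descent on cross-entropy with L2 regularization. All sizes and the tanh nonlinearity are illustrative assumptions:

```python
import numpy as np

class TinyLIDNet:
    """One-hidden-layer softmax classifier over i-vectors."""
    def __init__(self, dim, hidden=200, n_classes=6, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0, 0.1, (dim, hidden))
        self.b1 = np.zeros(hidden)
        self.W2 = rng.normal(0, 0.1, (hidden, n_classes))
        self.b2 = np.zeros(n_classes)

    def forward(self, X):
        self.h = np.tanh(X @ self.W1 + self.b1)
        z = self.h @ self.W2 + self.b2
        z -= z.max(axis=1, keepdims=True)        # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)  # class posteriors

    def sgd_step(self, X, y, lr=0.05, l2=1e-4):
        """One SGD step on a minibatch; y holds integer class labels."""
        n = len(X)
        P = self.forward(X)
        G = P.copy()
        G[np.arange(n), y] -= 1.0                # dL/dz for cross-entropy
        G /= n
        dW2 = self.h.T @ G + l2 * self.W2
        db2 = G.sum(axis=0)
        Gh = (G @ self.W2.T) * (1 - self.h ** 2) # backprop through tanh
        dW1 = X.T @ Gh + l2 * self.W1
        db1 = Gh.sum(axis=0)
        self.W2 -= lr * dW2; self.b2 -= lr * db2
        self.W1 -= lr * dW1; self.b1 -= lr * db1
```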

11 JFA LID System (BUT)
 Acoustic features
–Same as for the iVector system + Wiener filtering
 Universal Background Model (UBM)
–Language-independent, diagonal-covariance, 2048 Gaussians
–Trained on the balanced train set
 JFA
–Trained on the main train set
–μ = m + Dz + Ux
–Language models (D) are MAP-adapted from the UBM with relevance factor tau = 10
–Channel matrix U with 200 dimensions
–Linear scoring
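With MAP-adapted language offsets, linear scoring reduces to a dot product between the language offset Dz and the channel-compensated first-order statistics. A hedged sketch under simplifying assumptions (diagonal D and diagonal UBM covariances stored as vectors; F and N are the utterance's first- and zeroth-order statistics expanded to supervector size):

```python
import numpy as np

def jfa_linear_score(z_lang, D, m, F, N, U, x, sigma):
    """Sketch of JFA linear scoring: score the language supervector
    offset D*z against the utterance statistics, centered on the UBM
    mean m and compensated for the channel term U*x."""
    offset = D * z_lang                  # language offset (D diagonal, as a vector)
    centered = F - N * (m + U @ x)       # remove UBM mean and channel shift
    return float(offset @ (centered / sigma))
```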

12 Importance of the Wiener Filter
 400-dim i-vector + logistic regression experimental system
 Results on the development set
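The slide does not give the filter's details; a minimal single-channel sketch on power spectra, assuming the noise spectrum can be estimated from the first few (speech-free) frames, which is an assumption of this sketch:

```python
import numpy as np

def wiener_denoise(frames_pow, noise_frames=10, floor=0.1):
    """Apply a per-bin Wiener gain S/(S+N) to power spectra, where the
    clean-speech estimate S comes from spectral subtraction of a noise
    spectrum averaged over the first `noise_frames` frames."""
    noise = frames_pow[:noise_frames].mean(axis=0)
    clean_est = np.maximum(frames_pow - noise, 0.0)  # spectral-subtraction estimate
    gain = clean_est / (clean_est + noise + 1e-12)
    gain = np.maximum(gain, floor)                   # gain floor limits musical noise
    return gain * frames_pow
```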

13 iVector LID System (BBN)
 Acoustic features
–RASTA-PLP
–Blocks of 11 PLP frames, projected to 60 dimensions via HLDA
 UBM
–Language-dependent (5 target, 1 "non-target"), 1024 Gaussians
 iVector
–400 dimensions
–Group adjacent speech segments into 20-s chunks and estimate one iVector per chunk; improves performance on short-duration conditions by 28%
–Estimate 6 iVectors (one per UBM)
–Apply a neural network (NN) to each iVector, with 6 outputs (1 non-target + 5 target languages)
–Combine the NN outputs to form a 6-dimensional score vector; 26% relative improvement compared to using language-independent i-vectors
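The 20-s chunking step can be sketched as a greedy grouping of adjacent SAD speech segments (segment boundaries in seconds; the exact grouping policy is an assumption of this sketch):

```python
def chunk_segments(segments, target=20.0):
    """Greedily group adjacent speech segments (start, end) until a
    chunk reaches the target duration, then start a new chunk; one
    i-vector would be extracted per resulting chunk."""
    chunks, cur, cur_dur = [], [], 0.0
    for start, end in segments:
        cur.append((start, end))
        cur_dur += end - start
        if cur_dur >= target:
            chunks.append(cur)
            cur, cur_dur = [], 0.0
    if cur:  # keep any leftover speech as a final, shorter chunk
        chunks.append(cur)
    return chunks
```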

14 Analysis of the iVector LID System (BBN)
Analysis of the BBN iVector extractor training and UBM:
1. Whole audio segments, single UBM
2. Audio split into 20-s segments, single UBM
3. Audio split into 20-s segments, language-dependent background models (LDBM)

15 Phonotactic iVector LID System (BUT)
 Phoneme recognizer
–Czech CTS recognizer trained on artificially noised data: noise with varying SNR (lowest 10 dB) added to 30% of the corpus
–38 phonemes
–3-gram counts: sums of posterior probabilities of 3-grams from phone lattices
 iVector – multinomial subspace modeling
–600 dimensions, trained on the main train set
–Trains a low-dimensional subspace in the framework of the total variability model using a multinomial distribution
–Uses the point estimate of the model's latent variable for each utterance as the new features
 Logistic regression as the final classifier
–Trained on the main train set
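The 3-gram statistics can be illustrated with a simplified version that accumulates posterior-weighted counts over phone-string hypotheses (a simplification: the system described above sums posteriors over full phone lattices, not n-best strings):

```python
from collections import Counter

def trigram_counts(hypotheses):
    """Expected 3-gram counts as posterior-weighted sums over phone
    hypotheses. Each hypothesis is a (posterior, phone_list) pair."""
    counts = Counter()
    for post, phones in hypotheses:
        for i in range(len(phones) - 2):
            counts[tuple(phones[i:i + 3])] += post
    return counts
```

The resulting count vector (one entry per possible 3-gram) is what the multinomial subspace model compresses into a 600-dimensional i-vector.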

16 Fusion and Calibration
 Regularized logistic regression
–Objective: minimize cross-entropy on the development set
–Duration-independent – trained on files from the 10-s, 30-s, and 120-s conditions
 Procedure
–Calibrate (tune) each system individually
–Combine the calibrated system outputs into a single output vector; fusion parameters estimated on the same development set
 Performance evaluation
–Primarily the Cavg score
–Also computed P_miss and P_FA at the Phase 1 target operating points
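A hedged sketch of the fusion step: learn one scalar weight per subsystem plus a per-language bias by minimizing multiclass cross-entropy with L2 regularization on development scores. Plain full-batch gradient descent is used here for simplicity; the slide only specifies regularized logistic regression, not the optimizer:

```python
import numpy as np

def train_fusion(dev_scores, labels, lr=0.1, l2=1e-3, iters=500):
    """Learn fusion weights w (one per subsystem) and per-language
    bias b on `dev_scores` of shape (n_systems, n_trials, n_langs)."""
    S, n, L = dev_scores.shape
    w = np.ones(S)
    b = np.zeros(L)
    Y = np.eye(L)[labels]                                # one-hot targets
    for _ in range(iters):
        z = np.tensordot(w, dev_scores, axes=1) + b      # fused scores (n, L)
        z -= z.max(axis=1, keepdims=True)                # softmax stability
        P = np.exp(z)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / n                                  # cross-entropy gradient wrt z
        gw = np.array([(G * dev_scores[s]).sum() for s in range(S)]) + l2 * w
        gb = G.sum(axis=0)
        w -= lr * gw
        b -= lr * gb
    return w, b
```

At test time the learned (w, b) are applied to the subsystem scores of each trial; an informative subsystem should end up with a larger weight than a noisy one.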

17 Overall Results

18 Robustness
 Channel B is completely removed from the training of the contrastive system (noB), so channel B is an unseen channel for it
 Results on the development set with the BUT iVector system (600-dim), overall and for channel B only:

System / Cavg [%] | 120 s | 30 s | 10 s | 3 s (overall)
iVector NN
iVector NN noB

System / Cavg [%] | 120 s | 30 s | 10 s | 3 s (channel B only)
iVector NN
iVector NN noB

19 Conclusion
 SAD is crucial
 De-noising helps
 Benefit from using language-dependent UBMs
 Benefit from using a NN as the final classifier for LID