Recent work on Language Identification

Recent work on Language Identification
Pietro Laface, POLITECNICO di TORINO
Brno, 28-06-2009

Team
POLITECNICO di TORINO:
- Pietro Laface, Professor
- Fabio Castaldo, Post-doc
- Sandro Cumani, PhD student
- Ivano Dalmasso, Thesis student
LOQUENDO:
- Claudio Vair, Senior Researcher
- Daniele Colibro, Researcher
- Emanuele Dalmasso, Post-doc

Outline
Acoustic models
- Fast discriminative training of GMMs
- Language factors
Phonetic models
- 1-best tokenizers
- Lattice tokenizers
LRE09
- Incremental acquisition of segments for the development sets
Notes: This is the outline of my talk. I will illustrate the three main contributions of our work on the acoustic models of a LID system. First: to reduce inter-speaker variability within the same language, we showed at the last ICASSP that a significant performance improvement in LID can be obtained by performing speaker compensation in feature space using the Generalized Linear Discriminant Sequence (GLDS) kernel approach; here we use speaker compensation with Gaussian Mixture Models. Second: since GMMs in combination with linear Support Vector Machine classifiers have been shown to give excellent classification accuracy in speaker recognition, we apply this approach to LID and compare its performance with the standard GMM-based techniques. Third: in the GMM-SVM framework, a GMM is trained for each training or test utterance. Since it is difficult to accurately train a model on short utterances, in these conditions the standard GMMs perform better than the GMM-SVM models. To overcome this limitation, we present an extremely fast GMM discriminative training procedure that exploits the information given by the separation hyperplanes estimated by an SVM classifier.

Our technology progress
Inter-speaker compensation in feature space
- GLDS / SVM models (ICASSP 2007) - GMMs
SVM using GMM super-vectors (GMM-SVM)
- Introduced by MIT-LL for speaker recognition
Fast discriminative training of GMMs
- Alternative to MMIE
- Exploiting the GMM-SVM separation hyperplanes
- MIT discriminative GMMs
Language factors

Acoustic Language Identification
The task is similar to text-independent speaker recognition: Gaussian Mixture Models (GMMs), MAP-adapted from a Universal Background Model (UBM).
UBM --(MAP)--> Language GMM
Notes: For LID we use the same core technology that we use for speaker recognition: Gaussian Mixture Models in combination with Maximum A Posteriori adaptation. MAP adaptation is not strictly necessary in language recognition, because every language GMM can be robustly trained by Maximum Likelihood estimation. However, we perform MAP estimation from a UBM in LID as well, for three main reasons.
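For reference, a standard formulation of relevance-MAP mean adaptation (the slide does not spell out the update; the relevance factor r below is part of this sketch, not the original):

\[ \hat{\mu}_m = \frac{n_m}{n_m + r}\,\bar{x}_m + \frac{r}{n_m + r}\,\mu_m^{\mathrm{UBM}} \]

where n_m is the soft count of frames assigned to the m-th Gaussian and \bar{x}_m is their posterior-weighted mean.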

GMM super-vectors
Appending the mean values of all the Gaussians in a single stream, we get a super-vector. We use GMM super-vectors:
- without normalization, for inter-speaker/channel variation compensation;
- with Kullback-Leibler normalization, for training GMM-SVM models and for training discriminative GMMs.
Notes: In all our approaches we make use of supervectors.
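In symbols (a direct restatement of the slide), for an M-component GMM with mean vectors \mu_m the super-vector is the stack

\[ s = \left[ \mu_1^T, \mu_2^T, \ldots, \mu_M^T \right]^T . \]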

Using a UBM in LID
- The frame-based inter-speaker variation compensation approach estimates the inter-speaker compensation factors using the UBM.
- In the GMM-SVM approach, all language GMMs share the weights and variances of the UBM.
- The UBM is used for fast selection of Gaussians.
Notes: Our frame-based inter-speaker variation compensation approach computes its speaker factors using the UBM. Language models derived from a common UBM are required by our GMM-SVM approach. A side benefit of this choice is that it allows fast selection of the Gaussians both in training and in testing; thus, larger models can be trained discriminatively.

Speaker/channel compensation in feature space
U is a low-rank matrix (estimated offline) projecting the speaker/channel factor subspace into the supervector domain. x(i) is a low-dimensional vector, estimated using the UBM, holding the speaker/channel factors for the current utterance i. \gamma_m(t) is the occupation probability of the m-th Gaussian.
Notes: This slide summarizes our feature-domain compensation technique; each frame is compensated according to the formula below.
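The formula itself was a slide graphic; a reconstruction consistent with the definitions above (and with the authors' ICASSP 2007 feature-domain approach) is

\[ \hat{o}(t) = o(t) - \sum_{m=1}^{M} \gamma_m(t)\, U_m\, x(i) , \]

where o(t) is the feature vector at frame t and U_m denotes the rows of U associated with the m-th Gaussian.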

Estimating the U matrix
- Estimating the U matrix with a large set of differences between models generated using different utterances of the same speaker, we compensate the distortions due to inter-session variability → speaker recognition.
- Estimating the U matrix with a large set of differences between models generated using utterances of different speakers of the same language, we compensate the distortions due to inter-speaker/channel variability within the same language → language recognition.

GMM-SVM
A GMM model is trained for each utterance, both in training and in test. Each GMM is represented by a normalized GMM super-vector; the normalization is necessary to define a meaningful comparison between GMM supervectors.
Notes: I now summarize the GMM-SVM approach, focusing on the topics needed to understand our discriminative training approach. The normalization is necessary to define a comparison between GMM supervectors in a suitable Euclidean space.

Kullback-Leibler divergence
Two GMMs (i and j) can be compared using an approximation of the Kullback-Leibler divergence.
Notes: The GMM comparison can be performed using this approximation. The interesting property of this measure is that …
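The approximation appeared only as a graphic; the form commonly used in the GMM-SVM literature, consistent with the shared UBM weights w_m and covariances \Sigma_m mentioned earlier, is

\[ d^2(i,j) \approx \sum_{m=1}^{M} w_m\, (\mu_m^i - \mu_m^j)^T\, \Sigma_m^{-1}\, (\mu_m^i - \mu_m^j) . \]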

Kullback-Leibler normalization
Normalizing each supervector component according to the formula below:
- the normalized UBM supervector defines the origin of a new space;
- the KL divergence becomes a Euclidean distance;
- the SVM language models are created using a linear kernel in this KL space.
Notes: Since a translation does not alter the relative distances among the new means, the UBM mean can be dropped, and the supervector normalization term reduces to a simple scaling factor.
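The normalization formula was also a graphic; the mapping that turns the approximate divergence above into a Euclidean distance is

\[ \tilde{\mu}_m = \sqrt{w_m}\; \Sigma_m^{-1/2}\, \mu_m , \]

so that d^2(i,j) equals the squared Euclidean distance between the normalized supervectors.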

GMM-SVM weakness
GMM-SVM models perform very well with rather long test utterances, but it is difficult to estimate a robust GMM with a short test utterance. Idea: exploit the discriminative information given by the GMM-SVM for fast estimation of discriminative GMMs.
Notes: GMM-SVM models do not perform so well for short test utterances, because we cannot estimate a robust GMM from a short test utterance. Thus, to improve the performance on short sentences we have to rely on discriminative GMMs. Here comes the idea of avoiding time-expensive MMIE training: long sentences are available for training.

SVM discriminative directions
w: normal vector to the class-separation hyperplane.
Notes: In particular, we exploit the discriminative directions given by the vectors w estimated by the SVM in the KL space (w1, w2, w3 in the figure).

GMM discriminative training
Shift each Gaussian of a language model along its discriminative direction, given by the vector normal to the class-separation hyperplane in the KL space. (Figure: KL space on the left, acoustic feature space on the right; language GMMs, utterance GMMs, UBM.)
Notes: The red circles on the left side of the figure represent a two-dimensional projection of a set of utterance supervectors of language k mapped to the KL space. The green circles correspond to the utterance supervectors of one of the competitor languages, and the black circle is the UBM. The right side of the figure shows a two-dimensional acoustic feature space: the black ellipses represent two Gaussians of the UBM, and the red and green ellipses represent the corresponding Gaussians of two languages, the red ones referring to language k. In the acoustic feature space, w_k1 and w_k2 are the rescaled components of supervector w_k for the two Gaussians of the language-k GMM shown in the figure. The figure suggests that the Gaussians of a language k are moved away from the corresponding Gaussians of the other languages along different directions and with different shift sizes. These directions are the ones that optimize the discrimination of that language in the KL space, i.e. the directions that maximize the distance of the GMM of language k from its competitor GMMs.
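A minimal sketch of this mean-shift step, assuming the SVM normal w_k has already been reshaped per Gaussian and expressed in the KL-normalized space; the function name and interface are illustrative, and alpha_k is the shift size tuned on a development set as discussed next:

```python
import numpy as np

def push_gmm_means(means, covs, weights, w_k, alpha_k):
    """Shift each Gaussian mean of language k's GMM along its
    discriminative direction (sketch, not the authors' exact code).

    means   : (M, D) MAP-adapted means of the language GMM
    covs    : (M, D) shared diagonal covariances (from the UBM)
    weights : (M,)   shared mixture weights (from the UBM)
    w_k     : (M, D) SVM normal vector, one block per Gaussian,
              expressed in the KL-normalized space
    alpha_k : scalar shift size, tuned on a development set
    """
    # Undo the KL normalization sqrt(w_m) * Sigma_m^(-1/2) per Gaussian,
    # so the shift is applied in the original mean domain.
    scale = np.sqrt(weights)[:, None] / np.sqrt(covs)
    shift = w_k / scale
    return means + alpha_k * shift
```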

Rules for the selection of αk
- A discriminative GMM moves away from its original, MAP-adapted model, which best matches the training (and test) data.
- A large value of αk (shift size) yields a more discriminative model, but with worse likelihood than less discriminative models.
- Use a development set for estimating α.
Notes: The first rule derives from this observation. This is also true for the MMIE framework, which optimizes a discrimination function, not the likelihood of the model.

Experiments with 2048 GMMs
Pooled EER (%) of discriminative 2048-Gaussian GMMs (3s and 10s tests) and GMM-SVM (30s tests) on the NIST LRE tasks. In parentheses, the average of the per-language EERs.

Year | Discriminative GMMs, 3s | Discriminative GMMs, 10s | GMM-SVM, 30s
1996 | 11.71 (13.71) | 3.62 (4.92) | 1.01 (1.37)
2003 | 13.56 (14.40) | 5.50 (6.02) | 1.42 (1.64)
2005 | 16.94 (17.85) | 9.73 (11.07) | 4.67 (5.81)

256-MMI (Brno University, 2006 IEEE Odyssey), 2005: 17.1 | 8.6 | 4.6

Notes: To enable comparison with previously reported results, this table shows pooled EER scores and, in parentheses, the average of the per-language EERs. A gender-independent 2048-Gaussian GMM-SVM on the 30s tests, and gender-dependent discriminative GMMs on the shorter-duration tests, achieve performance comparable to the results reported by Brno University, which are among the best presented so far for this database for an acoustic-only system.

Pushed GMMs (MIT-LL)
Another method for creating discriminative models, based on the use of the support vectors identified by the SVM, has been proposed in [8]. Since a support vector corresponds to a GMM supervector, the location of the positive boundary of the SVM can be modeled by a weighted combination of the support vectors associated with positive Lagrange multipliers.

Language Factors
Eigenvoice modeling, and the use of speaker factors as input features to SVMs, has recently been demonstrated to give good results for speaker recognition compared to the standard GMM-SVM approach (Dehak et al., ICASSP 2009).
Analogy: estimate an eigen-language space, and use the language factors as input features to SVM classifiers (Castaldo et al., submitted to Interspeech 2009).

Language Factors: advantages
- Language factors are low-dimensional vectors.
- Training and evaluating SVMs with different kernels is easy and fast: it only requires the dot product of normalized language factors.
- Using a very large number of training examples is feasible.
- Small models give good performance.

Toward an eigen-language space
After compensating the nuisances of a GMM adapted from the UBM using a single utterance, residual information about the channel and the speaker remains. However, most of the undesired variation is removed, as demonstrated by the improvements obtained using this technique.

Speaker-compensated eigenvoices
First approach: estimating the principal directions of the GMM supervectors of all the training segments before inter-speaker nuisance compensation would produce a set of language-independent, “universal” eigenvoices. After nuisance removal, however, the speaker contribution to the principal components is reduced, to the benefit of language discrimination.

Eigen-language space
Second approach: computing the differences between the GMM supervectors obtained from utterances of a polyglot speaker would compensate the speaker characteristics and enhance the acoustic components of one language with respect to the others. Since we do not have labeled databases including polyglot speakers, we can instead compute and collect the differences between GMM supervectors produced by utterances of speakers of two different languages, irrespective of the speakers' identity, which should already have been compensated in the feature domain.

Eigen-language space
The number of such differences would grow with the square of the number of utterances in the training set. Instead, we perform Principal Component Analysis on the set of differences between the supervectors of a language and the average supervector of every other language.
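A compact sketch of this construction (function name, shapes and the use of scikit-learn are illustrative assumptions, not from the paper; the 13 retained directions echo the eigenvalue analysis below):

```python
import numpy as np
from sklearn.decomposition import PCA

def eigen_language_space(supervectors_by_lang, n_factors=13):
    """Estimate an eigen-language matrix from per-language supervectors.

    supervectors_by_lang: dict mapping language -> (N_l, D) array of
    KL-normalized, speaker-compensated GMM supervectors.
    Returns a PCA model whose components span the language subspace.
    """
    lang_means = {l: sv.mean(axis=0) for l, sv in supervectors_by_lang.items()}
    diffs = []
    for lang, sv in supervectors_by_lang.items():
        # Differences between each supervector of this language and the
        # average supervector of every *other* language: this grows only
        # linearly with the number of utterances, not quadratically.
        for other, mean_other in lang_means.items():
            if other != lang:
                diffs.append(sv - mean_other)
    diffs = np.vstack(diffs)
    return PCA(n_components=n_factors).fit(diffs)
```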

Training corpora
The same used for the LRE07 evaluation:
- all data of the 12 languages in the CallFriend corpus;
- half of the NIST LRE07 development corpus;
- half of the OHSU corpus provided by NIST for LRE05;
- RuSTeN, the Russian through Switched Telephone Network corpus (automatic segmentation).

Eigenvalues of two language subspaces
The language subspace has higher eigenvalues, and both curves show a sharp decrease for their first 13 eigenvalues, corresponding to the main language-discrimination directions, whereas the remaining eigenvalues decrease slowly, indicating that the corresponding directions still contribute to language discrimination, even if they probably also account for residual channel and speaker characteristics.

Language factor’s minDCF is always better and more stable LRE07 30s closed set test 10% worse a 100 rispetto a 600 Language factor’s minDCF is always better and more stable

Pushed GMMs (MIT-LL)

Pushed eigen-language GMMs
We apply the same approach to obtain discriminative GMMs from the language factors.

Min DCFs and (% EER)

Models | 30s | 10s | 3s
GMM-SVM (KL kernel) | 0.029 (3.43) | 0.085 (9.12) | 0.201 (21.3)
GMM-SVM (Identity kernel) | 0.031 (3.72) | 0.087 (9.51) | 0.200 (21.0)
LF-SVM (KL kernel) | 0.026 (3.13) | 0.083 (9.02) | 0.186 (20.4)
LF-SVM (Identity kernel) | (3.11) | (9.13) | 0.187
Discriminative GMMs | 0.021 (2.56) | 0.069 (7.49) | 0.174 (18.45)
LF-Discriminative GMMs (KL kernel) | 0.025 (2.97) | 0.084 (9.04) | (19.9)
LF-Discriminative GMMs (Identity kernel) | (3.05) | (9.05) | (20.0)

Notes: The results of our reference system were obtained using the KL kernel in the GMM-SVM approach, and are shown in the first row of the table. A first set of experiments evaluated the performance of the identity-kernel SVM classifiers as a function of the eigen-language modeling matrices.

Loquendo-Polito LRE09 System
Model training:
- Acoustic features → SVM-GMMs → pushed GMMs; MMIE GMMs
- Phonetic transcriber → n-gram counts → TFLLR SVM

Phonetic models
ASR recognizer with a phone-loop grammar with diphone transition constraints.
Output layer: 700-1000 states for the language-dependent phonetic units
- stationary units: 23-47
- diphone units

Phone transcribers
12 phone transcribers, using a phone-loop grammar with diphone transition constraints, for French, German, Greek, Italian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, and UK and US English. The statistics of the n-gram phone occurrences are collected from the best decoded string of each conversation segment.

Phone transcribers (ANN models)
The same phone-loop grammar with a different engine: 10 phone transcribers for Catalan, French, German, Greek, Italian, Polish, Portuguese, Russian, Spanish, Swedish, Turkish, and UK and US English. The statistics of the n-gram phone occurrences are collected from the expected counts over a lattice for each conversation segment.

Multigrams
Two different TFLLR kernels:
- trigrams
- pruned multigrams
Multigrams can provide useful information about the language by capturing “word parts” within the string sequences.
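A sketch of TFLLR (Term Frequency Log-Likelihood Ratio) weighting as commonly defined for phonotactic SVMs; the function and argument names are illustrative, not from the slides:

```python
import numpy as np

def tfllr_features(ngram_counts, background_probs):
    """Map raw n-gram counts to TFLLR-weighted SVM features (sketch).

    ngram_counts    : (V,) counts of each n-gram in one utterance
    background_probs: (V,) probability of each n-gram over all training data
    """
    rel_freq = ngram_counts / ngram_counts.sum()  # p(n-gram | utterance)
    # Rare n-grams get up-weighted by 1/sqrt(background probability),
    # so a linear kernel approximates a log-likelihood-ratio score.
    return rel_freq / np.sqrt(background_probs)
```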

Pruned multigrams
For each phonetic transcriber, we discard all the n-grams appearing in the training set less than 0.05% of the average occurrence of the unigrams.

Total number of n-grams for the 12 language transcribers:

N-gram order | 1 | 2 | 3 | 4 | 5 | 6
Pruned | 461 | 11477 | 120200 | 114396 | 10738 | 443
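The pruning rule as a sketch (a direct reading of the 0.05% threshold above; the data structure, n-grams keyed by tuples of phones, is an assumption):

```python
def prune_ngrams(counts_by_ngram):
    """Keep n-grams whose training-set count reaches at least 0.05% of
    the average unigram count (sketch of the pruning rule)."""
    unigram_counts = [c for ng, c in counts_by_ngram.items() if len(ng) == 1]
    threshold = 0.0005 * (sum(unigram_counts) / len(unigram_counts))
    return {ng: c for ng, c in counts_by_ngram.items() if c >= threshold}
```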

Scoring
The total number of models that we use for scoring an unknown segment is 34:
- 11 channel-dependent models (11 x 2 = 22 scores);
- 12 single-channel models (2 telephone-only and 10 broadcast-only models).
For the MMIE GMMs, 23 x 2 models are used (channel-independent, but gender-dependent, M/F).

Calibration and fusion
Multi-class FoCal.
Fused sub-systems: pushed GMMs, MMIE GMMs, 1-best 3-grams SVMs, 1-best n-grams SVMs, lattice n-grams SVMs (score-vector sizes 34, 46 and 23).
Pipeline: max of the channel-dependent scores → Gaussian back-end → LLR → lre_detection.

Language-pair recognition
For the language-pair evaluation, only the back-ends have been re-trained, keeping the models of all the sub-systems unchanged.

Telephone development corpora
- CALLFRIEND: conversations split into slices of 150s
- NIST 2003 and NIST 2005
- LRE07 development corpus
- Cantonese and Portuguese data in the 22-language OGI corpus
- RuSTeN: the Russian through Switched Telephone Network corpus

“Broadcast” development corpora
Incrementally created to include, as far as possible, the variability within a language due to channel, gender and speaker differences. The development data, further split into training, calibration and test subsets, should cover this variability.

Problems with LRE09 dev data
- Segments often come from the same speaker.
- Scarcity of segments for some languages after filtering same-speaker segments.
- Genders are not balanced.
- Excluding French, the segments of each language are either telephone-only or broadcast-only.
- No audited data were available for Hindi, Russian, Spanish and Urdu on VOA3; only an automatic segmentation was provided.
- No segmentation was provided in the first release of the development data for Cantonese, Korean, Mandarin, and Vietnamese.
- For these 8 missing languages, only the language hypotheses provided by BUT were available for the VOA2 data.

Additional “audited” data
For the 8 languages lacking broadcast data, segments have been generated by accessing the VOA site and looking for the original MP3 files. Goal: collect ~300 broadcast segments per language, processed to detect narrowband fragments. The candidate segments were checked to eliminate those containing music, bad channel distortions, or fragments of other languages.

Development data for bootstrap models
Telephone and audited/checked broadcast data, split into Training (50%), Development (25%) and Test (25%). The segments were distributed so that same-speaker segments fell in the same set. A set of acoustic (pushed-GMM) bootstrap models has been trained.
Notes: The telephone and the audited/checked broadcast data were evenly split into a training and a development set; the development set was further divided into two parts, one devoted to estimating the back-end parameters and the other to testing.

Additional non-audited data from VOA3
Preliminary tests with the bootstrap models indicated the need for additional data, selected from VOA3 so as to include new speakers in the training, calibration and test sets, assuming that the file labels correctly identify the corresponding language.

Speaker selection
Performed by means of a speaker recognizer; we process the audited segments before the others. A new speaker model is added to the current set of speaker models whenever the best recognition score obtained by a segment is below a threshold.
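A sketch of this open-set enrollment loop (the recognizer interface and the threshold value are placeholders, not the actual system):

```python
def select_speakers(segments, recognizer, threshold=0.0):
    """Incrementally build a set of speaker models (sketch).

    segments   : iterable of audio segments, audited ones first
    recognizer : object with score(model, segment) and enroll(segment)
    """
    speaker_models = []
    for seg in segments:
        scores = [recognizer.score(m, seg) for m in speaker_models]
        # If no existing speaker matches well enough, this segment
        # introduces a new speaker: enroll a model for it.
        if not scores or max(scores) < threshold:
            speaker_models.append(recognizer.enroll(seg))
    return speaker_models
```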

Additional non-audited data from VOA2
To enrich the training set, language recognition has been performed using a system combining the acoustic bootstrap models and a phonetic system. A segment has been selected only if the 1-best language hypothesis of our system:
- had an associated score greater than a given (rather high) threshold, and
- matched the 1-best hypothesis provided by the BUT system.
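The selection filter as a sketch (the system API and segment attributes are hypothetical placeholders):

```python
def select_voa2_segments(segments, our_system, but_hypotheses, threshold):
    """Keep only VOA2 segments where our 1-best language hypothesis is
    confident and agrees with BUT's 1-best hypothesis (sketch)."""
    selected = []
    for seg in segments:
        lang, score = our_system.best_hypothesis(seg)  # placeholder API
        if score > threshold and lang == but_hypotheses[seg.id]:
            selected.append(seg)
    return selected
```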

Total number of segments for this evaluation

Set | voa3_A | voa2_A | ftp_C | voa3_S | voa2_S | ftp_S
Train | 529 | 116 | 316 | 1955 | 590 | 66
Extended train | 114 | 22 | 65 | 2483 | 574 | 151
Development | 396 | 85 | 329 | 1866 | 449 | 45

Suffixes: A = audited, C = checked, S = automatic segmentation.
ftp: ftp://8475.ftp.storage.akadns.net/mp3/voa/

Hausa - Decision Cost Function (DCF)

Hindi - Decision Cost Function (DCF)

Results on the development set
Average minDCF x 100 on 30s test segments:

Test on | Pushed GMMs | MMIE | 3-grams | Multi-grams | Lattice | Fusion
Broadcast & telephone | 1.48 | 1.70 | 1.09 | 1.12 | 1.06 | 0.86
Broadcast subset | 1.54 | 1.69 | 1.24 | 1.26 | 1.14 | 0.91
Telephone | 2.00 | 2.51 | 1.45 | 1.49 | 1.42 | 1.21

Korean - score cumulative distribution (curves for the four train/test channel conditions: b-b, t-t, t-b, b-t)