
Acoustic and Textual Data Augmentation for Improved ASR of Code-Switching Speech
Emre Yılmaz, Henk van den Heuvel and David A. van Leeuwen
CLS/CLST, Radboud University, Nijmegen, Netherlands; Dept. of Electrical and Computer Engineering, National University of Singapore, Singapore

Abstract In this paper, we describe several techniques for improving the acoustic and language models of an automatic speech recognition (ASR) system operating on code-switching (CS) speech. We focus on the recognition of Frisian-Dutch radio broadcasts, where one of the mixed languages, namely Frisian, is under-resourced.

Introduction (1/2) One fundamental approach is to label speech frames with the spoken language and perform recognition of each language separately using a monolingual ASR system at the back-end. Such systems tend to suffer from error propagation between the language identification front-end and the ASR back-end, since language identification remains a challenging problem, especially in the case of intra-sentential CS. Our research in the FAME! project focuses on developing an all-in-one CS ASR system using a Frisian-Dutch bilingual acoustic and language model that allows language switches.
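To make the contrast concrete, below is a minimal sketch of the cascade architecture, assuming hypothetical stand-ins for the LID and monolingual ASR components; nothing here is a component of the FAME! system.

```python
# Minimal sketch of the cascade approach: an LID front-end routes each
# segment to a monolingual back-end. All helpers are hypothetical stubs
# that exist only so the sketch runs.

def identify_language(segment):      # hypothetical LID front-end
    return "fry"

def frisian_asr(segment):            # hypothetical Frisian recognizer
    return ["wurd"]

def dutch_asr(segment):              # hypothetical Dutch recognizer
    return ["woord"]

def cascade_decode(segments):
    """Route each segment to a monolingual recognizer based on LID."""
    hypotheses = []
    for segment in segments:
        lang = identify_language(segment)
        # An LID error here propagates: the segment is then decoded with
        # the wrong language's models, which an all-in-one bilingual
        # system avoids by design.
        words = frisian_asr(segment) if lang == "fry" else dutch_asr(segment)
        hypotheses.append((lang, words))
    return hypotheses

print(cascade_decode(["segment-1"]))
```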

Introduction (2/2) One of the bottlenecks in building a CS ASR system is the lack of training speech and text data for training reliable acoustic and language models that can accurately recognize both the uttered word and its language. The latter is particularly relevant for language pairs such as Frisian and Dutch, which have orthographic similarity and a shared vocabulary. For this purpose, we proposed automatic transcription strategies for CS speech in previous work to increase the amount of training speech data.

Frisian-Dutch Radio Broadcast Database (1/2) The bilingual FAME! speech database, which has been collected in the scope of the Frisian Audio Mining Enterprise (FAME!) project, contains radio broadcasts in Frisian and Dutch. This bilingual data contains Frisian-only and Dutch-only utterances as well as mixed utterances with inter-sentential, intra-sentential and intra-word CS. To be able to design an ASR system that can handle language switches, a representative subset of recordings has been extracted from this radio broadcast archive. These recordings include language switching cases and diverse speakers, and span a long time period (1966–2015).

Frisian-Dutch Radio Broadcast Database (2/2) The radio broadcast recordings have been manually annotated and cross-checked by two bilingual native Frisian speakers. The annotation protocol designed for this CS data includes three kinds of information: the orthographic transcription containing the uttered words; speaker details such as gender, dialect and name (if known); and spoken language information. Language switches are marked with the label of the switched language.
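As an illustration, one annotation record under this protocol might be represented as follows; the field names, language codes and example values are our own assumptions, not the actual FAME! annotation format.

```python
# Illustrative representation of one annotation record from the protocol
# above: orthographic transcription, speaker details, and language info,
# with switches marked by the label of the switched language.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Annotation:
    transcription: str            # orthographic transcription of the uttered words
    speaker_gender: str           # speaker details
    speaker_dialect: str
    speaker_name: Optional[str]   # only recorded if known
    language: str                 # spoken language, e.g. "fry" or "nld"
    switches: list = field(default_factory=list)  # labels of switched languages

example = Annotation(
    transcription="hy wennet yn Amsterdam",
    speaker_gender="m",
    speaker_dialect="Klaaifrysk",
    speaker_name=None,
    language="fry",
    switches=["nld"],  # a switch into Dutch, marked with its language label
)
```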

Acoustic Modeling (1/3) In previous work, we described several automatic annotation approaches to enable the use of a large amount of raw bilingual broadcast data for acoustic model training in a semi-supervised setting. For this purpose, we performed various tasks such as speaker diarization, language and speaker recognition and LM rescoring on the raw broadcast data for automatic speaker and language tagging, and later used this data for acoustic model training together with the manually annotated (reference) data.
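A high-level sketch of this pipeline is shown below; each component is replaced by a hypothetical stub, since diarization, language/speaker tagging and ASR with LM rescoring are full systems in their own right.

```python
# High-level sketch of the semi-supervised annotation pipeline. Every
# helper is a hypothetical stand-in included only to show the order of
# operations, not a component of the actual system.

def diarize(recording):               # hypothetical speaker diarization
    return [recording]                # pretend: one segment per recording

def tag_language(segment):            # hypothetical language recognition
    return "fry"

def tag_speaker(segment):             # hypothetical speaker recognition
    return "spk1"

def transcribe_and_rescore(segment):  # hypothetical ASR + LM rescoring
    return "automatyske transkripsje"

def auto_annotate(raw_recordings):
    """Turn raw broadcast audio into automatically tagged training samples."""
    training_data = []
    for rec in raw_recordings:
        for seg in diarize(rec):
            training_data.append({
                "audio": seg,
                "language": tag_language(seg),
                "speaker": tag_speaker(seg),
                "text": transcribe_and_rescore(seg),
            })
    # The automatically annotated data is then pooled with the manually
    # annotated (reference) data for acoustic model training.
    return training_data
```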

Acoustic Modeling (2/3) This work further focuses on the possible improvements in acoustic modeling that can be obtained using other datasets with a much larger amount of monolingual speech data from the high-resourced mixed language, which is Dutch in our scenario. Previously, adding even a portion of this Dutch data resulted in a severe loss of recognition accuracy on the low-resourced mixed language, due to the data imbalance between the mixed languages in the training data. Therefore, only a small portion of the available Dutch data could be used together with the Frisian data.
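The sketch below illustrates one simple way to enforce such a balance: cap the amount of Dutch data relative to the Frisian data before pooling. The 1:1 cap is an illustrative choice, not the ratio used in the experiments.

```python
# Sketch: cap the high-resourced (Dutch) data at a fixed multiple of the
# low-resourced (Frisian) data to limit the training-data imbalance.
import random

def cap_high_resource(frisian_utts, dutch_utts, max_ratio=1.0, seed=0):
    """Keep at most max_ratio * len(frisian_utts) Dutch utterances."""
    budget = int(max_ratio * len(frisian_utts))
    rng = random.Random(seed)
    kept = rng.sample(dutch_utts, min(budget, len(dutch_utts)))
    return frisian_utts + kept

# Example: 100 Frisian vs. 10000 Dutch utterances -> 200 utterances kept.
pooled = cap_high_resource(list(range(100)), list(range(10000)))
print(len(pooled))
```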

Acoustic Modeling (3/3) Given that the Dutch spoken in the target CS context is characterized by the West Frisian accent, we further include speech data from a language variety of Dutch, namely Flemish, to investigate its contribution towards the accent-robustness of the final acoustic model.

Language Modeling (1/3) Creating text of a similar nature to the available limited amount of CS text data is one straightforward way of remedying the imbalance in the bilingual training text corpora, and text generation using recurrent neural network (RNN) architectures is a common way to do this. For this purpose, we train an LSTM-based language model on the transcriptions of the training CS speech data from the FAME! corpus and generate CS text, to investigate whether including various amounts of generated CS text in the training text corpus reduces the perplexity on the transcriptions of the development and test speech data.
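Below is a minimal PyTorch sketch of this generation step, with a two-sentence toy corpus standing in for the FAME! transcriptions; the model size, training loop and sampling scheme are illustrative assumptions, not the configuration used in the paper.

```python
# Sketch: train a word-level LSTM-LM on a toy CS corpus, then sample new
# sentences from it to augment the LM training text.
import torch
import torch.nn as nn

corpus = ["<s> hy wennet yn amsterdam </s>",       # toy stand-in for the
          "<s> dat is hiel mooi </s>"]             # FAME! transcriptions
words = sorted({w for line in corpus for w in line.split()})
stoi = {w: i for i, w in enumerate(words)}
itos = {i: w for w, i in stoi.items()}

class LSTMLM(nn.Module):
    def __init__(self, vocab, emb=64, hid=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.out(h), state

model = LSTMLM(len(words))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for _ in range(200):                               # tiny toy training loop
    for line in corpus:
        ids = torch.tensor([[stoi[w] for w in line.split()]])
        logits, _ = model(ids[:, :-1])             # predict the next word
        loss = loss_fn(logits.squeeze(0), ids[0, 1:])
        opt.zero_grad(); loss.backward(); opt.step()

def sample(max_len=20):
    """Sample one sentence from the trained LM."""
    ids, state, out = torch.tensor([[stoi["<s>"]]]), None, []
    for _ in range(max_len):
        logits, state = model(ids, state)
        probs = torch.softmax(logits[0, -1], dim=-1)
        ids = torch.multinomial(probs, 1).unsqueeze(0)
        word = itos[ids.item()]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

print(sample())
```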

Language Modeling (2/3) Using machine-translated text is expected to improve the CS language model in two ways: (1) creating CS examples in the presence of proper nouns such as institution names, person names and place names in Dutch, and (2) generating CS word sequences extracted from a spoken corpus that is much larger than the FAME! corpus. In this work, we used an open-source web service for our Frisian-Dutch machine translation system, which uses language models different from the baseline bilingual model used in this work.
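A hedged sketch of the translation step follows; the endpoint URL, parameter names and response field are hypothetical placeholders, since the actual web service API is not described here.

```python
# Sketch: translate monolingual sentences with a Frisian-Dutch MT web
# service to obtain additional CS-like text. The URL, request parameters
# and response format below are hypothetical, not the real API.
import requests

MT_URL = "https://example.org/translate"  # placeholder endpoint

def translate(sentence, direction="nl-fy"):
    resp = requests.post(MT_URL,
                         data={"q": sentence, "langpair": direction},
                         timeout=10)
    resp.raise_for_status()
    return resp.json()["translation"]     # hypothetical response field
```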

Language Modeling (3/3) These automatic transcriptions are created using either a bilingual ASR system (bilingual strategies) or two monolingual ASR systems (monolingual strategies), depending on the preprocessing applied to the raw broadcast recordings. The corresponding bilingual and monolingual LMs are trained on the baseline bilingual text corpus. Given that these automatic transcriptions are the most likely word sequences hypothesized based on both the acoustic and language model scores, they potentially contain new word sequences with CS that are unknown to the baseline LM.
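One way to see this effect is to count word bigrams that occur in the automatic transcriptions but never in the baseline corpus, as in the toy sketch below.

```python
# Sketch: find bigrams present in the automatic transcriptions but unseen
# in the baseline text corpus; such novel (possibly code-switched) word
# sequences are what the enriched LM gains. Toy data for illustration.
def bigrams(sents):
    return {b for s in sents for b in zip(s, s[1:])}

baseline = [["hy", "wennet", "yn", "ljouwert"]]
automatic = [["hy", "wennet", "yn", "amsterdam"]]  # ASR hypothesis
print(bigrams(automatic) - bigrams(baseline))      # {('yn', 'amsterdam')}
```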

Databases (1/2)

Databases (2/2) The baseline language models are trained on a bilingual text corpus containing 37M Frisian and 8.8M Dutch words. Using the small amount of CS text available in the training transcriptions, we train an LSTM-LM and generate CS text with 10M, 25M, 50M and 75M words. The translated CS text contains 8.5M words. Finally, we use the automatic transcriptions provided by the best-performing monolingual and bilingual automatic transcription strategies, which contain 3M words in total.

Implementation Details (1/2) Kaldi ASR toolkit: a GMM-HMM system with 40k Gaussians, using 39-dimensional MFCC features including the deltas and delta-deltas, is used to obtain the alignments for training an LF-MMI TDNN-LSTM AM according to the standard recipe provided for the Switchboard database in the Kaldi toolkit (ver. 5.2.99). We use 40-dimensional MFCCs combined with i-vectors for speaker adaptation, and the default training parameters provided in the recipe without performing any parameter tuning. 3-fold data augmentation is applied to the training data.
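As an illustration of the 39-dimensional front-end (13 static MFCCs plus deltas and delta-deltas), here is a sketch using librosa as a stand-in for Kaldi's feature extraction; the frame settings are library defaults and not necessarily those of the recipe.

```python
# Sketch: compute 13 MFCCs and append deltas and delta-deltas to obtain
# the 39-dimensional feature vectors used for the GMM-HMM alignment stage.
import librosa
import numpy as np

def mfcc_39(wav_path):
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 static coeffs
    d1 = librosa.feature.delta(mfcc)                    # deltas
    d2 = librosa.feature.delta(mfcc, order=2)           # delta-deltas
    return np.vstack([mfcc, d1, d2]).T                  # (frames, 39)
```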

Implementation Details (2/2) The baseline language models are a standard bilingual 3-gram LM with interpolated Kneser-Ney smoothing and an RNN-LM with 400 hidden units, used for recognition and lattice rescoring respectively. The bilingual lexicon contains 110k Frisian and Dutch words. The number of entries in the lexicon is approximately 160k due to words with multiple phonetic transcriptions. The phonetic transcriptions of words that do not appear in the initial lexicons are learned by applying grapheme-to-phoneme (G2P) bootstrapping.
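For illustration, a 3-gram LM with interpolated Kneser-Ney smoothing can be built with NLTK as in the sketch below; NLTK and the toy sentences are stand-ins, not the toolkit or corpus used for the baseline LM.

```python
# Sketch: train an interpolated Kneser-Ney 3-gram LM on toy sentences and
# query a conditional probability.
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

sents = [["hy", "wennet", "yn", "amsterdam"],
         ["dat", "is", "hiel", "mooi"]]        # toy bilingual corpus
train, vocab = padded_everygram_pipeline(3, sents)
lm = KneserNeyInterpolated(3)
lm.fit(train, vocab)
print(lm.score("wennet", ["<s>", "hy"]))       # P(wennet | <s> hy)
```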

Results (1/2)

Results (2/2)

Conclusions In this work, we described several techniques to improve the acoustic and language modeling of a CS ASR system. Exploiting monolingual speech data from the high-resourced mixed language to improve the AM quality is found to be viable after increasing the amount of in-domain speech, for instance by performing automatic transcription of raw data resembling the target speech. Moreover, increasing the amount of CS text by text generation using recurrent LMs trained on a very small amount of reference CS text, and by automatic transcription with different transcription strategies, has provided enriched LMs that have significantly lower perplexities on the development and test transcriptions. These enriched LMs have also reduced the WER, especially on the mixed segments containing words from both languages.