Language and Speaker Recognition for Semi-Supervised Bilingual Acoustic Model Training
Maties Machine Learning meeting, 09.03.2018
Emre Yılmaz
1 CLS/CLST, Radboud University, Nijmegen, Netherlands
2 STAR Lab, SRI International, Menlo Park, CA, USA

Frisian Audio Mining Enterprise: FAME!
Project goal: disclose the Omrop Fryslân (Frisian Broadcast) archives, which contain recordings from the 1950s to the present.

Focus of this talk
FAME! project: spoken document retrieval in a radio archive with code-switching (CS) Frisian-Dutch speech.
Unlike Dutch, Frisian is a low-resourced language with a limited amount of manually annotated speech data.
Semi-supervised training is common practice in monolingual scenarios. The contribution of this work is the extension of this idea to bilingual scenarios by automatically annotating raw data from the archives containing CS speech.
Relevant applications given the bilingual nature of the data:
- Bilingual automatic speech recognition (ASR)
- Speaker diarization/linking on longitudinal data
- Language diarization/recognition

Frisian language & FAME! CS speech corpus

Frisian Language & FAME! CS Speech Corpus
West Frisian is one of the three Frisian languages (together with East and North Frisian, both spoken in Germany). It has approximately half a million bilingual speakers, mostly living in the province of Fryslân in the northwest of the Netherlands.
The Frisian speech data has been collected from the archives of Omrop Fryslân.
The annotation protocol includes three kinds of information:
- Orthographic transcription
- Metadata such as dialect, speaker sex and name (if known)
- Spoken language information

CS and Speaker Content of the Reference Data
- The bilingual data contains Frisian-only and Dutch-only utterances as well as mixed utterances with inter-sentential, intra-sentential and intra-word CS.
- The database contains more than 10 hours of Frisian speech and 4 hours of Dutch speech. The total number of word- and sentence-level code-switching cases is 3837; 75.6% of all switches are from Frisian to Dutch.
- The content of the recordings is very diverse, including radio programs about culture, history, literature, sports, nature, agriculture, politics, society and languages.
- There are 334 identified and 120 unidentified speakers in the FAME! speech database. 51 identified speakers appear in at least 2 different years, mostly program presenters and celebrities. 10 speakers are labeled as speaking both languages.

Automatic Annotation of Raw Broadcast Data

Overview of Automatic Annotation Approaches

Front-end Applications

Front-end Applications: Speech Activity Detection (SAD)
- In all approaches, speech-only segments are extracted from a large amount of raw broadcast data using a robust speech activity detection system.
- The speech-only segments belonging to the same recording are merged to extract a single speech segment for each raw recording.
- A DNN-based SAD approach, detailed in (Graciarena, 2016), has been used in the experiments.
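
As a minimal sketch of the segment extraction step (assuming the DNN SAD outputs per-frame speech posteriors; the function and its interface are illustrative, not the actual OLIVE API):

```python
import numpy as np

def speech_segments(posteriors, threshold=0.5, frame_shift=0.01):
    # posteriors: hypothetical per-frame speech probabilities from a DNN SAD
    speech = np.asarray(posteriors) > threshold
    segments, start = [], None
    for i, is_speech in enumerate(speech):
        if is_speech and start is None:
            start = i                                    # segment opens
        elif not is_speech and start is not None:
            segments.append((start * frame_shift, i * frame_shift))
            start = None                                 # segment closes
    if start is not None:                                # recording ends in speech
        segments.append((start * frame_shift, len(speech) * frame_shift))
    return segments  # (start_sec, end_sec) pairs, concatenated per recording
```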

Front-end Applications: Speaker Diarization (SD)
- The speech-only segments are labeled with speaker IDs by an SD system.
- The intent of SD in the current context is to aid speaker-adaptive training, which has brought improved ASR performance in similar monolingual applications (Cerva, 2013).
- Errors from diarization are expected to have limited impact on ASR, since they will likely be due to similar-sounding speakers.
- An i-vector+PLDA-based SD system has been used, resembling the system described in (Sell, 2014).

Front-end Applications: Speaker Linking (SL)
- Linking the speaker labels assigned by the SD system is a straightforward step towards improving the quality of the speaker labels assigned to the raw data.
- For this purpose, we use a speaker identification system trained on a large amount of multilingual data to assign speaker similarity scores. Similarity scores are calculated for all possible speaker pairs.
- Speaker linking is performed by applying complete-linkage clustering, as described in (Ghaemmaghami, 2016).
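
A minimal sketch of the linking step with SciPy, assuming a precomputed matrix of pairwise speaker similarity scores (the distance conversion and threshold are illustrative; (Ghaemmaghami, 2016) describes the actual procedure):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def link_speakers(similarity, threshold):
    # similarity: symmetric matrix of speaker-pair scores, higher = more similar
    distance = similarity.max() - similarity        # convert to distances
    np.fill_diagonal(distance, 0.0)
    condensed = squareform(distance, checks=False)  # condensed form for linkage()
    tree = linkage(condensed, method='complete')    # complete-linkage clustering
    # cut the dendrogram: merge clusters only if their *worst* pair is close enough
    return fcluster(tree, t=threshold, criterion='distance')
```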

Front-end Applications: Language Recognition (LR)
- The speech segments are labeled with a language tag at two different stages, to investigate the impact of different pipelines on the automatic annotation quality.
- The first approach performs language labeling after assigning the speaker labels: under the assumption of monolingual speakers, the same-speaker segments are merged and labeled using a language recognition system.
- After assigning the labels, each utterance is automatically transcribed by the corresponding monolingual ASR system.
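
A sketch of this speaker-first routing, with the LR system and the per-language monolingual recognizers as hypothetical callables:

```python
from collections import defaultdict

def transcribe_by_speaker(segments, recognize_language, transcribe):
    # segments: (segment_id, speaker_id) pairs from the SD/SL stage;
    # recognize_language and transcribe stand in for the LR and ASR systems.
    by_speaker = defaultdict(list)
    for seg_id, spk in segments:
        by_speaker[spk].append(seg_id)
    transcripts = {}
    for spk, seg_ids in by_speaker.items():
        lang = recognize_language(seg_ids)   # one label per speaker
        for seg_id in seg_ids:               # monolingual-speaker assumption
            transcripts[seg_id] = transcribe(seg_id, lang)
    return transcripts
```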

Front-end Applications: Language Diarization (LD)
- The second approach performs language labeling right after the SAD system.
- Language scores are assigned to overlapping speech windows of N seconds, shifted by K seconds, with N > K.
- For each segment of K seconds, we apply majority voting among the language scores of all windows covering it to decide on the assigned language label.
- Each recording is segmented at the language switch instants, and each segment is recognized by the corresponding monolingual ASR system.
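
A minimal sketch of the sliding-window voting (score_fn stands in for the language recognizer; the N and K values are illustrative):

```python
from collections import Counter

def language_track(score_fn, duration, N=3.0, K=1.0):
    # Score overlapping N-second windows shifted by K seconds (N > K) ...
    windows, t = [], 0.0
    while t < duration:
        end = min(t + N, duration)
        windows.append((t, end, score_fn(t, end)))  # (start, end, language)
        t += K
    # ... then decide each K-second bin by majority vote over covering windows.
    labels = []
    for k in range(int(duration // K)):
        bin_start, bin_end = k * K, (k + 1) * K
        votes = [lang for ws, we, lang in windows
                 if ws < bin_end and we > bin_start]
        labels.append(Counter(votes).most_common(1)[0][0])
    return labels  # one language label per K-second bin
```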

Back-end Applications

Back-end Applications: Bilingual ASR System
- Two-stage training, using the target CS speech in the second stage to tune the DNN models.
- The most likely hypothesis output by the recognizer is used as the reference transcription.
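
The semi-supervised annotation loop underlying the pipelines, as a sketch (train and decode are hypothetical stand-ins for the Kaldi training and decoding recipes):

```python
def semi_supervised_bilingual_training(manual_data, raw_segments, train, decode):
    # Stage 1: seed bilingual model from the manually annotated data only.
    seed_model = train(manual_data)
    # Automatic annotation: keep the 1-best hypothesis as the reference.
    auto_data = [(seg, decode(seed_model, seg)) for seg in raw_segments]
    # Stage 2: retrain on manual + automatically transcribed data.
    return train(manual_data + auto_data)
```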

Back-end Applications: Monolingual ASR Systems
- In our case, the mixed languages are one low-resourced and one high-resourced language.
- The monolingual resources of the high-resourced language, Dutch in this case, provide better ASR performance than a bilingual ASR system.
- The multilingual training approach can also help with the recognition of the Frisian-only segments, which will likewise be better than a bilingual ASR system.
- Higher ASR accuracy implies better automatic transcriptions, given decent LR/LD performance.

Back-end Applications: Language Model (LM) Rescoring
- In previous work, the automatic transcriptions extracted with and without the rescoring stage gave similar results: rescoring was performed with a bilingual LM, which has a higher perplexity than the monolingual LMs.
- In this work, we include the rescoring stage, expecting more significant improvements in transcription quality by using monolingual LMs for rescoring.
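
In its simplest n-best form, rescoring could look like the sketch below (the actual system rescores Kaldi lattices; the interfaces and the weight are illustrative):

```python
def rescore_nbest(nbest, lm_logprob, lm_weight=0.7):
    # nbest: (hypothesis, first-pass score) pairs for one segment;
    # lm_logprob: log-probability under the monolingual LM matching the
    # segment's language tag.
    rescored = [(hyp, score + lm_weight * lm_logprob(hyp))
                for hyp, score in nbest]
    return max(rescored, key=lambda pair: pair[1])[0]  # best rescored hypothesis
```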

Experimental Setup

Experimental Setup
- ASR: Kaldi nnet models
- SAD, LR, SR, SD: OLIVE
- There are two baseline systems: an ASR trained only on the manually annotated data, and an ASR trained with the multilingual (ML) DNN approach, again only on the manually annotated data.
- The other ASR systems incorporate acoustic models trained on the combined (manually + automatically annotated) data.

Experimental Setup
- These systems are tested on the development and test data of the FAME! speech corpus.
- The word error rate (WER, %) results are reported separately for Frisian-only (fy), Dutch-only (nl) and mixed (fy-nl) segments. The overall performance (all) is also provided as a performance indicator.
- After the ASR experiments, we compare the CS detection performance of these recognizers using a time-based CS detection accuracy metric.
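
The exact metric definition is in the referenced papers; one plausible time-based reading, for intuition only, scores the fraction of time at which the hypothesized language track matches the reference:

```python
import numpy as np

def time_based_accuracy(ref_track, hyp_track, duration, step=0.01):
    # ref_track/hyp_track: lists of (start_sec, end_sec, language) tuples;
    # both tracks are sampled on a 10 ms grid and compared label-by-label.
    def label_at(track, t):
        for start, end, lang in track:
            if start <= t < end:
                return lang
        return None
    grid = np.arange(0.0, duration, step)
    hits = sum(label_at(ref_track, t) == label_at(hyp_track, t) for t in grid)
    return hits / len(grid)
```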

ASR Results

ASR Results – I: Speaker labeling w/o LR
- The first baseline system gives a total WER of 36.7%, while ML DNN training reduces the total WER to 34.7%.
- Adding only automatically transcribed data with pseudo speaker labels (sad-sd) reduces the total WER to 33.6%.

- Bilingual rescoring helps further, with a total WER of 33.1%.
- Acoustic models trained on the data with speaker linking and bilingual rescoring provide the best performance, with a total WER of 32.9% (henceforth the best-performing bilingual pipeline).

- The performance gains are obtained on the monolingual segments, namely (fy) and (nl); segments with switches (fy-nl) remain challenging.

ASR Results – II: Speaker labeling with LR
- Speaker labeling followed by language recognition yields worse results in general: the annotation accuracy in this setting depends strongly on the language homogeneity of the speaker-labeled segments.
- The monolingual-speaker assumption is also not realistic, given that there are 10 bilingual speakers in the reference data.

ASR Results – III: Language labeling after SAD
- When language labeling is performed first, the results improve: the ASR trained on sad-ld-sd-res has a total WER of 32.7%, the lowest total WER obtained so far (henceforth the best-performing monolingual pipeline).
- Speaker linking did not bring further improvements in this scenario.

ASR Results – IV: ML DNN training
- We pick the best-performing monolingual (sad-ld-sd-res) and bilingual (sad-sd-sl-res) pipelines and apply ML DNN training to investigate the effects on WER and CS detection.
- Marginal improvements are obtained compared to the systems trained only on the combined data, with the lowest total WER of 32.5% provided by ML DNN (sad-ld-sd-res).

CS Detection Results

CS Detection Results
- ML DNN training has an adverse impact on CS detection: including a third language in the training reduces the quality of the assigned language tags.
- The ASR trained only on the sad-ld-sd-res data gives the best CS detection, with an EER of 8.1% on the development data and 3.9% on the test data.

Employing Other Acoustic and Textual Data Resources

Some More Acoustic Data...
- Increased amount of CS speech training data: we can benefit from monolingual speech from the high-resourced language.
- Dutch and Flemish speech data from the Spoken Dutch Corpus: diverse speech material including conversations, interviews, lectures, debates, read speech and broadcast news.
- Another way of increasing the acoustic data: standard 3-fold data augmentation, creating two copies of the database at 0.9x and 1.1x speed (see the sketch below).
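
A stand-alone sketch of the 3-fold speed perturbation using the sox "speed" effect (Kaldi ships scripts for this step; the file layout here is illustrative):

```python
import subprocess
from pathlib import Path

def speed_perturb(wav_in, out_dir, factors=(0.9, 1.1)):
    # Writes one copy per speed factor; together with the original recording
    # this yields the standard 3-fold augmented training set.
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for factor in factors:
        wav_out = out_dir / f"sp{factor}-{Path(wav_in).name}"
        subprocess.run(["sox", str(wav_in), str(wav_out), "speed", str(factor)],
                       check=True)  # sox resamples: tempo and pitch both shift
```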

...and More Sophisticated Acoustic Models
With the increased amount of CS speech training data, we can also benefit from more sophisticated neural network architectures consisting of time-delay and recurrent layers, which were ineffective before.

What about the language model?
- Enrich the language model with more CS text, which is almost non-existent in practice.
- We look for ways of creating CS text, using the transcriptions of the training speech corpus (140k) as reference:
  - Text generation using an LSTM-LM trained on CS text (see the sketch below)
  - Using the automatically generated transcriptions
  - Machine translation to create CS text from the transcriptions of Dutch speech
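
A sampling sketch for the LSTM-LM route, in PyTorch (the model interface, vocabulary and temperature are illustrative, not the talk's actual code):

```python
import torch

def generate_cs_text(model, inv_vocab, seed_ids, n_words, temperature=1.0):
    # model: LSTM-LM mapping (token ids, hidden state) -> (next-word logits,
    # hidden state), trained on the CS transcriptions; inv_vocab: id -> word.
    model.eval()
    ids = list(seed_ids)
    inp = torch.tensor([ids])                    # (1, seed_len)
    hidden = None
    with torch.no_grad():
        for _ in range(n_words):
            logits, hidden = model(inp, hidden)  # logits: (1, T, |vocab|)
            probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
            next_id = torch.multinomial(probs, 1).item()  # sample, not argmax,
            ids.append(next_id)                           # to keep CS variety
            inp = torch.tensor([[next_id]])      # feed the sampled word back in
    return " ".join(inv_vocab[i] for i in ids)
```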

Using the new LM in the ASR system
- The new LM, enriched with automatically created text, helps further, reducing the total WER from 27.1% to 25.2%.
- Lattice rescoring improves the ASR performance further, to a total WER of 23.5%.
- This is significantly better than the 32.7% of the ASR trained only on the Frisian-Dutch speech data.

Conclusions
- We first use language and speaker recognition to automatically annotate raw broadcast data, increasing the amount of training data for an ASR system operating on CS speech with only a small amount of reference data.
- Several pipelines using different front-end and back-end applications have been described, and the automatically annotated data is merged with the reference data to train new acoustic models.
- The ASR and CS detection experiments have demonstrated the potential of automatic language and speaker tagging in semi-supervised bilingual acoustic model training.
- Finally, we used and generated further resources to increase the amount of speech and textual data for acoustic and language model training, which provided further improvements in ASR performance.

Relevant FAME! References
- E. Yılmaz, H. van den Heuvel and D. van Leeuwen, “Acoustic and Textual Data Augmentation for Improved ASR of Code-Switching Speech,” submitted to INTERSPEECH, Hyderabad, India, September 2018.
- E. Yılmaz, M. McLaren, H. van den Heuvel and D. van Leeuwen, “Semi-Supervised Bilingual Acoustic Model Training for Speech with Code-switching,” submitted to Speech Communication, 2018.
- E. Yılmaz, M. McLaren, H. van den Heuvel and D. van Leeuwen, “Language Diarization for Semi-Supervised Bilingual Acoustic Model Training,” in Proc. IEEE Automatic Speech Recognition and Understanding (ASRU), pp. 91-96, Okinawa, Japan, December 2017.
- E. Yılmaz, H. van den Heuvel and D. van Leeuwen, “Exploiting Untranscribed Broadcast Data for Improved Code-switching Detection,” in Proc. INTERSPEECH, pp. 42-46, Stockholm, Sweden, August 2017.
- E. Yılmaz, J. Dijkstra, H. Van de Velde, F. Kampstra, J. Algra, H. van den Heuvel and D. van Leeuwen, “Longitudinal Speaker Clustering and Verification Corpus with Code-switching Frisian-Dutch Speech,” in Proc. INTERSPEECH, pp. 37-41, Stockholm, Sweden, August 2017.
- E. Yılmaz, H. van den Heuvel and D. van Leeuwen, “Code-switching Detection Using Multilingual DNNs,” in IEEE Workshop on Spoken Language Technology (SLT), pp. 610-616, San Diego, CA, USA, December 2016.
- E. Yılmaz, H. van den Heuvel, J. Dijkstra, H. Van de Velde, F. Kampstra, J. Algra and D. van Leeuwen, “Open Source Speech and Language Resources for Frisian,” in Proc. INTERSPEECH, pp. 1536-1540, San Francisco, CA, USA, September 2016.
- E. Yılmaz, H. van den Heuvel and D. van Leeuwen, “Investigating Bilingual Deep Neural Networks for Automatic Recognition of Code-switching Frisian Speech,” Procedia Computer Science, vol. 81, pp. 159-166, May 2016.
- E. Yılmaz, M. Andringa, S. Kingma, J. Dijkstra, F. van der Kuip, H. Van de Velde, F. Kampstra, J. Algra, H. van den Heuvel and D. van Leeuwen, “A Longitudinal Bilingual Frisian-Dutch Radio Broadcast Database Designed for Code-switching Research,” in Proc. LREC, pp. 4666-4669, Portorož, Slovenia, May 2016.

Thank you!