Literature Survey: Multitask learning and system combination for automatic speech recognition
Olivier Siohan, David Rybach (Google)
Larger amount of training data, better performance?

Introduction
Google investigates the performance of an ensemble of convolutional, long short-term memory, deep neural network (CLDNN) acoustic models on an LVCSR task: Google Voice Search.
To reduce the computational complexity of running multiple recognizers in parallel: early system combination.
To reduce the computational load further: multitask learning.

Task & Databases (1/2)
The focus of this paper is to construct acoustic models specifically designed for recognizing children's speech, to support tasks such as the YouTube Kids mobile application.
Training set: 1.9M voice search (VS) utterances labeled as child speech, for a total of 2,100 hours of speech.
Test sets: a 25k-utterance VS set (adult) and a 16k-utterance VS set (child).

Task & Databases (2/2)
Acoustic features: 40-dim log mel-spaced filterbank coefficients, without any temporal context.
Language model: a child-friendly language model with 100M n-grams and a 4.6M-word vocabulary.
Acoustic model: CNN*1 + LSTM*2 + DNN*2 (the CLDNN).
Baseline: DNN*8 (2,560 units per layer).
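As a rough illustration of the model above, a minimal PyTorch sketch of a CLDNN stack (1 convolutional layer feeding 2 LSTM layers and 2 fully connected layers) over 40-dim filterbank frames. The frequency-only convolution, layer widths, and kernel sizes are illustrative assumptions, not values from the paper; only the layer counts and the 6k output size come from the slides.

    import torch
    import torch.nn as nn

    class CLDNN(nn.Module):
        """Minimal CLDNN sketch: 1 conv layer -> 2 LSTM layers -> 2 DNN layers.
        Layer widths and kernel sizes are illustrative assumptions."""
        def __init__(self, n_filterbanks=40, n_cd_states=6000):
            super().__init__()
            # Convolution over the 40 log mel filterbank channels (frequency axis).
            self.conv = nn.Conv1d(in_channels=1, out_channels=64, kernel_size=8)
            self.pool = nn.MaxPool1d(kernel_size=3)
            conv_out = 64 * ((n_filterbanks - 8 + 1) // 3)
            # Two stacked LSTM layers modeling the temporal context.
            self.lstm = nn.LSTM(input_size=conv_out, hidden_size=512,
                                num_layers=2, batch_first=True)
            # Two fully connected layers, ending in CD-state logits.
            self.dnn = nn.Sequential(
                nn.Linear(512, 1024), nn.ReLU(),
                nn.Linear(1024, n_cd_states))

        def forward(self, x):  # x: (batch, time, n_filterbanks)
            b, t, f = x.shape
            h = self.conv(x.reshape(b * t, 1, f))       # per-frame frequency conv
            h = self.pool(h).reshape(b, t, -1)          # back to (batch, time, feat)
            h, _ = self.lstm(h)
            return self.dnn(h)                          # (batch, time, n_cd_states)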

System Combination
System combination is often used in speech recognition as a post-recognition strategy to combine the outputs of multiple systems, e.g., via ROVER or confusion network combination.
While there is no theoretical guarantee that such an ensemble training strategy will lead to improved recognition accuracy, it has been shown to be effective on various tasks.
In addition, it will be seen that such a procedure fits well with the notion of multitask learning (MTL).
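To make the late-combination idea concrete, a toy ROVER-style voting sketch. It assumes the hypotheses have already been aligned into slots of equal length; real ROVER builds this alignment with dynamic programming and can weight votes by confidence scores.

    from collections import Counter

    def rover_vote(aligned_hyps):
        """Toy ROVER voting: majority vote per aligned slot.
        aligned_hyps: one word list per system, all of equal length,
        with "" marking an insertion/deletion slot."""
        combined = []
        for slot in zip(*aligned_hyps):
            word, _ = Counter(slot).most_common(1)[0]
            if word:  # drop slots where the majority vote is a deletion
                combined.append(word)
        return combined

    # Hypothetical example: three recognizers' aligned outputs.
    hyps = [["play", "frozen", "songs"],
            ["play", "frozen", "song"],
            ["play", "frozen", "songs"]]
    print(rover_vote(hyps))  # ['play', 'frozen', 'songs']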

Multitask Training (1/2)
In a first series of experiments, 2 DNNs were trained on 2 sets of 6k CD states (obtained via randomized decision trees), first with cross-entropy training and then with sequence training.
Each training utterance was then separately aligned using the 2 models, so that each training frame is labeled with a pair of CD states, one from each system.
Those multi-CD-state targets were then used to train a CLDNN model with cross-entropy in a multitask fashion.
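A minimal sketch of this multitask cross-entropy setup: one shared trunk with two softmax heads, each predicting one of the two 6k CD-state inventories, trained on the paired frame labels. The trunk is abbreviated to a single linear layer (the paper uses a CLDNN), and the hidden size is an assumption.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultitaskAcousticModel(nn.Module):
        """Shared trunk (abbreviated here; the paper uses a CLDNN)
        with one softmax output layer per CD-state inventory."""
        def __init__(self, feat_dim=40, hidden=512, inventories=(6000, 6000)):
            super().__init__()
            self.trunk = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
            self.heads = nn.ModuleList(nn.Linear(hidden, n) for n in inventories)

        def forward(self, x):
            h = self.trunk(x)
            return [head(h) for head in self.heads]

    def multitask_loss(logits, targets):
        # Each frame carries a pair of CD-state labels, one per inventory;
        # the multitask loss sums the per-head cross-entropies.
        return sum(F.cross_entropy(l, t) for l, t in zip(logits, targets))

    model = MultitaskAcousticModel()
    feats = torch.randn(32, 40)                    # a batch of frames
    targets = [torch.randint(0, 6000, (32,)),      # labels from tree #1
               torch.randint(0, 6000, (32,))]      # labels from tree #2
    loss = multitask_loss(model(feats), targets)
    loss.backward()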

Multitask Training (2/2)
To further evaluate the performance of multitask training, and for faster experimental turnaround, a slightly different training procedure was adopted: an existing 12k-state sequence-trained DNN model was used to align all the training data, and the state alignments were then relabeled based on various CD-state inventories.
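The relabeling step can be pictured as mapping each frame's 12k-state label through a leaf-mapping table for the target inventory. In this sketch the table is assumed as a given dict; in practice it would come from evaluating the target decision tree on each state's phonetic context.

    def relabel_alignment(frame_states, leaf_map):
        """Relabel a frame-level state alignment under a new CD-state
        inventory. frame_states: per-frame CD-state ids from the
        existing 12k-state aligner. leaf_map: dict mapping each
        12k-state id to the corresponding leaf of the target tree
        (assumed given here)."""
        return [leaf_map[s] for s in frame_states]

    # Hypothetical example: three frames remapped to a smaller inventory.
    leaf_map = {4301: 17, 4302: 17, 9850: 512}
    print(relabel_alignment([4301, 4302, 9850], leaf_map))  # [17, 17, 512]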

Experimental Results (1/3)
Table 1. Top: individual WERs (%) on the Child and Adult test sets for 3 systems with a 12k CD-state inventory and 2 systems with a 6k CD-state inventory. Bottom: ROVER between systems #1 to #3, #1 to #4, and #1 to #5.
Table 2. Number of CD states and HMM symbols for 4 individual systems, and the corresponding number of meta CD states and meta HMM symbols obtained by constructing a meta-C transducer combining systems (#1, #2), systems (#4, #5), and all 4 systems.
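Conceptually, a meta CD state in Table 2 is a tuple of the component systems' CD states that occur in the same phonetic context. A toy sketch of enumerating such an inventory; the real meta-C construction operates on the arcs of a context-dependency (C) transducer, and the contexts dict here is a hypothetical stand-in.

    def meta_cd_states(context_to_states):
        """Enumerate meta CD states as tuples of component CD states
        sharing the same phonetic context. Toy version of the meta-C
        idea; the actual construction encodes these tuples on C
        transducer arcs."""
        meta = set()
        for per_system_states in context_to_states.values():
            meta.add(tuple(per_system_states))  # one state id per system
        return meta

    # Hypothetical: each triphone context maps to one leaf per system.
    contexts = {("a", "b", "c"): (412, 3976),
                ("a", "b", "d"): (412, 3977),
                ("x", "b", "c"): (413, 3976)}
    print(len(meta_cd_states(contexts)))  # 3 meta CD states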

Experimental Results (2/3)
Table 3. Top: WERs (%) using a single decoding graph constructed from models #1 and #2, with acoustic costs from model #1 only, model #2 only, or the minimum cost between models #1 and #2. Middle: same as the top, but using models #4 and #5. Bottom: WER using a single decoding graph constructed from all 4 models, with the minimum cost among all models.
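The "minimum cost" condition in Table 3 takes, frame by frame, the cheaper acoustic cost of the two models for the meta CD state being scored. A sketch under the assumption that acoustic costs are negated log posteriors from each model's softmax:

    import numpy as np

    def min_cost(logpost_a, logpost_b, state_a, state_b):
        """Frame-level minimum-cost combination: score a meta CD state
        (state_a, state_b) by the cheaper of the two models' acoustic
        costs, where cost = -log posterior."""
        return min(-logpost_a[state_a], -logpost_b[state_b])

    # Hypothetical frame: model #1 is more confident about its state.
    lp1 = np.log(np.array([0.7, 0.2, 0.1]))   # model #1 posteriors (toy, 3 states)
    lp2 = np.log(np.array([0.3, 0.4, 0.3]))   # model #2 posteriors
    print(min_cost(lp1, lp2, state_a=0, state_b=1))  # -log(0.7) wins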

Experimental Results (3/3) - Multitask
Table 4. Multitask training of 2 distinct 6k CD-state inventories. Recognition uses a single decoding graph, selecting either the first softmax outputs, the second softmax outputs, or the minimum-cost combination during search.
Table 5. Multitask training for 12k/1k CD-state inventories. Recognition uses a single decoding graph, selecting the costs from either the first softmax, the second softmax, or the minimum-cost combination during search.
Table 6. Multitask training for 6k/6k/1k CD-state inventories. Recognition uses a single decoding graph, selecting the costs from either the first softmax, the second softmax, or the minimum-cost combination during search.

Conclusion
In this paper, Google showed that multiple CLDNN-based acoustic models, trained on distinct state inventories constructed using randomized decision trees, can outperform a single system when using either late or early system combination.
Google proposed constructing a specialized C transducer that encodes the multiple decision trees on each arc.
Unfortunately, Google did not observe any learning transfer between the different tasks: multitask training did not improve performance over single-task training, which the authors attribute to the large amount of training data used in their experiments.