STATE-OF-THE-ART SPEECH RECOGNITION WITH SEQUENCE-TO-SEQUENCE MODELS


STATE-OF-THE-ART SPEECH RECOGNITION WITH SEQUENCE-TO-SEQUENCE MODELS Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, Michiel Bacchiani Google, USA

ABSTRACT We explore a variety of structural and optimization improvements to our LAS model which significantly improve performance. On the structural side, we show that word piece models can be used instead of graphemes. We also introduce a multi-head attention architecture, which offers improvements over the commonly-used single-head attention. On the optimization side, we explore synchronous training, scheduled sampling, label smoothing, and minimum word error rate optimization, which are all shown to improve accuracy.

OUTLINE
INTRODUCTION
Basic LAS Model
Structure Improvements: Word Piece Model, Multi-headed attention
Optimization improvements: Minimum Word Error Rate (MWER) Training, Scheduled Sampling, Asynchronous and Synchronous Training, Label smoothing
Second-Pass Rescoring
EXPERIMENTAL DETAILS
RESULTS
CONCLUSION

INTRODUCTION (1/4) Sequence-to-sequence models have been gaining in popularity in the automatic speech recognition (ASR) community as a way of folding the separate acoustic, pronunciation and language models (AM, PM, LM) of a conventional ASR system into a single neural network. A variety of sequence-to-sequence models have been explored in the literature, including the Recurrent Neural Network Transducer (RNNT), Listen, Attend and Spell (LAS), the Neural Transducer, Monotonic Alignments, and the Recurrent Neural Aligner (RNA).

INTRODUCTION (2/4) While these models have shown promising results, it has thus far not been clear whether such approaches would be practical enough to unseat the current state of the art: HMM-based neural network acoustic models combined with a separate PM and LM in a conventional system. Such sequence-to-sequence models are fully neural, with no finite state transducers, lexicon, or text normalization modules, and training them is simpler than training conventional ASR systems.

INTRODUCTION (3/4) To date, however, none of these models has been able to outperform a state-of-the-art ASR system on a large vocabulary continuous speech recognition (LVCSR) task. The goal of this paper is to explore various structural and optimization improvements that allow sequence-to-sequence models to significantly outperform a conventional ASR system on a voice search task.

INTRODUCTION (4/4) Since previous work showed that LAS offered improvements over other sequence-to-sequence models, we focus on improvements to the LAS model in this work. The LAS model is a single neural network that includes an encoder, which is analogous to a conventional acoustic model; an attender, which acts as an alignment model; and a decoder, which is analogous to the language model in a conventional system.

Basic LAS Model (1/3) The listener encoder module, which is similar to a standard acoustic model, takes the input features, x, and maps them to a higher-level feature representation, henc.

Basic LAS Model (2/3) The output of the encoder is passed to an attender, which determines which encoder features in henc should be attended to in order to predict the next output symbol, yi, similar to a dynamic time warping (DTW) alignment module.

Basic LAS Model (3/3) The output of the attention module is passed to the speller (i.e., decoder), which takes the attention context, ci, generated by the attender, as well as an embedding of the previous prediction, yi-1, in order to produce a probability distribution over the current sub-word unit, yi, given the previous units and the input, x.
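
The listener/attender/speller data flow described above can be sketched roughly as follows in tf.keras. This is an illustration only, not the authors' implementation: the layer counts, layer sizes, vocabulary size, and feature dimension are placeholder assumptions.

```python
# Illustrative LAS-style data flow in tf.keras (not the authors' code).
import tensorflow as tf

num_wordpieces = 4096   # assumed sub-word vocabulary size
feat_dim = 80           # assumed acoustic feature dimension

# Listener (encoder): stacked LSTMs map input features x to h_enc.
x = tf.keras.Input(shape=(None, feat_dim), name="features")
h_enc = x
for _ in range(3):      # fewer layers than the paper's 5-layer encoder, for brevity
    h_enc = tf.keras.layers.LSTM(256, return_sequences=True)(h_enc)

# Speller (decoder): consumes embeddings of the previous predictions y_{i-1}.
y_prev = tf.keras.Input(shape=(None,), dtype="int32", name="y_prev")
emb = tf.keras.layers.Embedding(num_wordpieces, 256)(y_prev)
dec = tf.keras.layers.LSTM(256, return_sequences=True)(emb)

# Attender: aligns decoder state with h_enc to produce the context c_i
# (single-head dot-product attention, as in the basic LAS model).
c = tf.keras.layers.Attention()([dec, h_enc])

# Distribution over the current sub-word unit yi, given previous units and x.
logits = tf.keras.layers.Dense(num_wordpieces)(
    tf.keras.layers.Concatenate()([dec, c]))
model = tf.keras.Model([x, y_prev], logits)
model.summary()
```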

Structure Improvements

Word Piece Model (1/3) Traditionally, sequence-to-sequence models have used graphemes (characters) as output units, as this folds the AM, PM and LM into one neural network and side-steps the problem of out-of-vocabulary words. Alternatively, one could use longer units such as word pieces, or shorter units such as context-independent phonemes. One disadvantage of using phonemes is that they require an additional PM and LM; moreover, phonemes were not found to improve over graphemes in our experiments.

Word Piece Model (2/3) Typically, word-level LMs have a much lower perplexity compared to grapheme-level LMs. Thus, we feel that modeling word pieces allows for a much stronger decoder LM compared to graphemes. In addition, modeling longer units improves the effective memory of the decoder LSTMs, and allows the model to potentially memorize pronunciations for frequently occurring words. Furthermore, longer units require fewer decoding steps; this speeds up inference in these models significantly.

Word Piece Model (3/3) The word piece models used in this paper are sub-word units, ranging from graphemes all the way up to entire words. Thus, there are no out-of-vocabulary words with word piece models. The word piece models are trained to maximize the language model likelihood over the training set. The word pieces are “position-dependent”, in that a special word separator marker is used to denote word boundaries.
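
For illustration, the toy segmenter below greedily matches the longest word piece from a tiny hand-made inventory, using a "_" prefix (an assumption, standing in for the paper's word-separator marker) to make pieces position-dependent. The real inventory is learned by maximizing LM likelihood over the training set, which is not shown here.

```python
# Toy greedy word-piece segmenter (illustration only).
# The "_" word-start marker and the tiny vocabulary are assumptions.
VOCAB = {"_the", "_speech", "_recogni", "zer", "_recognizes"}

def wordpiece_segment(sentence, vocab, max_piece_len=12):
    pieces = []
    for word in sentence.lower().split():
        word = "_" + word                    # position-dependent: mark word start
        start = 0
        while start < len(word):
            # greedy longest-match against the piece inventory
            for end in range(min(len(word), start + max_piece_len), start, -1):
                if word[start:end] in vocab:
                    pieces.append(word[start:end])
                    start = end
                    break
            else:
                pieces.append(word[start])   # back off to single characters,
                start += 1                   # so nothing is out-of-vocabulary
    return pieces

print(wordpiece_segment("the recognizer recognizes speech", VOCAB))
# ['_the', '_recogni', 'zer', '_recognizes', '_speech']
```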

Multi-headed attention (1/2) MHA extends the conventional attention mechanism to have multiple heads, where each head can generate a different attention distribution. This allows each head to play a different role in attending to the encoder output, which we hypothesize makes it easier for the decoder to learn to retrieve information from the encoder.

Multi-headed attention (2/2) In the conventional, single-headed architecture, the model relies more on the encoder to provide clear signals about the utterance so that the decoder can pick up the information with attention. We observed that MHA tends to allocate one of the attention heads to the beginning of the utterance, which contains mostly background noise, and we therefore hypothesize that MHA better distinguishes speech from noise when the encoded representation is less ideal, for example in degraded acoustic conditions.
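
The small sketch below shows that each head yields its own attention distribution over the encoder frames; the tensor shapes and key dimension are placeholders, not the paper's configuration (only the 4 heads match the experiments).

```python
# Multi-head attention over encoder outputs (sketch; shapes are placeholders).
import tensorflow as tf

batch, enc_steps, dec_steps, dim = 2, 50, 7, 256
h_enc = tf.random.normal([batch, enc_steps, dim])      # encoder output
dec_state = tf.random.normal([batch, dec_steps, dim])  # decoder queries

mha = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=64)
context, attn_weights = mha(query=dec_state, value=h_enc,
                            return_attention_scores=True)

# Each of the 4 heads produces its own attention distribution over the
# encoder frames, so different heads can attend to different regions.
print(context.shape)       # (2, 7, 256)
print(attn_weights.shape)  # (2, 4, 7, 50): [batch, heads, dec_steps, enc_steps]
```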

Optimization improvements

Minimum Word Error Rate (MWER) Training (1/2) In the MWER strategy, our objective is to minimize the expected number of word errors, E[W(y, y*)], where W(y, y*) is the number of word errors in the hypothesis y compared to the ground-truth label sequence, y*. This expected-error term is interpolated with the standard cross-entropy based loss, which we find is important in order to stabilize training.

Minimum Word Error Rate (MWER) Training (2/2) We denote by NBest(x, N) = {y1, …, yN} the set of N-best hypotheses computed using beam-search decoding for the input utterance x. The expectation is then approximated over this list as the sum over yi in NBest(x, N) of P̂(yi | x) · [W(yi, y*) − Ŵ], where Ŵ is the average number of word errors over all hypotheses in the N-best list, and P̂(yi | x) represents the distribution re-normalized over just the N-best hypotheses.
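
The following is a small sketch of this N-best term, not the authors' code; the hypothesis log-probabilities and word-error counts are toy values, and the interpolation with the cross-entropy loss is only indicated in a comment.

```python
# N-best approximation of the MWER objective (sketch with toy values).
import tensorflow as tf

def mwer_loss(log_probs, word_errors):
    """log_probs:   [N] log P(yi | x) for the N-best hypotheses.
       word_errors: [N] W(yi, y*) for each hypothesis."""
    p_hat = tf.nn.softmax(log_probs)       # re-normalize over just the N-best list
    w_hat = tf.reduce_mean(word_errors)    # average word errors over the list
    return tf.reduce_sum(p_hat * (word_errors - w_hat))

log_probs = tf.constant([-3.2, -4.1, -5.0, -6.3])
word_errors = tf.constant([1.0, 0.0, 2.0, 3.0])
loss = mwer_loss(log_probs, word_errors)
# In practice this term is interpolated with the cross-entropy loss.
print(float(loss))
```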

Scheduled Sampling Our training process uses teacher forcing at the beginning of training; as training proceeds, we linearly ramp up the probability of sampling from the model's own predictions to 0.4 by a specified step, and then keep it constant until the end of training. This process helps reduce the gap between training and inference behavior.
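
A sketch of the linear ramp for the sampling probability is given below. The ramp boundaries are placeholder step counts; only the 0.4 ceiling comes from the text above.

```python
# Linear ramp of the scheduled-sampling probability (step counts are assumptions).
def sampling_probability(step, ramp_start=20_000, ramp_end=40_000, max_prob=0.4):
    """Probability of feeding back the model's own prediction instead of the
    ground-truth token at a given training step."""
    if step <= ramp_start:
        return 0.0                 # pure teacher forcing early in training
    if step >= ramp_end:
        return max_prob            # held constant until the end of training
    return max_prob * (step - ramp_start) / (ramp_end - ramp_start)

for step in (0, 25_000, 40_000, 100_000):
    print(step, sampling_probability(step))
```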

Asynchronous and Synchronous Training In asynchronous training we use replica ramp up: that is, the system does not start all training replicas at once, but instead starts them gradually. In synchronous training we use two techniques: learning rate ramp up and a gradient norm tracker. The learning rate ramp up starts the learning rate at 0 and gradually increases it, providing an effect similar to replica ramp up. The gradient norm tracker keeps track of the moving average of the gradient norm, and discards gradients with significantly higher variance than the moving average. Both approaches are crucial for making synchronous training stable.
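
A rough sketch of the two synchronous-training aids follows. The warm-up length, moving-average decay, and outlier threshold are assumptions for illustration, not the paper's values, and the outlier test here is a simple norm-ratio check standing in for the variance-based criterion described above.

```python
# Learning-rate ramp and gradient-norm tracker (sketch under assumed constants).
import tensorflow as tf

def ramped_learning_rate(step, peak_lr=1e-3, warmup_steps=10_000):
    """Linearly increase the learning rate from 0 up to peak_lr."""
    return peak_lr * min(1.0, step / warmup_steps)

class GradientNormTracker:
    """Tracks a moving average of the gradient norm and flags outliers."""
    def __init__(self, decay=0.99, max_ratio=4.0):
        self.decay, self.max_ratio = decay, max_ratio
        self.moving_avg = None

    def should_apply(self, grads):
        norm = float(tf.linalg.global_norm(grads))
        if self.moving_avg is None:          # first step: just initialize
            self.moving_avg = norm
            return True
        keep = norm < self.max_ratio * self.moving_avg
        if keep:                             # update only on well-behaved gradients
            self.moving_avg = (self.decay * self.moving_avg
                               + (1 - self.decay) * norm)
        return keep

tracker = GradientNormTracker()
toy_grads = [tf.random.normal([8, 8]), tf.random.normal([8])]
print(ramped_learning_rate(step=2_500), tracker.should_apply(toy_grads))
```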

Label smoothing Label smoothing is a regularization mechanism that prevents the model from making over-confident predictions. We followed the same design as prior work, smoothing the ground-truth label distribution with a uniform distribution over all labels.
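
A minimal sketch of this smoothing is shown below: the one-hot ground-truth distribution is mixed with a uniform distribution over all labels. The smoothing weight 0.1 is a placeholder, not necessarily the value used in the paper.

```python
# Label smoothing: mix one-hot targets with a uniform distribution (sketch).
import tensorflow as tf

def smoothed_targets(labels, num_classes, epsilon=0.1):
    one_hot = tf.one_hot(labels, num_classes)
    uniform = 1.0 / num_classes
    return (1.0 - epsilon) * one_hot + epsilon * uniform

labels = tf.constant([3, 0, 2])
print(smoothed_targets(labels, num_classes=5).numpy())
```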

Second-Pass Rescoring (1/2) To address the potentially weak LM learned by the decoder, we incorporate an external LM during inference only. The external LM is a large 5-gram LM trained on text data from a variety of domains. Since the domains have different predictive value for our LVCSR task, domain-specific LMs are first trained and then combined using Bayesian interpolation.

Second-Pass Rescoring (2/2) We incorporate the LM in the second pass by means of log-linear interpolation. In particular, given the N-best hypotheses produced by the LAS model via beam search, we determine the final transcript as y* = argmax over y in the N-best list of log P(y | x) + λ log PLM(y) + γ len(y), where PLM is provided by the external LM, len(y) is the number of words in y, and λ and γ are tuned on a development set. Using this criterion, transcripts which have a low language model probability will be demoted in the final ranked list.
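
A sketch of this log-linear rescoring over an N-best list follows. The hypotheses, scores, and the λ/γ values are made up for illustration; in the paper they are tuned on a development set.

```python
# Log-linear second-pass rescoring of an N-best list (sketch with toy scores).
def rescore(nbest, lam=0.3, gamma=0.5):
    """nbest: list of (transcript, log P(y|x) from LAS, log P_LM(y))."""
    def combined_score(item):
        y, log_p_las, log_p_lm = item
        return log_p_las + lam * log_p_lm + gamma * len(y.split())
    return max(nbest, key=combined_score)[0]

nbest = [("play some music", -4.2, -9.1),
         ("lay some music",  -4.0, -14.7)]
print(rescore(nbest))  # the hypothesis with very low LM probability is demoted
```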

EXPERIMENTAL DETAILS Our experiments are conducted on a 12,500 hour training set consisting of 15 million English utterances. The encoder network architecture consists of 5 long short-term memory (LSTM) layers. All multi-headed attention experiments use 4 heads. The decoder network is a 2-layer LSTM with 1,024 hidden units per layer. All neural networks are trained with the cross-entropy criterion (which is used to initialize MWER training) and are trained using TensorFlow.
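
For reference, the setup above restated as a config dict; only the fields mentioned on this slide are included, and everything else (features, optimizer, etc.) is left unspecified.

```python
# Experimental configuration summarized from the slide above (nothing added).
CONFIG = {
    "training_set": {"hours": 12_500, "utterances": 15_000_000,
                     "language": "English"},
    "encoder": {"type": "LSTM", "layers": 5},
    "attention": {"heads": 4},
    "decoder": {"type": "LSTM", "layers": 2, "units_per_layer": 1024},
    "training": {"criterion": "cross-entropy (used to initialize MWER training)",
                 "framework": "TensorFlow"},
}
```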

RESULTS

Structure Improvements

Optimization improvements

Incorporating Second-Pass Rescoring

Unidirectional vs. Bidirectional Encoders

Comparison with the Conventional System

CONCLUSION We designed an attention-based model for sequence-to-sequence speech recognition. The model integrates acoustic, pronunciation, and language models into a single neural network, and does not require a lexicon or a separate text normalization component. We explored various structural and optimization mechanisms for improving the model. We note, however, that the unidirectional LAS system has the limitation that the entire utterance must be seen by the encoder before any labels can be decoded (although the utterance is encoded in a streaming fashion). Therefore, an important next step is to revise this model into a streaming attention-based model, such as the Neural Transducer.