1
Survey on state-of-the-art approaches: Neural Network Trends in Speech Recognition
Presented by Ming-Han Yang (楊明翰)
2
Outline
Speech Processing
◦ Neural Network Trends in Speech Recognition
◦ EXPLORING MULTIDIMENSIONAL LSTM FOR LARGE VOCABULARY ASR (Microsoft Corporation)
◦ END-TO-END ATTENTION-BASED LARGE VOCABULARY SPEECH RECOGNITION (Yoshua Bengio, Université de Montréal, Canada)
◦ DEEP CONVOLUTIONAL ACOUSTIC WORD EMBEDDINGS USING WORD-PAIR SIDE INFORMATION (Toyota Technological Institute at Chicago, United States)
◦ VERY DEEP MULTILINGUAL CONVOLUTIONAL NEURAL NETWORKS FOR LVCSR (IBM, United States; Yann LeCun, New York University, United States)
◦ LISTEN, ATTEND AND SPELL: A NEURAL NETWORK FOR LARGE VOCABULARY CONVERSATIONAL SPEECH RECOGNITION (Google Inc., United States)
◦ A DEEP SCATTERING SPECTRUM - DEEP SIAMESE NETWORK PIPELINE FOR UNSUPERVISED ACOUSTIC MODELING (Facebook A.I. Research, France)
4
Introduction
Long short-term memory (LSTM) recurrent neural networks (RNNs) have recently shown significant performance improvements over deep feed-forward neural networks. A key aspect of these models is the use of time recurrence, combined with a gating architecture that allows them to track the long-term dynamics of speech. Inspired by human spectrogram reading, we recently proposed the frequency LSTM (F-LSTM), which performs a 1-D recurrence over the frequency axis and then feeds its output to a 1-D recurrence over the time axis. In this study, we further improve the acoustic model by proposing a 2-D, time-frequency (TF) LSTM. The TF-LSTM jointly scans the input over the time and frequency axes to model spectro-temporal warping, and then uses the output activations as the input to a time LSTM (T-LSTM).
5
THE LSTM-RNN
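The diagram and equations on this slide did not survive extraction. For reference, the standard LSTM cell with input, forget, and output gates can be written as follows (a common textbook formulation without peephole connections; the paper's exact variant may differ):

\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}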
6
TF-LSTM processing
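The processing figure on this slide was lost in extraction. Below is a minimal numpy sketch of the 2-D scan described in the introduction: a toy LSTM cell whose recurrent input combines the states of its time predecessor (t-1, k) and frequency predecessor (t, k-1). The cell, the averaging combination rule, and all dimensions (`lstm_step`, `n_hidden`, etc.) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step; W maps [x; h_prev] to the 4 stacked gate pre-activations."""
    z = W @ np.concatenate([x, h_prev]) + b
    n = h_prev.size
    i, f, o = sigmoid(z[:n]), sigmoid(z[n:2*n]), sigmoid(z[2*n:3*n])
    g = np.tanh(z[3*n:])
    c = f * c_prev + i * g
    return np.tanh(c) * o, c

def tf_lstm_scan(spec, n_hidden, rng):
    """Jointly scan a (T, K, d) time-frequency grid of feature patches.

    The state entering cell (t, k) averages the states of its time and
    frequency predecessors (averaging is a simplifying assumption).
    """
    T, K, d = spec.shape
    W = rng.standard_normal((4 * n_hidden, d + n_hidden)) * 0.1
    b = np.zeros(4 * n_hidden)
    h = np.zeros((T + 1, K + 1, n_hidden))  # padded borders = zero state
    c = np.zeros_like(h)
    for t in range(1, T + 1):
        for k in range(1, K + 1):
            h_in = 0.5 * (h[t - 1, k] + h[t, k - 1])
            c_in = 0.5 * (c[t - 1, k] + c[t, k - 1])
            h[t, k], c[t, k] = lstm_step(spec[t - 1, k - 1], h_in, c_in, W, b)
    return h[1:, 1:]  # (T, K, n_hidden): activations fed to the T-LSTM

rng = np.random.default_rng(0)
out = tf_lstm_scan(rng.standard_normal((20, 8, 16)), n_hidden=32, rng=rng)
print(out.shape)  # (20, 8, 32)
```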
7
Corpora Description & Experiments
Microsoft Windows Phone short message dictation task
◦ Training data: 375 hours
◦ Test set: 125k words
Features
◦ 87-dimensional log-filter-bank features (29 dimensions × 3)
5976 tied-triphone states (senones)
DNN settings:
◦ 5 layers × 2048 units; splice = 5
LSTM settings:
◦ T-LSTM: 1024 memory cells per layer
◦ each layer followed by a 512-unit linear projection layer
◦ BPTT step = 20; output delay = 5 frames, etc.
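One concrete reading of "splice = 5" for the DNN input: each 87-dimensional frame is concatenated with ±5 neighboring frames, i.e. 11 × 87 = 957 input dimensions. A small sketch; padding the edges by repeating the first/last frame is a common convention and an assumption here:

```python
import numpy as np

def splice(frames, context=5):
    """Concatenate each frame with `context` frames on each side.

    frames: (T, D) feature matrix. Returns (T, (2*context+1)*D).
    """
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    T = frames.shape[0]
    return np.stack(
        [padded[t:t + 2 * context + 1].reshape(-1) for t in range(T)]
    )

feats = np.random.randn(100, 87)   # 87-dim log-filter-bank frames
print(splice(feats).shape)         # (100, 957) -> DNN input vectors
```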
9
Introduction
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters, without pronunciation models, HMMs, or other components of traditional speech recognizers. LAS consists of two sub-modules: the listener and the speller.
◦ The listener is an acoustic-model encoder that performs an operation called Listen.
◦ The speller is an attention-based character decoder that performs an operation we call AttendAndSpell.
10
Introduction (cont.)
11
Listen
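In the LAS paper, Listen is a pyramidal BLSTM (pBLSTM) encoder: each layer concatenates pairs of consecutive frames from the layer below before running its recurrence, halving the time resolution per layer (three such layers give an 8× reduction). A shape-only numpy sketch of the reduction step; the BLSTM wrapped around each step is omitted here:

```python
import numpy as np

def pyramid_reduce(h):
    """One pBLSTM reduction: concatenate consecutive frame pairs,
    halving sequence length and doubling feature width.
    h: (T, D); an odd trailing frame is dropped."""
    T, D = h.shape
    T -= T % 2
    return h[:T].reshape(T // 2, 2 * D)

x = np.random.randn(240, 40)   # e.g. 40-dim log-mel frames
for _ in range(3):             # 3 pyramid layers => 8x shorter sequence
    x = pyramid_reduce(x)      # in LAS a BLSTM runs after each reduction
print(x.shape)                 # (30, 320)
```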
12
Attend and Spell
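The core of AttendAndSpell is content-based attention: at each output step the decoder state is scored against every listener output, the scores are normalized with a softmax, and the context vector is the weighted sum of listener outputs. The paper compares the two through learned MLPs (φ and ψ); the plain dot-product scoring below is a simplified sketch:

```python
import numpy as np

def attend(s, H):
    """Score decoder state s against each listener output, softmax to
    weights alpha, return context vector c = sum_u alpha_u * h_u.
    s: (D,) decoder state; H: (U, D) listener outputs."""
    e = H @ s                    # scores e_u = <s, h_u>
    alpha = np.exp(e - e.max())  # numerically stable softmax
    alpha /= alpha.sum()
    return alpha @ H, alpha

H = np.random.randn(30, 320)     # pyramid encoder outputs
s = np.random.randn(320)         # current speller RNN state
c, alpha = attend(s, H)
print(c.shape, alpha.sum())      # (320,) ~1.0
```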
13
Attend and Spell (cont.)
14
Corpora Description & Experiments
Google Voice Search Task
◦ 2000 hours, 3 million utterances
◦ Test set: 16 hours
Features
◦ 40-dimensional log-mel filter bank
All utterances were padded with the start-of-sentence and end-of-sentence tokens.
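A minimal illustration of the transcript padding; representing the tokens as the strings "<sos>"/"<eos>" and treating transcripts as character lists are assumptions about preprocessing details:

```python
SOS, EOS = "<sos>", "<eos>"

def pad_transcript(text):
    """Wrap a character-level transcript with sentence boundary tokens."""
    return [SOS] + list(text) + [EOS]

print(pad_transcript("call mom"))
# ['<sos>', 'c', 'a', 'l', 'l', ' ', 'm', 'o', 'm', '<eos>']
```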
16
Introduction
Many state-of-the-art Large Vocabulary Continuous Speech Recognition (LVCSR) systems are hybrids of neural networks and Hidden Markov Models (HMMs). Recently, more direct end-to-end methods have been investigated, in which neural architectures are trained to model sequences of characters. To our knowledge, all of these approaches relied on Connectionist Temporal Classification (CTC) [3] modules. We start from the system proposed in [11] for phoneme recognition and make the following contributions (a pooling sketch follows this list):
◦ reduce the total training complexity from quadratic to linear
◦ introduce a recurrent architecture that successively reduces the source sequence length by pooling frames neighboring in time
◦ combine the character-level ARSG with an n-gram word-level language model via a WFST
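One reading of the second contribution: between stacked recurrent layers, neighboring frames are pooled so that each layer sees a shorter sequence, which also makes attention over the final layer cheaper. A toy numpy sketch; max-pooling with stride 2 is an assumption about the exact pooling used:

```python
import numpy as np

def pool_time(h, width=2):
    """Max-pool activations over `width` neighboring frames, reducing
    the source sequence length by that factor.
    h: (T, D); trailing frames that don't fill a window are dropped."""
    T, D = h.shape
    T -= T % width
    return h[:T].reshape(T // width, width, D).max(axis=1)

h = np.random.randn(500, 128)    # one BiRNN layer's outputs
print(pool_time(h).shape)        # (250, 128): input to the next layer
```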
17
Introduction (cont.)
20
Corpora Description & Experiments